Estimating Reproducibility of Psychology (No. 68): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article 

The article “Why People Are Reluctant to Tempt Fate” by Risen and Gilovich examined magical thinking in six experiments.  The evidence suggests that individuals are reluctant to tempt fate because it increases the accessibility of thoughts about negative outcomes. The article has been cited 58 times so far and it was cited 10 times in 2017, although the key finding failed to replicate in the OSC (Science, 2015) replication study.


Study 1

Study 1 demonstrated the basic phenomenon. 62 students read a scenario about a male student who applied to a prestigious university. His mother sent him a t-shirt with the logo of the university. In one condition, he decided to wear the t-shirt; in the other condition, he stuffed it in the bottom drawer. Participants rated how likely it was that the student would be accepted. Participants thought it was more likely that the student would be accepted if he did not wear the t-shirt (wearing it M = 5.19, SD = 1.35; stuffed away M = 6.13, SD = 1.02), t(60) = 3.01, p = .004, d = 0.78.
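
As a quick sanity check (my own computation, not part of the original article), the reported effect size is consistent with converting the t-value into d for a two-sample design with roughly equal cell sizes:

# R: rough conversion of a two-sample t-value into Cohen's d,
# assuming equal cell sizes (d is approximately 2 * t / sqrt(df)).
t.value <- 3.01
df      <- 60
2 * t.value / sqrt(df)   # ≈ 0.78, matching the reported d = 0.78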

Study 2

120 students participated in Study 2 (n = 30 per cell). Study 2 manipulated whether participants imagined themselves or somebody else in a scenario. The scenario was about the probability of a professor picking a student to answer a question.  The experimental factor was whether students had done the reading or not. Not having done the reading was considered tempting fate.

The ANOVA results showed a significant main effect for tempting fate (not prepared M = 3.43, SD = 2.34; prepared M = 2.53, SD = 2.24), F(1, 116) = 4.60, p = .034, d = 0.39.

Study 3

Study 3 examined whether tempting fate increases the accessibility of thoughts about negative outcomes with 211 students. Accessibility was measured with reaction times to two scenarios matching those from Studies 1 and 2. Participants had to indicate as quickly as possible whether the ending of a story matched the beginning of a story.

Analyses were carried out separately for each story. Participants were faster to judge that not getting into a prestigious university was a reasonable ending after reading that a student tempted fate by wearing a t-shirt with the university logo (wearing t-shirt M = 2,671 ms, SD = 1,113) than after reading that he stuffed the shirt in the drawer (M = 3,176 ms, SD = 1,573), F(1, 171) = 11.01, p = .001, d = 0.53.

The same result was obtained for judgments of tempting fate by not doing the readings for a class (not prepared M = 2,879 ms, SD = 1,149; prepared M = 3,112 ms, SD = 1,226), F(1, 184) = 7.50, p = .007, d = 0.26.

Study 4 

Study 4 aimed to test the mediation hypothesis. Notably, the sample size is much smaller than in Study 3 (N = 96 vs. N = 211).

The study used the university application scenario. For half the participants the decision was acceptance and for the other half it was rejection.

The reaction time ANOVA showed a significant interaction, F(1, 87) = 15.43.

As in Study 3, participants were faster to respond to a rejection after wearing the shirt than after not wearing it (wearing M = 3,196 ms, SD = 1,348; not wearing M = 4,324 ms, SD = 2,194), F(1, 41) = 9.13, p = .004, d = 0.93. Surprisingly, the effect size was twice as large as in Study 3.

The novel finding was that participants were faster to respond to an acceptance decision after not wearing the shirt than after wearing it (not wearing M = 2,995 ms, SD = 1,175;  wearing M = 3,551 ms, SD = 1,432),  F(1, 45) = 6.07, p = .018, d = 0.73.

Likelihood results also showed a significant interaction, F(1, 92) = 10.49, p = .002.

As in Study 2, in the rejection condition participants believed that a rejection was more likely after wearing the shirt than after putting it away (M = 5.79, SD = 1.53; M = 4.79, SD = 1.56), t(46) = 2.24, p = .030, d = 0.66.  In the new acceptance condition, participants thought that an acceptance was less likely after wearing the shirt than after putting it away (wore shirt M = 5.88, SD = 1.51;  did not wear shirt M = 6.83, SD = 1.31), t(46) = 2.35, p = .023, d = 0.69.  [The two p-values are surprisingly similar]

The mediation hypothesis was tested separately for the rejection and acceptance conditions. For the rejection condition, the Sobel test was significant, z = 1.96, p = .05. For the acceptance condition, the result was considered to be “supported by a marginally significant Sobel (1982) test, z = 1.91, p = .057.”  [It is unlikely that two independent statistical tests produce p-values of .05 and .057]

Study 5

Study 5 is the icing on the cake. It aimed to manipulate accessibility by means of a subliminal priming manipulation.  [This was 2008 when subliminal priming was considered a plausible procedure]

Participants were 111 students.

The main story was about a woman who either brought an umbrella or did not bring one (tempting fate) when the forecast predicted rain. The ending of the story was that it started to rain hard.

For the reaction times, the interaction between subliminal priming and the manipulation of tempting fate (the protagonist brought an umbrella or not) was significant, F(1, 85) = 5.89.

In the control condition with a nonsense prime, participants were faster to respond to the ending that it would rain if the protagonist did not bring an umbrella than when she did (no umbrella M = 2,694 ms, SD = 876; umbrella M = 3,957 ms, SD = 2,112), F(1, 43) = 15.45, p = .0003, d = 1.19. This finding conceptually replicated Studies 3 and 4.

In the priming condition, no significant effect of tempting fate was observed (no umbrella M = 2,749 ms, SD = 971, umbrella M = 2,770 ms, SD = 1,032).

For the likelihood judgments, the interaction was only marginally significant, F(1, 86) = 3.62, p = .06.

However, in the control condition with nonsense primes, the typical tempting-fate effect was significant (no umbrella M = 6.96, SD = 1.31; umbrella M = 6.15, SD = 1.46), t(44) = 2.00, p = .052 (reported as p = .05), d = 0.58.

The tempting-fate effect was not observed in the priming condition, when participants were subliminally primed with rain (no umbrella M = 7.11, SD = 1.56; umbrella M = 7.16, SD = 1.41).

As in Study 4, “the mediated relation was supported by a marginally significant Sobel (1982) test, z = 1.88, p = .06.”  It is unlikely to get p = .05, p = .06, and p = .06 in three independent mediation tests.

Study 6

Study 6 is the last study and the study that was chosen for the replication attempt.

122 students participated.  Study 6 used the scenario of being called on by a professor either prepared or not prepared (tempting fate).  The novel feature was a cognitive load manipulation.

The interaction between load manipulation and tempting fate manipulation was significant, F(1, 116) = 4.15, p = .044.

The no-load condition was a replication of Study 2 and replicated a significant effect of tempting fate (not prepared M = 2.93, SD = 2.16; prepared M = 1.90, SD = 1.42), t(58) = 2.19, p = .033, d = 0.58.

Under the load condition, the effect was even more pronounced (not prepared M = 5.27, SD = 2.36; prepared M = 2.70, SD = 2.17), t(58) = 4.38, p = .00005, d = 1.15.

A comparison of participants in the tempting fate condition showed a significant difference between the load and the no-load condition, t(58) = 3.99, p = .0002, d = 0.98.

Overall, the results suggest that some questionable research practices were used (e.g., mediation tests with p = .05, .06, and .06). The interaction effect with the load condition in Study 6 was also just significant and may not replicate. However, the main effect of the tempting-fate manipulation on likelihood judgments was obtained in all studies and might replicate.

Replication Study 

The replication study used an Mturk sample. The sample size was larger than in the original study (N = 226 vs. 122).

The load manipulation led to higher likelihood estimates of being called on, suggesting that the load manipulation was effective even with Mturk participants, F(1, 122) = 10.28.

However, the study did not replicate the interaction effect, F(1, 122) = 0.002.  More surprisingly, it also failed to show a main effect for the tempting-fate manipulation, F(1,122) = 0.50, p = .480.

One possible reason for the failure to replicate the tempting fate effect in this study could be the use of a school/university scenario (being called on by a professor) with Mturk participants who are older.

However, the results for the same scenario in the original article are not very strong.

In Study 2, the p-value was p = .034, and in the no-load condition of Study 6 the p-value was p = .033. Thus, neither the interaction with load nor the main effect of the tempting-fate manipulation is strongly supported in the original article.

Conclusion

Although it is never possible to show definitively that QRPs were used, it is possible that the use of QRPs in the original article explains the replication failure; other explanations are also possible. The most plausible alternative explanation would be the use of an Mturk sample. A replication study with a student sample or a replication of one of the other scenarios would be desirable.


Klaus Fiedler’s Response to the Replication Crisis: In/actions speak louder than words

Klaus Fiedler is a prominent experimental social psychologist. Aside from his empirical articles, Klaus Fiedler has contributed to meta-psychological articles. He is one of several authors of a highly cited article that suggested numerous improvements in response to the replication crisis: Recommendations for Increasing Replicability in Psychology (Asendorpf, Conner, De Fruyt, De Houwer, Denissen, K. Fiedler, S. Fiedler, Funder, Kliegl, Nosek, Perugini, Roberts, Schmitt, van Aken, Weber, & Wicherts, 2013).

The article makes several important contributions. First, it recognizes that success rates (p < .05) in psychology journals are too high (although a reference to Sterling, 1959, is missing). Second, it carefully distinguishes reproducibility, replicability, and generalizability. Third, it recognizes that future studies need to decrease sampling error to increase replicability. Fourth, it points out that reducing sampling error increases replicability because studies with less sampling error have more statistical power and reduce the risk of false negative results that often remain unpublished. The article also points out problems with articles that present results from multiple underpowered studies.

“It is commonly believed that one way to increase replicability is to present multiple studies. If an effect can be shown in different studies, even though each one may be underpowered, many readers, reviewers, and editors conclude that it is robust and replicable. Schimmack (2012), however, has noted that the opposite can be true. A study with low power is, by definition, unlikely to obtain a significant result with a given effect size.” (p. 111)

If we assume that co-authorship implies knowledge of the content of an article, we can infer that Klaus Fiedler was aware of the problem of multiple-study articles in 2013. It is therefore disconcerting to see that Klaus Fiedler is the senior author of an article published in 2014 that illustrates the problem of multiple study articles (T. Krüger,  K. Fiedler, Koch, & Alves, 2014).

I came across this article in a response by Jens Forster to a failed replication of Study 1 in Forster, Liberman, and Kuschel (2008). Forster cites the Krüger et al. (2014) article as evidence that their findings have been replicated, in order to discredit the failed replication in the Open Science Collaboration replication project (Science, 2015). However, a bias analysis suggests that Krüger et al.’s five studies had low power and a surprisingly high success rate of 100%.

Study      N     Test            p.val   z      OP
Study 1    44    t(41) = 2.79    0.009   2.61   0.74
Study 2    80    t(78) = 2.81    0.006   2.73   0.78
Study 3    65    t(63) = 2.06    0.044   2.02   0.52
Study 4    66    t(64) = 2.30    0.025   2.25   0.61
Study 5    170   t(168) = 2.23   0.027   2.21   0.60

z = -qnorm(p.val/2);  OP (observed power) = pnorm(z, 1.96)
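
For readers who want to verify the table, here is a minimal R sketch of these computations (my own reconstruction; small discrepancies relative to the table are due to rounding of the reported p-values):

# Convert two-sided p-values into z-scores and compute observed power,
# i.e., the probability of obtaining p < .05 (z > 1.96) again if the
# observed z-score were the true population value.
p.val <- c(0.009, 0.006, 0.044, 0.025, 0.027)   # Studies 1-5
z     <- -qnorm(p.val / 2)                      # 2.61 2.75 2.01 2.24 2.21
OP    <- pnorm(z, mean = 1.96)                  # 0.74 0.78 0.52 0.61 0.60
median(OP)                                      # about .61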

Median observed power is only 61%, but the success rate (p < .05) is 100%. Using the incredibility index from Schimmack (2012), we find that the binomial probability of obtaining at least one non-significant result with median power of 61% is 92%.  Thus, the absence of non-significant results in the set of five studies is unlikely.
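
The incredibility computation itself is a one-liner in R (again my own check, not code from the original analysis):

# Probability of at least one non-significant result in five independent
# studies that each have 61% power.
1 - 0.61^5   # ≈ 0.92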

As Klaus Fiedler was aware of the incredibility index by the time this article was published, the authors could have computed the incredibility of their results before publishing them (as Micky Inzlicht blogged, “check yourself, before you wreck yourself”).

Meanwhile, other bias tests have been developed. The Test of Insufficient Variance (TIVA) compares the observed variance of p-values converted into z-scores to the expected variance of independent z-scores, which is 1. The observed variance is much smaller, var(z) = 0.089, and the probability of obtaining such small variation or less by chance is p = .014. Thus, TIVA corroborates the conclusion from the incredibility index that the reported results are too good to be true.
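
Here is a minimal R sketch of TIVA for the five z-scores above (my reconstruction; the exact values depend on how the p-values were rounded):

# TIVA: under honest reporting, independent z-scores have a variance of 1.
# The observed variance is compared to a chi-square distribution.
z     <- c(2.61, 2.73, 2.02, 2.25, 2.21)
var.z <- var(z)                                      # ≈ 0.09
pchisq((length(z) - 1) * var.z, df = length(z) - 1)  # ≈ .014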

Another new method is z-curve. Z-curve fits a model to the density distribution of significant z-scores.  The aim is not to show bias, but to estimate the true average power after correcting for bias.  The figure shows that the point estimate of 53% is high, but the 95%CI ranges from 5% (all 5 significant results are false positives) to 100% (all 5 results are perfectly replicable).  In other words, the data provide no empirical evidence despite five significant results.  The reason is that selection bias introduces uncertainty about the true values and the data are too weak to reduce this uncertainty.

[Figure Fiedler4: z-curve plot of the five significant results in Krüger et al. (2014)]

The plot also shows visually how unlikely the pile of z-scores between 2 and 2.8 is. Given normal sampling error there should be some non-significant results and some highly significant (p < .005, z > 2.8) results.

In conclusion, Krüger et al.’s multiple-study article cannot be used by Forster et al. as evidence that their findings have been replicated with credible evidence by independent researchers because the article contains no empirical evidence.

The evidence of low power in a multiple study article also shows a dissociation between Klaus Fiedler’s  verbal endorsement of the need to improve replicability as co-author of the Asendorpf et al. article and his actions as author of an incredible multiple-study article.

There is little excuse for the use of small samples in Krüger et al.’s set of five studies. Participants in all five studies were recruited from Mturk and it would have been easy to conduct more powerful and credible tests of the key hypotheses in the article. Whether these tests would have supported the predictions or not remains an open question.

Automated Analysis of Time Trends

It is very time-consuming to carefully analyze individual articles. However, it is possible to use automated extraction of test statistics to examine time trends. I extracted test statistics from social psychology articles that included Klaus Fiedler as an author. All test statistics were converted into absolute z-scores as a common metric of the strength of evidence against the null-hypothesis. Because only significant results can be used as empirical support for predictions of an effect, I limited the analysis to significant results (z > 1.96). I computed the median z-score for each year and plotted these medians as a function of publication year.
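
The conversion itself is straightforward; the sketch below shows the kind of R helper functions that can be used (the function names are mine and not part of the actual extraction pipeline):

# Convert reported test statistics into absolute z-scores via their
# two-sided p-values.
t_to_z <- function(t, df) qnorm(1 - pt(abs(t), df, lower.tail = FALSE))
F_to_z <- function(Fval, df1, df2) qnorm(1 - pf(Fval, df1, df2, lower.tail = FALSE) / 2)
t_to_z(2.79, 41)       # ≈ 2.65 (Krüger et al., Study 1)
F_to_z(4.60, 1, 116)   # ≈ 2.12 (e.g., the F-test in Risen & Gilovich, Study 2)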

[Figure Fiedler.png: median significant z-score in Klaus Fiedler’s social psychology articles, plotted by publication year]

The plot shows a slight increase in strength of evidence (annual increase = 0.009 standard deviations), which is not statistically significant, t(16) = 0.30.  Visual inspection shows no notable increase after 2011 when the replication crisis started or 2013 when Klaus Fiedler co-authored the article on ways to improve psychological science.

Given the lack of evidence for improvement,  I collapsed the data across years to examine the general replicability of Klaus Fiedler’s work.

[Figure Fiedler2.png: z-curve analysis of significant results extracted from Klaus Fiedler’s articles]

The estimate of 73% replicability suggests that a randomly drawn published result from one of Klaus Fiedler’s articles has a 73% chance of being replicated if the study and analysis were repeated exactly. The 95%CI ranges from 68% to 77%, showing relatively high precision in this estimate. This is a respectable estimate that is consistent with the overall average of psychology and higher than the average of social psychology (Replicability Rankings). The average for some social psychologists can be below 50%.

Despite this somewhat positive result, the graph also shows clear evidence of publication bias. The vertical red line at 1.96 indicates the boundary for significant results on the right and non-significant results on the left. Values between 1.65 and 1.96 are often published as marginally significant (p < .10) and interpreted as weak support for a hypothesis. Thus, the reporting of these results is not an indication of honest reporting of non-significant results.  Given the distribution of significant results, we would expect more (grey line) non-significant results than are actually reported.  The aim of reforms such as those recommended by Fiedler himself in the 2013 article is to reduce the bias in favor of significant results.

There is also clear evidence of heterogeneity in strength of evidence across studies. This is reflected in the average power estimates for different segments of z-scores.  Average power for z-scores between 2 and 2.5 is estimated to be only 45%, which also implies that after bias-correction the corresponding p-values are no longer significant because 50% power corresponds to p = .05.  Even z-scores between 2.5 and 3 average only 53% power.  All of the z-scores from the 2014 article are in the range between 2 and 2.8 (p < .05 & p > .005).  These results are unlikely to replicate.  However, other results show strong evidence and are likely to replicate. In fact, a study by Klaus Fiedler was successfully replicated in the OSC replication project.  This was a cognitive study with a within-subject design and a z-score of 3.54.

The next Figure shows the model fit for models with a fixed percentage of false positive results.

[Figure Fiedler3.png: model fit for z-curve models with a fixed percentage of false positive results]

Model fit starts to deteriorate notably with false positive rates of 40% or more.  This suggests that the majority of published results by Klaus Fiedler are true positives. However, selection for significance can inflate effect size estimates. Thus, observed effect sizes estimates should be adjusted.

Conclusion

In conclusion, it is easier to talk about improving replicability in psychological science, particularly experimental social psychology, than to actually implement good practices. Even prominent researchers like Klaus Fiedler have responsibilities to their students to publish as much as possible.  As long as reputation is measured in terms of number of publications and citations, this will not change.

Fortunately, it is now possible to quantify replicability and to use these measures to reward research that requires more resources to provide replicable and credible evidence without the use of questionable research practices. Based on these metrics, the article by Krüger et al. is not the norm for publications by Klaus Fiedler, and Klaus Fiedler’s replicability index of 73 is higher than the index of other prominent experimental social psychologists.

An easy way to improve it further would be to retract the weak T. Krüger et al. article. This would not be a costly retraction because the article has not been cited in Web of Science so far (no harm, no foul). In contrast, the Asendorpf et al. (2013) article has been cited 245 times and is Klaus Fiedler’s second most cited article in Web of Science.

The message is clear.  Psychology is not in the year 2010 anymore. The replicability revolution is changing psychology as we speak.  Before 2010, the norm was to treat all published significant results as credible evidence and nobody asked how stars were able to report predicted results in hundreds of studies. Those days are over. Nobody can look at a series of p-values of .02, .03, .049, .01, and .05 and be impressed by this string of statistically significant results.  Time to change the saying “publish or perish” to “publish real results or perish.”

 

Estimating Reproducibility of Psychology (No. 64): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

Article 64, “The Effect of Global Versus Local Processing Styles on Assimilation Versus Contrast in Social Judgment,” is no ordinary article. The first author, Jens Forster, has been under investigation for scientific misconduct and it is not clear whether the published results in some articles are based on real or fabricated data. Some articles that build on the same theory and used similar methods as this article have been retracted. Scientific fraud would be one reason why an original study cannot be replicated.

Summary of Original Article

The article uses the first author’s global/local processing style model (GLOMO) to examine assimilation and contrast effects in social judgment. The article reports five experiments showing that processing styles elicited in one task can carry over to other tasks and influence social judgments.

Study 1 

This study was chosen for the replication project.

Participants were 88 students. Processing styles were manipulated by projecting a city map on a screen and asking participants either (a) to focus on the broader shape of the city or (b) to focus on specific details on the map. The study also included a control condition. This task was followed by a scrambled sentence task with neutral or aggressive words. The main dependent variable was aggression ratings in a person perception task.

With 88 participants and six conditions, there are only about 13 to 15 participants per condition.

The ANOVA results showed a highly significant interaction between the processing style and priming manipulations, F(2, 76) = 21.57, p < .0001.

We can think about the 2 x 3 design as three priming experiments, one for each of the three processing style conditions.

The global condition shows a strong assimilation effect (prime M = 6.53, SD = 1.21; no prime M = 4.15, SD = 1.25), t(26) = 5.10, p = .000007, d = 1.94.

In the control processing condition, priming also shows an assimilation effect: aggression ratings were higher after aggression priming (M = 5.63, SD = 1.25) than after nonaggression priming (M = 4.29, SD = 1.23), t(25) = 2.79, p = .007, d = 1.08.

The local processing condition shows a significant contrast effect: aggression ratings were lower after aggression priming (M = 2.86, SD = 1.15) than after nonaggression priming (M = 4.62, SD = 1.16), t(25) = 3.96, p = .0005, d = -1.52.

Although the reported results appear to provide strong evidence, the extremely large effect sizes raise concern about the reported results.  After all, these are not the first studies that have examined priming effects on person perception.  The novel contribution was to demonstrate that these effects change (are moderated) as a function of processing styles.  What is surprising is that processing styles also appear to have magnified the typical effects without any theoretical explanation for this magnification.

The article was cited by Isbell, Rovenpor, and Lair (2016) because they used the map manipulation in combination with a mood manipulation. Their article reports a significant interaction between processing and mood, F(1, 73) = 6.33, p = .014. In the global condition, participants in the angry mood condition listed more abstract statements in an open-ended task, but the effect was not significant and much smaller than in Forster’s studies, F(1, 73) = 3.21, p = .077, d = .55. In the local condition, sad participants listed more abstract statements, but again the effect was not significant and smaller than in Forster et al.’s studies, F(1, 73) = 3.20, p = .078, d = .67. As noted before, these results are also questionable because it is unlikely to get p = .077 and p = .078 in two independent statistical tests.

In conclusion, the effect sizes reported by Forster et al. in Study 1 are unbelievable because they are much larger than could be expected.

Study 2

Study 2 was a replication and extension of a study by Mussweiler and Strack (2000). Participants were 124 students from the same population. This study used a standard processing style manipulation (Navon, 1977) that presented global letters composed of several smaller, different letters (e.g., a large letter E made up of several small letters n). The main dependent variable was judgments of drug use. The design had two between-subjects factors: 3 (processing styles) x 2 (high vs. low comparison standard). Thus, there were about 20 to 21 participants per condition. The study also had a within-subject factor (subjective vs. objective rating).

The ANOVA shows a 3-way interaction, F(2, 118) = 5.51, p = .005.

Once more, the 3 x 2 design can be treated as 3 independent studies of comparison standards. Because subjective and objective ratings are not independent, I focus on the objective ratings that produced stronger effects.

In the global condition, the high standard produced higher reports of drug use than the low standard (M = 0.66, SD = 1.13 vs. M = -0.47, SD = 0.57), t(39) = 4.04, p = .0004, d = 1.26.

In the control condition, a similar pattern was observed but it was not significant (M = 0.07, SD = 0.79 vs. M = -0.45, SD = 0.98), t(39) = 1.87, p = .07, d = 0.58.

In the local condition, the pattern is reversed (M = -0.41, SD = 0.83 vs. M = 0.60, SD = 0.99), t(39) = 3.54, p = .001, d = -1.11.

As the basic paradigm was a replication of Mussweiler and Strack’s (2000) Study 4, it is possible to compare the effect sizes in this study with the effect size in the original study.   The effect size in the original study was d = .31; 95%CI = -0.24, 1.01.  The effect is not significant, but the interaction effect for objective and subjective judgments was, F(1,30) = 4.49, p = .04.  The effect size is comparable to the control condition, but the  effect sizes for the global and local processing conditions are unusually large.

Study 3

132 students from the same population took part in Study 3.  This study was another replication and extension of Mussweiler and Strack (2000).  In this study, participants made ratings of their athletic abilities.  The extension was to add a manipulation of time (imagine being in an athletic competition today or in one year).  The design was a 3 (temporal distance: distant future vs. near future vs. control) by 2 (high vs. low standard) BS design with objective vs. subjective ratings as a within factor.

The three-way interaction was significant, F(2, 120) = 4.51, p = .013.

In the distant future condition, objective ratings were higher with the high standard than with the low standard (high M = 0.56, SD = 1.04; low M = -0.58, SD = .51), t(41) = 4.56, p = .0001, d = 1.39.

In the control condition,  objective ratings of athletic ability were higher after the high standard than after the low standard (high M = 0.36, SD = 1.08; low M = -0.36, SD = 0.77), t(38) = 2.44, p = .02, d = 0.77.

In the near condition, the opposite pattern was reported (high M = -0.35, SD = 0.33, vs. low M = 0.36, SD = 1.29), t(41) = 2.53; p = .02,  d = -.75.

In the original study by Mussweiler and Strack the effect size was smaller and not significant (high M = 5.92, SD = 1.88; low M = 4.89, SD = 2.37),  t(34) =  1.44, p = .15, d = 0.48.

Once more the reported effect sizes by Forster et al. are surprisingly large.

Study 4

120 students from the same population participated in Study 4.  The main novel feature of Study 4 was the inclusion of a lexical decision task and the use of reaction times as the dependent variable.   It is important to realize that most of the variance in lexical decision tasks is random noise and fixed individual differences in reaction times.  This makes it difficult to observe large effects in between-subject comparisons and it is common to use within-subject designs to increase statistical power.  However, this study used a between-subject design.  The ANOVA showed the predicted four-way interaction, F(1,108) = 26.17.

The four-way interaction was explained by a 3-way interaction for self-primes, F(1, 108) = 39.65, and no significant effects with control primes.

For moderately high standards, reaction times to athletic words were slower after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

For unathletic words, the reverse pattern was observed.

For moderately high standards, reaction times were faster after local processing than after global processing (local M = 695, SD = 163, global M = 589, SD = 77), t(28) = 2.28, p = .031, d = 0.83.

For moderately low standards, reaction times to athletic words were faster after local processing than after global processing (local M = 516, SD = 61, global M = 643, SD = 172), t(28) = 2.70, p = .012, d = -0.98.

In sum, Study 4 reported reaction time differences as a function of global versus local processing styles that were surprisingly large.

Study 5

Participants in Study 5 were 128 students.  The main novel contribution of Study 5 was the inclusion of a line-bisection task that is supposed to measure asymmetries in brain activation.  The authors predicted that local processing induces more activation of the left-side of the brain and global processing induces more activation of the right side of the brain.  The comparisons of the local and global condition with the control condition showed the predicted mean differences, t(120) = 1.95, p = .053 (reported as p = .05) and t(120) = 2.60, p = .010.   Also as predicted, the line-bisection measure was a significant mediator, z = 2.24, p = .03.

The Replication Study 

The replication project called for replication of the last study, but the replication team in the US found that it was impossible to do so because the main outcome measure of Study 5 was alcohol consumption and drug use (just like Study 2) and pilot studies showed that incidence rates were much lower than in the German sample.  Therefore the authors replicated the aggression priming study of Study 1.

The focal test of the replication study was the interaction between processing condition and priming condition. As noted earlier, this interaction was very strong,  F(2, 76) = 21.57, p < .0001, and therefore seemingly easy to replicate.

Fortunately, the replication team dismissed the outcome of a post-hoc power analysis, which suggested that only 32 participants would be needed, and instead used the same sample size as the original study.

The processing manipulation was changed from a map of the German city of Oldenburg to a state map of South Carolina.  This map was provided by the original authors. The replication report emphasizes that “all changes were endorsed by the first author of the original study.”

The actual sample size was a bit smaller (N = 74) and after exclusion of 3 suspicious participants data analyses were based on 71 (vs. 80 in original study) participants.

The ANOVA failed to replicate a significant interaction effect, F(2, 65) = .865, p = .426.

The replication study also included questions about the effectiveness of the processing style manipulation.  Only 32 participants indicated that they followed instructions.  Thus, one possible explanation for the replication failure is that the replication study did not successfully manipulate processing styles. However, the original study did not include a similar question and it is not clear why participants in the original study were more compliant.

More troublesome is that the replication study did not replicate the simple priming effect in the control condition or the global condition, which should have produced the effect with or without a successful manipulation of processing styles.

In the control condition, the mean was lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.27, SD = 1.29, neutral M = 7.00, SD = 1.30), t(22) = 1.38, p = .179, d = -.56.

In the global condition, the mean was also lower in the aggression prime condition than in the neutral prime condition (aggression M = 6.38, SD = 1.75, neutral M = 7.23, SD = 1.46), t(22) = 1.29, p = .207, d = -.53.

In the local condition, the means were nearly identical (aggression M = 7.77, SD = 1.16, neutral M = 7.67, SD = 1.27), t(22) = 0.20, p = .842, d = .08.

The replication report points out that the priming task was introduced by Higgins, Rholes, and Jones (1977).   Careful reading of this article shows that the original article also did not show immediate effects of priming.  The study obtained target ratings immediately and 10 to 14 days later.  The ANOVA showed a significant interaction with time, F(1,36) = 4.04, p = .052 (reported as p < .05).

“A further analysis of the above Valence x Time interaction indicated that the difference in evaluation under positive and negative conditions was small and nonsignificant on the immediate measure (M = .8 and .3 under positive and negative conditions, respectively), t(38)= 0.72, p > .25 two-tailed; but was substantial and significant on the delayed measure.” (Higgins et al., 1977).

Conclusion

There are serious concerns about the strong effects in the original article by Forster et al. (2008).   Similar results have raised concerns about data collected by Jens Forster. Although investigations have yielded no clear answers about the research practices, some articles have been retracted (Retraction Watch).  Social priming effects have also proven to be difficult to replicate (R-Index).

The strong effects reported by Forster et al. are not the result of typical questionable research practices that result in just significant results.  Thus, statistical methods that predict replicability falsely predict that Forster’s results would be easy to replicate and only actual replication studies or forensic analysis of original data might be able to reveal that reported results are not trustworthy.  Thus, statistical predictions of replicability are likely to overestimate replicability because they do not detect all questionable practices or fraud.


Estimating Reproducibility of Psychology (No. 43): An open, post-publication review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

This post examines the reproducibility of article No. 43, “The rejection of moral rebels: Resenting those who do the right thing” by Monin, Sawyer, and Marquez (JPSP, 2008).

Abstract:

Four studies document the rejection of moral rebels. In Study 1, participants who made a counterattitudinal speech disliked a person who refused on principle to do so, but uninvolved observers preferred this rebel to an obedient other. In Study 2, participants taking part in a racist task disliked a rebel who refused to go along, but mere observers did not. This rejection was mediated by the perception that rebels would reject obedient participants (Study 3), but did not occur when participants described an important trait or value beforehand (Study 4). Together, these studies suggest that rebels are resented when their implicit reproach threatens the positive self-image of individuals who did not rebel.

The main conclusion of the article is that moral rebels are resented by those who hold the same moral values, but did not rebel.

Four social psychological experiments are used to provide empirical support for this hypothesis.  Thus, the original article already contains one original study and three replication studies.

Study 1

Study 1 used the induced compliance paradigm (Galinsky, Stone, & Cooper, 2000; Zanna & Cooper, 1974).  Presumably, earlier studies that used this paradigm showed that it is effective in inducing dissonance by having participants write an essay that goes against their own beliefs. Importantly, participants are not forced to do so, but merely comply to a request.

An analogy could be a request by an editor to remove a study with imperfect results from an article. The scientist has the option to reject this request because it violates her sense of scientific standards, but she may also comply to get a publication. This internal conflict is called cognitive dissonance.

After the experimental manipulation of cognitive dissonance, participants were asked to make personality ratings of another participant who refused to comply with the request (rebel) or who complied (obedient target). There was also a group of control participants without the dissonance induction. This creates a 2 x 2 between-subject (BS) design with induction (yes/no) and rebel target (yes/no) as predictor variables. The outcomes were personality and liking ratings of the targets.

There were 70 participants, but 10 were eliminated because they were suspicious.  So, there were 60 participants for 4 experimental conditions (average cell size n = 15).

The data analysis revealed a significant cross-over interaction, F(1, 56) = 11.00.   Follow-up analysis showed that observers preferred the rebel (M = 0.90, SD = 0.87) to the obedient target (M = 0.07, SD = 1.51), t(56) = 2.22, p = .03, d = 0.72.

Most important, actors preferred the obedient other (M = 0.59, SD = 0.70) to the rebel (M = -0.50, SD = 1.21), t(56) = 2.46, p = .02, d = 1.19.

Although this reported result appears to support the main hypothesis, the results of Study 1 raise concerns about the way they were obtained. The reason is that a 2 x 2 BS design essentially combines two experiments in one. One experiment is conducted without dissonance induction; the other experiment is conducted with dissonance induction. For each experiment, the data are analyzed separately, and the results showed p = .03 and p = .02. The problem is that it is unlikely to obtain two p-values that are so similar in two independent studies. We can quantify the probability of this event under the null-hypothesis that a researcher simply conducted two studies and found two similar p-values just by chance using the Test of Insufficient Variance (TIVA).

TIVA converts p-values into z-values. The sampling distribution of z-values has a variance of 1. So, we can compare the observed variance of the z-values corresponding to the p-values to an expected variance of 1. For p = .03 and .02, the observed variance is Var(z) = 0.025. With one degree of freedom, the probability of observing a variance of 0.025 or less is pchisq(0.025, 1) = .125. This means we would expect two just significant results in independent post-hoc tests in about 1 out of 8 attempts. The results in Study 1 are therefore unusual or surprising, although they could still be just a chance finding.
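
Here is a minimal R sketch of this computation (my reconstruction; to reproduce Var(z) = 0.025 the p-values have to be computed from the reported t-values rather than taken from the rounded values .03 and .02):

# Convert the two t-tests into z-scores and apply TIVA.
p     <- 2 * pt(c(2.22, 2.46), df = 56, lower.tail = FALSE)  # ≈ .031, .017
z     <- qnorm(1 - p / 2)
var.z <- var(z)              # ≈ 0.025
pchisq(var.z, df = 1)        # ≈ .125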

Study 2

Study 2 tested whether the results from Study 1 generalize to racism. Participants were asked to imagine that they are a detective and were given three suspects who might have committed a burglary. The information implicated the African American suspect, and all participants chose the African American target. For the rebel manipulation, participants were asked to make ratings of somebody who had made the same choice or of a rebel. The rebel refused to identify the African American suspect: “I refuse to make a choice here—this task is obviously biased. . . . Offensive to make black man the obvious suspect. I refuse to play this game.”

56 participants took part in the study, but some had to be excluded (e.g. picked a White suspect), leaving  49 participants for the data analysis (n = 12 per cell).

The interaction effect in the ANOVA was again significant, F(1, 45) = 4.38, p = .04.

The test of the key hypothesis showed the expected mean differences, but it was not significant (obedient M = 0.50, SD = 1.34; rebel M = -0.67, SD = 2.03), t(45) = 1.68, p = .10, d = 0.71. However, p-values in the range between .10 and .05 are often called marginally significant and interpreted as support for a hypothesis. Thus, Study 2 appears to provide weak support for the main claim of the article.

The comparison of rebels and obedient targets in the observer condition was also not significant (obedient target M = 0.12, SD = 1.71; rebel M = 0.98, SD = 1.64), t(45) = 1.27, p = .21, d = 0.53.

TIVA for all four p-values from Study 1 and Study 2 shows more variance, var(z) = 0.39, but with 3 df, it is even less likely to observe such similar p-values in four independent studies, p = .033.

Most important, the OSC replication study replicated the detective task and Study 2 failed to show a significant effect for this task.  Thus, obtaining a non-significant result in the OSC replication study is not entirely surprising because the original study also reported a non-significant result.

Study 3 

Study 3 is a replication and extension of Study 2. It replicated the detective task and it extended Study 2 by testing mediation.   The benefit of testing mediation was that the sample size increased to 132 participants (all male).

Study 3 modified four aspects of Study 2.

1. Study 2 asked observers to do the task themselves after rating the target. This serves no purpose and was dropped from Study 3.

2. Participants were explicitly informed that the target they were rating was a White male.

3. Third, the study used only male participants (not clear why).

4. Fourth, the study included additional questions to test mediation. These questions were asked after the ratings of the target and therefore also do not change the experiment from Study 2.

So the only difference was that participants were only males and they were told that they were rating the personality of a White male rebel or conformist.

Two participants expressed suspicion and 13 picked a White suspect, reducing the final sample size to N = 117.

The results were as expected. The interaction was significant, F(1, 113) = 5.58, p = .02. More important, the follow-up test showed that participants in the dissonance condition preferred the conformist  (M = 1.63, SD = 1.15) to a rebel (M = 0.53, SD = 2.27), t(113) = 2.33, p = .02, d = 0.61.  There was no significant difference for observers, t(113) = .98, p = .33, d = 0.27.

Even with the non-significant p-value of .33, the variance of the p-values (converted into z-scores) across the three studies remains unusually low, var(z) = 0.35, p = .049. It is also surprising that the much larger sample size in Study 3 did not produce stronger evidence for the main hypothesis.

Study 4

Study 4 is crucial because this is the study that the OSC project attempted to replicate. The focus is on Study 4 because the OSC sampling plan asked teams to focus on the last study. One problem with this sampling approach is that the last study may differ from the other studies in a multiple-study article.

Like Study 2, Study 4 used male (N = 52) and female (N = 27) participants.  The novel contribution of Study 4 was the addition of a third condition called affirmation.  The study did not include a control condition.  For the replication part of the study, 19 participants judged an obedient target, and 29 judged a rebel.

The results showed a significant interaction effect, F(2,64) = 10.17, p = .0001.  The difference in ratings of obedient and rebel targets was significant and large, t(48) = 3.24, p = .001, d = .96.   The difference was even larger in comparison to the self-affirmation condition, t(48) = 4.39, p = .00003, d = 1.30.

Replication Study

The replication study was carried out by Taylor Holubar. The authors used the strong results of Study 4 to conduct a power analysis and concluded that they needed only n = 18 participants per cell (N = 54 in total) to have 95% power to replicate the results of Study 4. This power analysis overlooks that the replication part of Study 4 produced larger effect sizes than the previous two studies. Even without this concern, it is questionable to use observed results to plan replication studies because observed effects are influenced by sampling error. A between-subject study should have a minimum of n = 20 participants per condition (Simmons et al., 2011). There is also no reason to reduce sample sizes when the replication study is conducted on Mturk, which makes it possible to recruit large samples quickly.

Another concern is that a replication study on Mturk may produce different results than a study with face to face contact between an experimenter and a participant.

The initial Mturk sample consisted of 117 participants. After exclusion of participants for various reasons, the final sample size was N = 75, higher than the power analysis suggested. Nevertheless, the study failed to replicate the significant ANOVA result of the original study, F(2, 72) = 1.97, p = .147. This result was used to classify the replication attempt as a failure.

However, the comparison of the obedient and rebel conditions showed the difference that was observed in the original article, and the effect size was similar to the effect sizes in Studies 2 and 3 (obedient M = 0.98, SD = 1.20; rebel M = 0.27, SD = 1.72), t(48) = 1.69, p = .097.

The result falls short of the criterion for statistical significance, but the problem is that the replication study had low power: its power analysis relied on the unusually large effect size from Study 4.

Response to Replication Attempt

Monin wrote a response to the replication failure.  Monin pointed out that he was not consulted and never approved of the replication design.  Monin also points out that consultation would have been easy because the replication author and he were both at Stanford.

Monin expresses serious concerns about the self-affirmation manipulation in the replication study: “The methods differed in important ways from the original lab study (starting with transferring it online), yet the replicators describe their methods as ‘virtually identical to the original’ … The self affirmation manipulation was changed from an 8-minute-in-lab-paper-and-pencil essay to a short online question.”

Given this criticism, it seems problematic to consider the failure to produce a self-affirmation effect as crucial for a successful replication. The key finding in the article was that moral rebels are rated less favorably by individuals in the same situation who comply. While the replication study failed to show a significant effect for this test as well, this was partially due to the reliance on the unusually strong effect size in Study 4 to plan the sample size of the replication study. At least it has to be noted that the replication study did not have 95% power as the authors assumed.

Prediction of Replication Outcome

The strong result in Study 4 alone leads to the prediction of a successful replication outcome (Replicability Index = 0.74; values > .50 mean that success is more likely than failure). However, taking the results of Studies 2 and 3 into account leads to the prediction that an exact replication with a similar sample size would not be successful.

            Obs.Power   Success   Inflation   R-Index
Study 4     0.87        1.00      0.13        0.74
Study 3     0.63        1.00      0.37        0.26
Study 2     0.50*       1.00*     0.50        0.00
Combined    0.63        1.00      0.37        0.26

* Using p < .10 as the criterion because the marginally significant result was treated as a success.
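
A minimal R sketch of the R-Index computation in the table above (my reconstruction; the combined row appears to use the median observed power):

# R-Index = observed power - inflation, where inflation = success rate -
# observed power (equivalently, 2 * observed power - success rate).
obs.power <- c(0.87, 0.63, 0.50)                          # Studies 4, 3, 2
success   <- c(1.00, 1.00, 1.00)
obs.power - (success - obs.power)                         # 0.74 0.26 0.00
median(obs.power) - (mean(success) - median(obs.power))   # combined: 0.26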

Conclusion

Neither the original results nor the replication study are flawless. The original article reported results that are unlikely without the use of some questionable research practices that produce just significant results. The replication study failed to replicate the effect, but a somewhat larger sample might have produced a significant result. It would not be surprising if another replication study with about n = 100 per cell (N = 200; roughly 80% power with d = .4) produced a significant result.
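
A quick check with R's built-in power function shows the sample size such a replication would need under these assumptions (two-sample t-test, two-tailed, alpha = .05):

power.t.test(delta = 0.4, sd = 1, power = 0.80)   # n ≈ 99 per cell
power.t.test(n = 50, delta = 0.4, sd = 1)         # power ≈ .51 with n = 50 per cell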

At the same time, the key hypothesis of the article remains to be demonstrated.  Moreover, the results are limited to a single paradigm with a hypothetical decision in a detective game. Hopefully, future studies can learn from this discussion of the original and replication study to plan better studies that can produce more conclusive evidence.

Another interesting question could be how moral rebels evaluate targets who are obedient or rebels. My prediction is that rebels will show a preference for rebels and a dislike of obedient individuals.

Shit Social Psychologists Say: “Hitler had High Self-Esteem”

A popular twitter account is called “Shit Academics Say” and it often posts funny commentaries on academia (see example above).
 
I borrow the phrase “Shit Academics Say” for this post about shit social psychologists say with a sense of authority and superiority. Social psychologists see themselves as psychological “scientists” who study people and therefore believe that they know people better than you or me. However, their claims are often not based on credible scientific evidence and are merely personal opinions disguised as science.
 
For example, a popular undergraduate psychology textbook claims that “Hitler had high self-esteem,” quoting an article that has been cited over 500 times in the journal “Psychological Science in the Public Interest” (although the title suggests the journal is written for the general public, it is mostly read by psychologists, and the title is supposed to create the illusion that they are actually doing important work that serves the public interest).
 
At the end of the article with the title “Does High Self-Esteem Cause Better Performance, Interpersonal Success, Happiness, or Healthier Lifestyles?” the authors write: 
 
“High self-esteem feels good and fosters initiative. It may still prove a useful tool to promote success and virtue, but it should be clearly and explicitly linked to desirable behavior. After all, Hitler had very high self-esteem and plenty of initiative, too, but those were hardly guarantees of ethical behavior.”
 
In the textbook, this quote is linked to boys who engage in sex at an “inappropriately young age,” which is not further specified (in Canada this would be 14, according to recent statistics).
 
“High self-esteem does have some benefits—it fosters initiative, resilience, and pleasant feelings (Baumeister & others, 2003). Yet teen males who engage in sexual activity at an “inappropriately young age” tend to have higher than average self-esteem. So do teen gang leaders, extreme ethnocentrists, terrorists, and men in prison for committing violent crimes (Bushman & Baumeister, 2002; Dawes, 1994, 1998). “Hitler had very high self-esteem,” note Baumeister and co-authors (2003).”  (Myers, 2011, Social Psychology, 12th edition)
 
Undergraduate students pay (if they pay; hopefully they do not) $200 to be informed that people with high self-esteem are like sexual deviants, terrorists, violent criminals, and Hitler (maybe we should add scientists with flair to the list).
 
The problem is that this is not even true. Students who work with me on fact checking the textbook found this quote in the original article.
 
“There was no [!] significant difference in self-esteem scores between violent offenders and non-offenders, Ms = 28.90 and 28.89, respectively, t(7653) = 0.02, p > .9, d = 0.0001.”
[Technical detail you can skip: Although the df of the t-test look impressive, the study compared 63 violent offenders to 7,590 unmatched, mostly undergraduate student participants (gender not specified, probably mostly female). So the sampling error of this study is high and the theoretical importance of comparing these two groups is questionable.]
 
How Many Correct Citations Could be False Positives? 
Of course, the example above is an exception.  Most of the time a cited reference contains an empirical finding that is consistent with the textbook claim.  However, this does not mean that textbook findings are based on credible and replicable evidence.  Even a Nobel Laureate was conned by flashy findings in small samples that could not be replicated (Train Wreck: Fast & Slow).
Until recently it was common to assume that statistical significance ensures that most published results are true positives (i.e., not false-positive random findings).  However, this is only the case if all results are reported. It has been known since 1959 that this is not the case in psychology (Sterling, 1959). Psychologists selectively publish only results that support their theories.  This practice disables the significance filter that is supposed to keep false positives out of the literature.  The claim that results published in social psychology journals were obtained with rigorous research (Crandall et al., 2018) is as bogus as Volkswagen’s Diesel tests, and the future of social psychology may be as bleak as the future of Diesel engines.
Jerry Brunner and I developed a statistical tool that can be used to clean up the existing literature. Rather than actually redoing 50 years of research, we use the statistical results reported in original studies to apply a significance filter post-hoc.  Our tool is called zcurve.   Below I used zcurve to examine the replicability of studies that were used in the chapter that also included the comparison of sexually active teenagers with violent criminals, terrorists, and Hitler.
Chapter2.Self
More detailed information about the interpretation of the graph above is provided elsewhere (link).  In short, for each citation in the textbook chapter that is used as evidence for a claim, a team of undergraduate students retrieved the cited article and extracted the main statistical result that matches the textbook claim.  These statistical results are then converted into a z-score that reflects the strength of evidence for a claim.  Only significant results are important because non-significant results cannot support an empirical claim (although sometimes non-significant results are falsely used to support claims that there is no effect).
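To make the conversion concrete, here is a hypothetical example (the numbers are made up for illustration and are not from the coded articles): a reported result of t(40) = 2.50 is first converted into a two-sided p-value and then into an absolute z-score.

t  <- 2.50
df <- 40
p  <- 2 * pt(-abs(t), df)   # two-sided p-value of the reported t-test
z  <- -qnorm(p/2)           # strength of evidence as an absolute z-score
z                           # about 2.4; values above 1.96 correspond to p < .05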
Zcurve fits a model to the (density) distribution of significant z-scores (z-scores > 1.96). The shape of the density distribution provides information about the probability that a randomly drawn study from the set would replicate (i.e., reproduce a significant result). The grey line shows the distribution predicted by zcurve; it matches the observed density (dark blue) well. Simulation studies show good performance of zcurve. Zcurve estimates that the average replicability of studies in this chapter is 56%. This number would be reassuring if all studies actually had 56% power. In that case, all studies would be true positives, and roughly every second replication attempt would be expected to succeed.
However, reality does not match this rosy scenario.  In reality, studies vary in replicability.  Studies with z-scores greater than 5 have 99% replicability (see numbers below x-axis).  However, studies with just significant results (z < 2.5) have only 21% replicability.  As you can see, there are a lot more studies with z < 2.5 than studies with z > 5.  So there are more studies with low replicability than studies with high replicability.
The next plot shows model fit (higher numbers = worse fit) for zcurve models with a fixed proportion of false positives. If the data are inconsistent with a fixed proportion of false positives, model fit decreases (higher numbers).
Chapter2.Self.model.fit.png
The graph shows that models with 100%, 90%, or 80% false positives clearly do not fit the data as well as models with fewer false positives. This shows that some textbook claims are based on solid empirical evidence. However, model fit for models with 0% to 60% false positives looks very similar. Thus, it is possible that the majority of claims in the self chapter of this textbook are false positives.
It is even more problematic that textbook claims are often based on a single study with a student sample at one university.  Social psychologists have warned repeatedly that their findings are very sensitive to minute variations in studies, which makes it difficult to replicate these effects even under very similar conditions (Van Bavel et al., 2016), and that it is impossible to reproduce exactly the same experimental conditions (Stroebe and Strack, 2014).  Thus, the zcurve estimate of 56% replicability is a wildly optimistic estimate of replicability in actual replication studies. In fact, the average replicability of studies in social psychology is only 25% (Open Science Collaboration, 2015).
Conclusion
While social psychologists are currently outraged about a psychologist with too many self-citations, they are silent about the crimes against science committed by social psychologists who produced pseudo-scientific comparisons of sexually active teenagers with Hitler and questionable claims that suggest high self-esteem is a sign of pathology. Maybe social psychologists should spend less time criticizing others and more time reflecting on their own errors. 
In official statements and editorials, social psychologists are talking the talk.

SPSP recently published a statement on scientific progress which began “Science advances largely by correcting errors, and scientific progress involves learning from mistakes. By eliminating errors in methods and theories, we provide a stronger evidentiary basis for science that allows us to better describe events, predict what will happen, and solve problems” (SPSP Board of Directors, 2016).  [Cited from Crandall et al., 2018, PSPB, Editorial]

However, they are still not walking the walk. Seven years ago, Simmons et al. (2011) published an article called “False Positive Psychology” that shocked psychologists and raised concerns about the credibility of textbook findings. One year later, Nobel Laureate Daniel Kahneman wrote an open letter to star social psychologist John Bargh, urging him to clean up social psychology. Nothing happened. Instead, John Bargh published a popular book in 2017 that does not mention any of the concerns about the replicability of social psychology in general or his work in particular. Denial is no longer acceptable. It is time to walk the walk and to get rid of pseudo-science in journals and in textbooks.
Hey, it’s spring. What better time to get started with a major house cleaning?

Visual Inspection of Strength of Evidence: P-Curve vs. Z-Curve

Statistics courses often introduce students to a bewildering range of statistical tests. They rarely point out how test statistics are related. For example, although t-tests may be easier to understand than F-tests, every t-test could be performed as an F-test, and the F-value of that F-test is simply the square of the t-value (t^2 or t*t).
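This equivalence is easy to verify in R with arbitrary values (the numbers below are just an illustration):

t  <- 2.0
df <- 30
t^2                                   # 4, the equivalent F-value with df = (1, 30)
2 * pt(-abs(t), df)                   # two-sided p-value of the t-test
pf(t^2, 1, df, lower.tail = FALSE)    # p-value of the F-test: identical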

At an even more conceptual level, all test statistics are ratios of the effect size (ES) and the amount of sampling error (SE). The ratio is sometimes called the signal (ES) to noise (SE) ratio. The higher the signal-to-noise ratio (ES/SE), the more strongly the observed results deviate from the hypothesis that the effect size is zero. This hypothesis is often called the null-hypothesis, but this terminology has created some confusion. It is also sometimes called the nil-hypothesis, the zero-effect hypothesis, or the no-effect hypothesis. Most important, if this hypothesis is true, the test statistic is expected to average zero if the same experiment could be replicated a gazillion times.

The test statistics of different statistical tests cannot be directly compared. A t-value of 2 in a study with N = 10 participants provides weaker evidence against the null-hypothesis than a z-score of 1.96, and an F-value of 4 with df(1,40) provides weaker evidence than an F(10,200) = 4 result. It is only possible to directly compare test values that have the same sampling distribution (z with z, F(1,40) with F(1,40), etc.).
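Converting each result to a p-value makes the comparison explicit (a quick R illustration):

2 * pnorm(-1.96)                     # z = 1.96        ->  p = .05
2 * pt(-2, 9)                        # t(9) = 2        ->  p ≈ .08, weaker evidence
pf(4, 1, 40, lower.tail = FALSE)     # F(1, 40) = 4    ->  p just above .05
pf(4, 10, 200, lower.tail = FALSE)   # F(10, 200) = 4  ->  p far below .001, much stronger evidence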

There are three solutions to this problem. One solution is to use effect sizes as the unit of analysis. This is useful if the aim is effect size estimation.  Effect size estimation has become the dominant approach in meta-analysis.  This blog post is not about effect size estimation.  I just mention it because many readers may be familiar with effect size meta-analysis, but not familiar with meta-analysis of test statistics that reflect the ratio of effect size and sampling error (Effect size meta-analysis: unit = ES; Test Statistic Meta-Analysis: unit ES/SE).

P-Curve

There are two approaches to standardize test statistics so that they have a common unit of measurement. The first approach goes back to Ronald Fisher, who is considered the founder of modern statistics for researchers. Following Fisher, it is common practice to convert test statistics into p-values (this blog post assumes that you are familiar with p-values). P-values have the same meaning independent of the test statistic that was used to compute them. That is, p = .05 based on a z-test, t-test, or F-test provides equally strong evidence against the null-hypothesis (Bayesians disagree, but that is a different story). The use of p-values as a common metric to examine strength of evidence (evidential value) was largely forgotten, until Simonsohn, Simmons, and Nelson (SSN) used p-values to develop a statistical tool that takes publication bias and questionable research practices into account. This statistical approach is called p-curve. P-curve is a family of statistical methods. This post is about the p-curve plot.
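In R, the critical test values that all correspond to the same two-sided p-value of .05 illustrate this common metric:

qnorm(.975)        # z-test:  z = 1.96
qt(.975, 20)       # t-test:  t(20) ≈ 2.09
qf(.95, 1, 40)     # F-test:  F(1, 40) ≈ 4.08
# all three values correspond to a two-sided p-value of exactly .05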

A p-curve plot is essentially a histogram of p-values with two characteristics. First, it only shows significant p-values (p < .05, two-tailed). Second, it plots the p-values between 0 and .05 with 5 bars. The Figure shows a p-curve for Motyl et al.’s (2017) focal hypothesis tests in social psychology. I only selected t-tests and F-tests from studies with between-subject manipulations.

p.curve.motyl

The main purpose of a p-curve plot is to examine whether the distribution of p-values is uniform (all bars have the same height). It is evident that the distribution for Motyl et al.’s data is not uniform. Most of the p-values fall into the lowest range between 0 and .01. This pattern is called “right-skewed.” A right-skewed plot shows that the set of studies has evidential value. That is, some test statistics are based on non-zero effect sizes. The taller the bar on the left, the greater the proportion of studies with an effect. Importantly, meta-analyses of p-values do not provide information about effect sizes because p-values reflect both effect size and sampling error.
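Readers who want to see this shape for themselves can produce a minimal p-curve plot with simulated data (this is only a sketch, not the official p-curve app):

set.seed(1)
p.values <- 2 * pnorm(-abs(rnorm(1000, mean = 2)))   # simulated two-sided p-values from studies with a true effect
p.sig    <- p.values[p.values < .05]                 # p-curve only uses significant results
hist(p.sig, breaks = seq(0, .05, .01),
     xlab = "p-value", main = "P-Curve (simulated data)")   # right-skewed: most p-values fall below .01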

The main inference that can be drawn from a visual inspection of a p-curve plot is how unlikely it is that all significant results are false positives; that is, results where the p-value is below .05 (statistically significant) even though the true effect size is 0 and the deviation from zero is entirely due to sampling error.

The next Figure also shows a plot of p-values.  The difference is that it shows the full range of p-values and that it differentiates more between p-values because p = .09 provides weaker evidence than p = .0009.

all.p.curve.motyl.png

The histogram shows that most p-values are below .001. It also shows very few non-significant results. However, this plot is not more informative than the actual p-curve plot. The only conclusion that is readily visible is that the distribution is not uniform.

The main problem with p-value plots is that p-values do not have interval scale properties. This means that the difference between p = .40 and p = .30 does not reflect the same difference in strength of evidence as the difference between p = .10 and p = .001.
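The corresponding z-scores make this concrete:

z <- function(p) -qnorm(p/2)    # two-sided p-value to absolute z-score
z(.30) - z(.40)                 # about 0.19: a small change in strength of evidence
z(.001) - z(.10)                # about 1.65: a much larger change in strength of evidence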

Z-Curve  

Stouffer developed an alternative to Fisher’s p-value meta-analysis. Every p-value can be transformed into a z-score that corresponds to that p-value. It is important to distinguish between one-sided and two-sided p-values. The transformation requires one-sided p-values, which can be obtained by simply dividing a two-sided p-value by 2. A z-score of -1.96 corresponds to a one-sided p-value of 0.025, and a z-score of 1.96 also corresponds to a one-sided p-value of 0.025. In a two-sided test, the sign no longer matters and the two p-values are added to yield 0.025 + 0.025 = 0.05.

In a standard meta-analysis, we would want to use one-sided p-values to maintain information about the sign. However, if the set of studies examines different hypotheses (as in Motyl et al.’s analysis of social psychology in general), the sign is no longer important. So, the transformed two-sided p-values produce absolute (only positive) z-scores.

The formula in R is Z = -qnorm(p/2)   [p = two.sided p-value]

For very strong evidence this formula creates numerical problems that can be solved by using the log.p = TRUE option in R.

Z = -qnorm(log(p/2), log.p=TRUE)
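A quick check of the two versions of the formula, including a case where the p-value itself is too small to be represented:

p <- 2 * pnorm(-10)                              # two-sided p-value for z = 10 (about 1.5e-23)
-qnorm(p/2)                                      # recovers z = 10
-qnorm(log(p/2), log.p = TRUE)                   # same result
-qnorm(pnorm(-40, log.p = TRUE), log.p = TRUE)   # recovers z = 40, even though the p-value underflows to zero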

p.to.z.transformation.png

The plot shows the relationship between z-scores and p-values.  While z-scores are relatively insensitive to variation in p-values from .05 to 1, p-values are relatively insensitive to variation in z-scores from 2 to 15.

only.sig.p.to.z.transformation

The next figure shows the relationship only for significant p-values.  Limiting the distribution of p-values does not change the fact that p-values and z-values have very different distributions and a non-linear relationship.

The advantage of using (absolute) z-scores is that z-scores have ratio scale properties. A z-score of zero has real meaning and corresponds to the absence of evidence for an effect; the observed effect size is 0. A z-score of 2 is twice as strong as a z-score of 1. For example, given the same sampling error, the effect size for a z-score of 2 is twice as large as the effect size for a z-score of 1 (e.g., d = .2, se = .2, z = d/se = 1; d = .4, se = .2, d/se = 2).

It is possible to create the typical p-curve plot with z-scores by selecting only z-scores above z = 1.96. However, this graph is not informative because the null-hypothesis does not predict a uniform distribution of z-scores. For z-values, the central tendency is more important. When the null-hypothesis is true, p-values have a uniform distribution and we would expect an equal number of p-values between 0 and 0.025 and between 0.025 and 0.050. A two-sided p-value of .025 corresponds to a one-sided p-value of 0.0125, and the corresponding z-value is 2.24.

p = .025
-qnorm(log(p/2),log.p=TRUE)
[1] 2.241403

Thus, the analog to a p-value plot is to examine how many significant z-scores fall into the region from 1.96 to 2.24 versus the region with z-values greater than 2.24.

z.curve.plot1.png

The histogram of z-values is called z-curve.  The plot shows that most z-values are in the range between 1 and 6, but the histogram stretches out to 20 because a few studies had very high z-values.  The red line shows z = 1.96. All values on the left are not significant with alpha = .05 and all values on the right are significant (p < .05).  The dotted blue line corresponds to p = .025 (two tailed).  Clearly there are more z-scores above 2.24 than between 1.96 and 2.24.  Thus, a z-curve plot provides the same information as a p-curve plot.  The distribution of z-scores suggests that some significant results reflect true effects.

However, a z-curve plot provides a lot of additional information. The next plot removes the long tail of rare results with extreme evidence and limits the plot to z-scores in the range between 0 and 6. A z-score of six implies a signal-to-noise ratio of 6:1 and corresponds to a two-sided p-value of about 0.000000002, or roughly 1 out of 500 million events. Even particle physicists settle for z = 5 to decide that an effect was observed, because it is so unlikely for such a test result to occur by chance.

> pnorm(-6)*2
[1] 1.973175e-09

Another addition to the plot is to include a line that identifies z-scores between 1.65 and 1.96.  These z-scores correspond to two-sided p-values between .05 and .10. These values are often published as weak but sufficient evidence to support the inference that a (predicted) effect was detected. These z-scores also correspond to p-values below .05 in one-sided tests.

z.curve.plot2

A major advantage of z-scores over p-values is that p-values are conditional probabilities based on the assumption that the null-hypothesis is true, but this hypothesis can be safely rejected with these data.  So, the actual p-values are not important because they are conditional on a hypothesis that we know to be false.   It is like saying, I would be a giant if everybody else were 1 foot tall (like Gulliver in Lilliput), but everybody else is not 1 foot tall and I am not a giant.

Z-scores are not conditioned on any hypothesis. They simply show the ratio of the observed effect size and sampling error. Moreover, the distribution of z-scores tells us something about the ratio of the true effect sizes and sampling error. The reason is that sampling error is random and averages out to zero. Therefore, the mode, median, or mean of a z-curve plot tells us something about the ratio of the true effect sizes and sampling error. The more the center of a distribution is shifted to the right, the stronger is the evidence against the null-hypothesis. In a p-curve plot, this is reflected in the height of the bar with p-values below .01 (z > 2.58), but a z-curve plot shows the actual distribution of the strength of evidence and makes it possible to see where the center of the distribution is (without more rigorous statistical analyses of the data).

For example, in the plot above it is not difficult to see the mode (peak) of the distribution. The most common z-values are between 2 and 2.2, which correspond to p-values of .046 (pnorm(-2)*2) and .028 (pnorm(-2.2)*2). This suggests that the modal study has a ratio of 2:1 for effect size over sampling error.

The distribution of z-values does not look like a normal distribution. One explanation for this is that studies vary in sampling errors and population effect sizes. Another explanation is that the set of studies is not a representative sample of all studies that were conducted. One way to examine this is to fit a simple model to the data that assumes representative sampling of studies (no selection bias or p-hacking) and that all studies have the same ratio of population effect size over sampling error. The median z-score provides an estimate of the center of the sampling distribution. The median for these data is z = 2.56. The next picture shows the predicted sampling distribution of this model, which is an approximately normal distribution with a folded tail.
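The predicted curve can be sketched with a few lines of R, using the median of 2.56 as the common noncentrality (this is only an illustration of the model, not the code that produced the figure):

z <- seq(0, 6, .001)
m <- 2.56                                # median of the observed absolute z-values
y <- dnorm(z, m, 1) + dnorm(z, -m, 1)    # folded normal density for absolute z-values
plot(z, y, type = "l", xlab = "(absolute) z-values", ylab = "Density")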

 

z.curve.plot3

A comparison of the observed and predicted distribution of z-values shows some discrepancies. Most important is that there are too few non-significant results.  This observation provides evidence that the results are not a representative sample of studies.  Either non-significant results were not reported or questionable research practices were used to produce significant results by increasing the type-I error rate without reporting this (e.g., multiple testing of several DVs, or repeated checking for significance during the course of a study).

It is important to see the difference between the philosophies of p-curve and z-curve. P-curve assumes that non-significant results provide no credible evidence and discards them even if they are reported. Z-curve first checks whether non-significant results are missing. As a result, p-curve is not a suitable tool for assessing publication bias or other problems, whereas even a simple visual inspection of a z-curve plot provides information about publication bias and questionable research practices.

z.curve.plot4.png

The next graph shows a model that selects for significance.  It no longer attempts to match the distribution of non-significant results.  The objective is only to match the distribution of significant z-values.  You can do this by hand and simply try out different values for the center of the normal distribution.  The lower the center, the more z-scores are missing because they are not significant.  As a result, the density of the predicted curve needs to be adjusted to reflect the fact that some of the area is missing.

center.z = 1.8   # pick a value for the center of the folded normal
z = seq(0, 6, .001)   # create the range of z-values
y = dnorm(z, center.z, 1) + dnorm(z, -center.z, 1)   # get the density for a folded normal
y2 = y   # duplicate densities
y2[z < 1.96] = 0   # simulate selection bias: density for non-significant results is zero
scale = sum(y2)/sum(y)   # scaling factor so that the area under the curve for significant results is 1
y = y / scale   # adjust the densities accordingly

# draw a histogram of z-values
# input is z.val.input
# example: z.val.input = abs(rnorm(1000, 2))
hist(z.val.input, freq = FALSE, xlim = c(0, 6), ylim = c(0, 1), breaks = seq(0, 20, .2),
     xlab = "", ylab = "Density", main = "Z-Curve")

abline(v = 1.96, col = "red")   # draw the line for alpha = .05 (two-tailed)
abline(v = 1.65, col = "red", lty = 2)   # draw the line for marginal significance (alpha = .10, two-tailed)

par(new = TRUE)   # superimpose the next plot on the histogram

# draw the predicted sampling distribution
plot(z, y, type = "l", lwd = 4, ylim = c(0, 1), xlim = c(0, 6),
     xlab = "(absolute) z-values", ylab = "")

Although this model fits the data better than the previous model without selection bias, it still has problems fitting the data.  The reason is that there is substantial heterogeneity in the true strength of evidence.  In other words, the variability in z-scores is not just sampling error but also variability in sampling errors (some studies have larger samples than others) and population effect sizes (some studies examine weak effects and others examine strong effects).

Jerry Brunner and I developed a mixture model to fit a predicted distribution to the observed distribution of z-values. In a nutshell, the mixture model combines multiple (folded) normal distributions. Jerry’s z-curve lets the centers of the normal distributions move around and gives them different weights. Uli’s z-curve uses fixed centers one standard deviation apart (0, 1, 2, 3, 4, 5 & 6) and only varies the weights to fit the model to the data. Simulation studies show that both methods work well. Jerry’s method works a bit better if there is little variability, and Uli’s method works a bit better with large variability.
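The following is only a rough sketch of the logic of the fixed-centers approach, not the actual z-curve code (the function name and the simulated data are made up for illustration): weights for folded normals with means 0 to 6 are fitted to the density of the significant z-values, and the estimate is the weighted average of the corresponding powers.

sketch.zcurve <- function(z.sig, means = 0:6, crit = 1.96, max.z = 6) {
  grid <- seq(crit, max.z, .01)
  obs  <- density(z.sig[z.sig > crit & z.sig < max.z],
                  from = crit, to = max.z, n = length(grid))$y
  obs  <- obs / sum(obs)                              # observed density of significant z-values
  comp <- sapply(means, function(m) {                 # component densities, truncated at 1.96
    d <- dnorm(grid, m, 1) + dnorm(grid, -m, 1)       # folded normal with fixed center m
    d / sum(d)
  })
  pow  <- pnorm(means - crit) + pnorm(-means - crit)  # power of each component
  loss <- function(par) {                             # squared distance between observed and predicted curve
    w <- exp(par) / sum(exp(par))
    sum((obs - as.vector(comp %*% w))^2)
  }
  w <- exp(optim(rep(0, length(means)), loss)$par)
  w <- w / sum(w)                                     # estimated component weights
  sum(w * pow)                                        # weighted average power (replicability estimate)
}

# example with simulated heterogeneous studies
ncp   <- runif(5000, 0, 4)                            # true strength of evidence varies across studies
z.all <- abs(rnorm(5000, ncp, 1))
sketch.zcurve(z.all[z.all > 1.96])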

The next figure shows the result for Uli’s method because the data have large variability.

z.curve.plot5

The dark blue line in the figure shows the density distribution for the observed data.  A density distribution assigns densities to an observed distribution that does not fit a mathematical sampling distribution like the standard normal distribution.   We use the Kernel Density Estimation method implemented in the R base package.

The grey line shows the predicted density distribution based on Uli’s z-curve method. The z-curve plot makes it easy to see the fit of the model to the data, which is typically very good. The result of the model is the weighted average of the true power values that correspond to the centers of the normal distributions. For this distribution, the weighted average is 48%.

The 48% estimate can be interpreted in two ways. First, it means that if researchers randomly sampled from the set of studies in social psychology and were able to exactly reproduce the original study (including sample size), they would have a probability of 48% of replicating a significant result with alpha = .05. The complementary interpretation is that if researchers were successful in replicating all studies exactly, the reproducibility project would be expected to produce 48% significant results and 52% non-significant results. Because the average power of studies predicts the success of exact replication studies, Jerry and I refer to the average power of studies that were selected for significance as replicability. Simulation studies show that our z-curve methods have good large-sample accuracy (+/- 2%), and we adjust for the small estimation bias in large samples by computing a conservative confidence interval that extends the upper limit by 2% and lowers the lower limit by 2%.

Below is the R-Code to obtain estimates of replicability from a set of z-values using Uli’s method.

<<<Download Zcurve R.Code>>>

Install the R code on your computer; you can then run it from anywhere with the following lines:

location = "<user folder>/"   # provide the location where the z-curve code is stored
source(paste0(location, "fun.uli.zcurve.sharing.18.1.R"))   # read the code
run.zcurve(z.val.input)   # get z-curve estimates with z-values as input

On Power Posing, Power Analysis, Publication Bias, Peer-Review, P-curve, Pepsi, Porsche, and why Psychologists Hate Zcurve

Yesterday, Forbes Magazine was happy to tell its readers that “Power Posing Is Back: Amy Cuddy Successfully Refutes Criticism.” I am not blaming a journalist for making a false claim that has been published in the premier journal of the American Psychological Society (APS). Why should a journalist be responsible for correcting all the errors that reviewers and the editor, who are trained psychological scientists, missed? Rather, the Forbes article highlights that APS is more concerned about the image of psychological science than the scientific validity of the data and methods that are used to distinguish opinions from scientific facts.

This blog post shows first of all that power posing researchers have used questionable research practices to produce way more significant results than the weak effects and small samples justify.  Second, it shows that the meta-analysis used a statistical method that is flawed and overestimates evidence in favor of power-posing effects.

Finally, I point out how a better statistical tool shows that the power-posing literature does not provide credible evidence for replicable effects and cannot be used to make bold claims about the effectiveness of power posing as a way to enhance confidence and performance in stressful situations.

Before I start, I need to make something very clear. I don’t care about power posing or Amy Cuddy, and I have a track record of attacking powerful men with flair, not power-posing women (so far, I have been threatened with legal actions by 4 men and 1 woman). So, this post has nothing to do with gender. The primary goal is to show problems with the statistical method that Cuddy and colleagues used. This is not their fault. The method has been heavily advertised, although it has never been subjected to peer-review or published in a peer-reviewed journal. On top of this, Jerry Brunner and I have tried to publish our criticism of this method since 2016, and journals have rejected our manuscript because this finding was deemed of insufficient importance. The main point of this blog post is to show that the p-curve meta-analysis by Cuddy et al. is problematic because they used p-curve (which was developed by three men, not a woman). If Cuddy et al. had used z-curve, their conclusions would have been different.
Before I start, I also need to declare a conflict of interest. I am the co-inventor of z-curve, and if I am right that z-curve is a lot better than p-curve, it would enhance my reputation and I might benefit from this in the future (so far, I have only suffered from rejections). So, I am presenting my case with a clear bias in favor of zcurve. Whether my arguments are strong or weak is for the reader to decide.
P-Curve of Power-Posing
The main p-curve analysis includes 53 studies. 11 studies had a non-significant result and were not used because p-curve (and z-curve) only use significant results.  The Figure shows the pcurve results that were reported in the article.
Pcurve.Power.Posing.png
The key finding in this figure is the reported average power estimate of 44%, with a 90% CI ranging from 23% to 63%.
The figure also includes the redundant information that the 90% CI implies that we can safely reject the null-hypothesis of no effect (p < .0001), but not the hypothesis that average power is 33%. After all, 33% falls between the lower bound of 23% and the upper bound of 63% of the 90% CI (p > alpha = 1 – .90 = .10).
The p-curve histogram shows that p-values below .01 are the most frequent p-values compared to p-values in the other four segments of the histogram. This visual information is also redundant with the information that average power is above 5%, because any set of studies with average power that is not close to alpha (5%) will show this pattern.
These results are used to support the claim that power-posing has real effects (see quotes below).
Conclusions.Power.Posing.png
Z-Curve and Power-Posing
Z-curve produces different results and leads to different conclusions. The results presented here were included in a manuscript that was submitted to a new APS journal; APS also publishes Psychological Science, in which Cuddy’s p-curve analysis was reported. One of the reviewers was a co-author of p-curve, who has just as much of a conflict of interest in favor of p-curve as I have in favor of z-curve. So a biased peer-review is partially responsible for the fact that my criticism of p-curve is published as a blog post and not (yet) in a peer-reviewed journal. The zcurve plot below was included in the rejected manuscript, so readers can make up their own minds about whether the rejection was justified. Based on this plot, I challenge two claims about power-posing that were made by Cuddy et al. based on their p-curve analysis.
Z-Curve Shows Clear Evidence that Questionable Research Practices were Used by Power-Posing Researchers
Cuddy et al. write “no p-curve analysis by either set of authors yielded results that were left skewed or that suggested that the existing evidence was p-hacked.”
This statement is true, but it overlooks the fact that p-curve is often unable to detect p-hacking and other deceptive practices, like hiding studies that failed to provide evidence for power posing. In contrast, z-curve makes it easy to see the effect of deceptive practices that are euphemistically called “questionable research practices.”
Power-Posing Meta-Analysis
Zcurve makes it easy to detect the influence of QRPs because the z-curve plot (a) includes the non-significant results (to the left of 1.96), (b) distinguishes marginally significant results that are often used as weak but sufficient support for a hypothesis (z > 1.65 & z < 1.96), and (c) differentiates a lot more among the significant results that p-curve lumps into one category with p < .01 (z > 2.6).
The z-curve plot reveals that we are not looking at a healthy body of evidence. Zcurve projects the model prediction into the range of non-significant results. Although these projections are more biased because they are not based on actual data, the area below the projected curve is very large. In addition, the steep drop from 1.96 to 1.65 in the histogram shows clear evidence that questionable research practices eliminated non-significant results that we would expect from a distribution with a mode at 1.96.
The influence of questionable research practices is also implied by the z-curve estimate of 30% average power for significant results. In contrast, the success rates are 79% if marginally significant results are excluded and 91% if they are included. As power is a predictor of the probability of obtaining a significant result, it is implausible that a set of studies with average power of 30% could produce 79% or 91% demonstrations that power posing works. In the best-case scenario, the average would be a fixed effect that is true for all studies (each study has 30% power). In this case, we expect more than two non-significant results for every significant result (30% successes and 70% failures), so with 42 significant results there should be roughly 98 non-significant results. Even counting the marginally significant ones, we see only 11 non-significant results.
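A quick back-of-the-envelope check of this argument in R:

power <- .30
n.sig <- 42
n.sig * (1 - power) / power                                      # about 98 expected non-significant results
pbinom(n.sig - 1, size = 53, prob = power, lower.tail = FALSE)   # probability of 42 or more successes in 53 studies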
Thus, the claim that there is no evidence for p-hacking in power-posing research is totally inconsistent with the evidence. Not surprisingly, the authors also do not use a validated test of bias, the magic index, to test whether magic was used to produce 91% at least marginally significant results with just 30% power (Schimmack, 2012; also Francis, 2012).
In conclusion, p-curve is not a suitable statistical tool to examine the influence of p-hacking and other deceptive practices. To demonstrate with p-curve that there is no evidence for these practices is like saying that a polar bear does not notice when it snows during hibernation. It is true, but totally irrelevant for the question of whether it is snowing or not.
Power Estimation
The main aim of p-curve is to examine whether significant results provide evidence (e.g., for power posing effects) even if p-hacking or other deceptive methods were used. Initially this was done by means of significance tests. If p < .05, a set of studies was said to provide evidence for real effects, whereas p > .05 showed that there was insufficient evidence to reject the hypothesis that questionable research practices alone explained the significant results. In other words, the null-hypothesis is that p-hacking alone produced 44 significant results without any real power posing effects.
The problem with relying exclusively on p-values is that p-values are sensitive to both the effect size (how strong is the evidence for an effect) and sampling error (how much error is there in the estimated effect size). As sample sizes increase, it gets easier and easier to show that at least some studies contain some evidence, even if the evidence is weak. To address this concern, it is important to complement p-values with information about effect sizes, and this can easily be done with confidence intervals. The p-curve result of 44% power tells us that the average strength of evidence is moderate. It is not close to 5%, which would indicate that all studies are false positives, and it is not close to 100%, which would show that all studies are likely to replicate in an exact replication study.
The 90% confidence interval suggests that power could be lower than 44%, but is unlikely to be lower than 23%. At the upper end, the 63% value also tells us that it is unlikely that the average study had more than 60% power. Thus, power posing studies fall considerably short of the criterion for well-designed studies, namely that they should have 80% power (Cohen, 1988).
It is therefore important to distinguish between two meanings of strong evidence.  Cuddy et al. (2018) are justified in claiming that a bias-corrected estimate of 44% average power in a set of 44 significant studies provides strong evidence against the null-hypothesis that all studies are false positive results. However,  average power of 44% also shows that each study individually has low power to detect power posing effects.
Like p-curve, z-curve aims to estimate the average power of studies that are selected for significance.  The main advantage of z-curve is that it allows for variation in power across studies (heterogeneity).  This seems a plausible assumption for a meta-analysis that includes manipulation checks of power feelings and actual outcome measures like risk taking. Evidently, we would expect stronger effects for feelings that are induced by a manipulation aimed at changing feelings than on an outcome like performance in public speaking or risk taking.
Power-Posing Meta-Analysis
The zcurve (the same figure as above is shown here so that you do not have to scroll back and forth) provides clear evidence of heterogeneity. Most z-scores pile up close to significance (z-scores < 2.6 imply p-values greater than .01). However, there are three studies with strong evidence, and the range information shows that there are even some (actually only 1, not shown) z-scores above 6 (highest value = 7.23).
In our rejected manuscript, we showed with simulation studies that pcurve has problems with strong effects (high z-scores): pcurve estimates of average power increase a lot more than they should when a few studies with very strong evidence are added to a dataset. This estimation bias explains the discrepancy between the pcurve estimate of 44% average power and the zcurve estimate of 30% average power.
As I already pointed out in the rejected article, the bad behavior of pcurve is evident when the four studies with strong evidence are excluded. The p-curve estimate drops from 44% to 13%. Of course, the average should decrease when the strongest evidence is excluded, but a drop of 31 percentage points is not plausible when only four studies are excluded. Going in reverse, if 4 studies with 100% power were added to 40 studies with 13% power, the new average power would be ((40*.13)+(4*1))/44 ≈ 21%, not 44%.
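The reverse calculation can be checked directly in R:

(40 * .13 + 4 * 1) / 44    # about .21, far below the reported .44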
In conclusion, the 44% average and the 23% lower bound of the 90% (alpha = 10% type-I error probability) confidence interval reported by Cuddy et al. are inflated because they used a biased tool to estimate average power.   Z-curve provides lower estimates of average power and the lower bound of the 95%CI is only 13%.
Even 13% average power in 44 studies would provide strong evidence against the null-hypothesis that all 44 studies are false positives, and 30% average power clearly means that these studies were not all p-hacked false positives. Although this may be considered good news if we have a cynical view of psychological scientists, the null-hypothesis is also a very low bar to cross. We can reject the hypothesis that 100% of power posing results are p-hacked false positives, but we can also reject the hypothesis that most studies were well-designed studies with 80% power, which would yield a minimum of 40% average power (50% of studies with 80% power yields an average of 40% just for the well-powered studies).
Heterogeneity of Power in Power Posing Studies
The presence of heterogeneity in power across power posing studies also has implications for the interpretation of average power. An average of 30% power can be obtained in many different ways. It could be that all studies have 30% power. In this case, all 44 studies with significant results, which used different manipulations or outcome variables, would be true positives. The 30% power estimate would only tell us that the studies had low power and that reported effect sizes are considerably inflated, but all studies would be expected to replicate if they were repeated with larger samples to increase power. In other words, there would be no need to be concerned about a false positive psychology where most published results are false (positives). All results would be true positives.
In contrast to this rosy and delusional interpretation of averages, it is also possible that the average is a result of a mixture of false and true positives. In the most extreme case, we can get an average of 30% power with 15 out of 44 (34%) false positive results, if all other studies have 100% power. Even this estimate is only an estimate that depends on numerous assumptions and the percentage of false positives could be higher or lower.
It is also not clear which of the significant results are false positives and which results would be replicable in larger samples with higher power. So, an average of 30% power tells us only that some of the significant results are true positives, but it does not tell us which studies produced true positives with meaningful effect sizes. Only studies with 80% power or more can be expected to replicate with only slightly attenuated effect sizes. But which power posing studies had 80% power? The average of 30% does not tell us this.
zcurve.Power.Posing2.png
Observed power (or the corresponding z-score) is correlated with true power. The correlation is too weak to use observed power as a reliable indicator of true power for a single study, but in a set of studies, higher z-scores are more likely to reflect higher levels of true power. Z-curve uses this correlation to estimate average power for different regions in the set of significant studies. These estimates are displayed below the x-axis. For z-scores between 2 and 2.5 (roughly p = .05 to .01), average power is only 22%. However, for z-scores above 4, average power is above 50%. This finding suggests that a small subset of power posing studies is replicable in exact replication studies, whereas the majority of studies have low power, and the outcome of replication studies, even with larger samples, is uncertain because some of these studies may be false positives.
Thus, modeling heterogeneity has the advantage that it becomes possible to examine variability in true power to some extent. If all studies had the same power, all segments would have the same estimate. As heterogeneity increases, the true power of just-significant results (p < .05 & p > .01) decreases and the power of studies with strong evidence (p < .001) increases. For power-posing, a few studies with strong evidence have a strong influence on average power.
Another novel feature of z-curve (that still needs to be validated with extensive simulation studies) is the ability to fit models that make assumptions about the percentage of false positive results.  It is not possible to estimate the actual percentage of false positives, but it is possible to see what the worst case scenario would be.  To do so, a new (beta) version of zcurve fits models with 0 to 100% false positives and tries to optimize prediction of the observed distribution of z-scores as much as possible. A model that makes unrealistic assumptions will not fit the data well.
The plot below shows that a model with 100% false positive results does not fit the data. This is already implied by the 95%CI of average power that does not include 5%.  The novel contribution of this plot is to see at what point the model can fit the observed distribution with a maximum number of false positives.  The scree plot below suggests that models with up to 40% false positives fit the data about as well as a model with 0% false positives.  So, it is possible that there are no false positives, but it is also possible that there are up to 40% false positives.  In this case, 60% of studies would have about  50% power (.5 * .6 = .30) and 40% would have 5% power (which is the probability of false positives to produce significant results with alpha = 5%; .40 * .05 = .02 = 2%).
true.null.percentage.plot
In conclusion, it is misleading to interpret an average power of 30% as strong evidence if a set of studies is heterogeneous. The reason is that different studies with different manipulations or outcome variables produced different results, and the average does not apply to any individual study. In this way, using the average to draw inferences about individual studies is like stereotyping. Just because a study was drawn from a sample with average power of 30% does not mean that this study has 30% power. At the level of individual studies, most of these studies produced evidence for power posing with the help of luck and questionable research practices, and exact replication studies are unlikely to be successful. Thus, any strong conclusions about power posing based on these studies are not supported by strong evidence.
Again, this is not a problem of Cuddy’s analysis. The abstract correctly reports the results of their p-curve analysis.
“Several p-curve analyses based on a systematic review of the current scientific literature on adopting expansive postures reveal strong evidential value for postural-feedback (i.e., power-posing) effects and particularly robust evidential value for effects on emotional and affective states (e.g., mood and evaluations, attitudes, and feelings about the self).” (Cuddy et al., Psychological Science, 2018)

The problem is that the p-curve analysis is misleading because it does not reveal the strong influence of questionable research practices, it overestimates average power, and it ignores heterogeneity in the strength of evidence across studies.

Peer-Review 
Prominent representatives of the American Psychological Society (I am proud not to be a member) have warned about the increasing influence of bloggers, who were unkindly called method terrorists. APS wants you to believe that closed and anonymous peer-review is working as a quality control mechanism and that bloggers are frustrated, second-rate scientists who are unable to publish in top journals.
The truth is that peer-review is not working. Peer-review in academia works about as well as asking one cat to make sure that the other cat doesn’t eat the chicken while you are still fixing a salad (enjoy your vegetarian dinner).
The main points about p-curve in general and the power-posing p-curve in particular were made in a manuscript that Jerry Brunner and I submitted to a new journal of APS that claims to represent Open Science and aims to improve scientific standards in psychological science. Given the conflict of interest, I requested that the main author of p-curve not be a reviewer. The editor responded to this request by making another p-curve author a reviewer, and this reviewer submitted a review that ignored major aspects of our criticism of p-curve (including simulation studies that prove our point) and objected to my criticism of the p-curve power posing meta-analysis. The manuscript was rejected without an opportunity to respond to misleading reviewer comments. The main reason for the rejection was that there was insufficient interest in p-curve or z-curve, while at the same time another APS journal had accepted the p-curve paper by Cuddy that is now cited as strong evidence for power posing effects.
Whether this was the right decision or not depends of course on the strength of the arguments that I presented here.  As I said, I can only believe that they are strong because I wouldn’t be writing this blog post if I thought they were weak.  So, I can only draw (possibly false) inferences about peer-review and APS based on the assumption that I presented strong arguments.  Based on this assumption, I feel justified in returning the favor for being branded a method terrorist and for being called out in another APS journal as a hateful blogger.
In response to the reaction by APS representatives to z-curve, I feel justified in calling some areas of psychology, mostly experimental social psychology, which I have examined thoroughly, a failed science, and in calling APS (and APA) Errorist Organizations (not the t-word that APS representatives used to label critics like me) with no interest in examining the errors that psychological science has made. I also believe that the rejection of manuscripts that show the validity of zcurve can be explained by fear of what this method may reveal. Just like professional athletes who use performance-enhancing substances are afraid of doping tests, scientists who used questionable research methods feel uneasy when a statistical method can reveal these practices. Even if they no longer use these practices today, their past published work is akin to frozen urine samples from an era when doping tests were unable to detect these drugs. Although fear of the truth is just one possible explanation, I find it difficult to come up with alternative explanations for dismissing a method that can examine the credibility and replicability of published findings as uninteresting and irrelevant.
Pepsi and Porsche 
P-Curve and Z-Curve have the same objective (there is also an effect size p-curve, but I am focusing on power here).  They both aim to estimate average power of a set of studies that were selected for significance.  When average power is low (which also implies low heterogeneity) both methods produce similar results and in some simulations p-curve performs slightly better (as we demonstrated ourselves in our own simulations).  So, one could think about pcurve and zcurve as two very similar products like pepsi or coke.  Not exactly the same but similar enough.  Competition between pcurve and zcurve would be mostly limited to marketing (pcurve has an online app, zcurve does not – yet).
However, I hope that I made some convincing arguments why pcurve and zcurve are more like a car and a Porsche (Made in Germany).  They both get you to where you want to go most of the time, but a Porsche offers you a lot more.  Zcurve is like the Porsche in this analogy, but it is also free (a free online app will be available soon).
Conclusion
My conclusion is that Zcurve is a great tool that makes it possible to examine the credibility of published results. The tool can be applied to any set of studies, whether they are studies of a specific topic or a heterogeneous set of studies published in a journal. It can even be used to estimate the replicability of psychology as a whole, based on thousands of articles and over a million test statistics, and it can reveal whether recent initiatives for rescuing psychological science are actually having an effect on the credibility and replicability of published results.
Whether this conclusion is right or wrong is not for me to decide. This decision will be made by the process called science, but for this process to work, the arguments and the evidence need to be examined by scientists. APS and APA made it clear that they do not want this to happen in their peer-reviewed, for-pay journals, but that will not stop me from exposing zcurve and my reputation to open and harsh criticism, and this blog and many other blog posts on this site allow me to do this.
As always comments are welcome.