All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish between studies with high power (good science) and studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Replicability Review of 2016

2016 was surely an exciting year for anybody interested in the replicability crisis in psychology. Some of the biggest news stories in 2016 came from attempts by the psychology establishment to downplay the replication crisis in psychological research (Wired Magazine). At the same time, 2016 delivered several new replication failures that provide further ammunition for the critics of established research practices in psychology.

I. The Empire Strikes Back

1. The Open Science Collaborative Reproducibility Project was flawed.

Daniel Gilbert and Tim Wilson published a critique of the Open Science Collaborative in Science. According to Gilbert and Wilson, the project, which replicated 100 original research studies and reported a replication success rate of only 36%, was riddled with errors. Consequently, the low success rate only reveals the incompetence of the replicators and has no implications for the replicability of original studies published in prestigious psychological journals like Psychological Science. Science Daily suggested that the critique overturned the landmark study.

[Screenshot: Science Daily headline claiming the landmark study was overturned]

Nature published a more balanced commentary. In an interview, Gilbert explains that “the number of studies that actually did fail to replicate is about the number you would expect to fail to replicate by chance alone — even if all the original studies had shown true effects.” This quote is rather strange if we really consider the replication studies to be flawed and error-riddled. If the replication studies were bad, we would expect fewer studies to replicate than chance alone would predict. If the success rate of 36% is consistent with chance alone, the replication studies are just as good as the original studies, and the only reason for non-significant results would be chance. Thus, Gilbert’s comment implies that he believes the typical statistical power of a study in psychology is about 36%. Gilbert doesn’t seem to realize that he is inadvertently admitting that published articles report vastly inflated success rates, because 97% of the original studies reported a significant result. To report 97% significant results with an average power of 36%, researchers must either be hiding studies that failed to support their favored hypotheses in proverbial file-drawers or using questionable research practices to inflate the evidence in favor of their hypotheses. Thus, ironically, Gilbert’s comments confirm the critics’ explanation that the low success rate in the reproducibility project reflects selective reporting of evidence that supports authors’ theoretical predictions.
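
To put a rough number on this argument (my own back-of-the-envelope check, not a calculation from the article or the commentary), one can ask how likely a 97% success rate would be if every original study really had only 36% power:

```python
# Back-of-the-envelope check: how plausible is a 97% success rate if the
# average power of the original studies were only 36%?
from scipy.stats import binom

power = 0.36              # average power implied by Gilbert's "chance alone" argument
n_studies = 100           # original studies in the OSC reproducibility project
reported_successes = 97   # original studies that reported a significant result

# Expected number of significant results under honest reporting
print(binom.mean(n_studies, power))                        # 36.0

# Probability of getting 97 or more significant results out of 100
print(binom.sf(reported_successes - 1, n_studies, power))  # vanishingly small (< 1e-30)
```

The gap between the expected 36 and the reported 97 significant results is exactly the gap that file-drawers and questionable research practices have to explain.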

2. Contextual Sensitivity Explains Replicability Problem in Social Psychology

Jay van Bavel and colleagues made a second attempt to downplay the low replicability of published results in psychology. Van Bavel even got to write about it in the New York Times.

[Screenshot: Van Bavel's New York Times op-ed]

Van Bavel blames the Open Science Collaboration for overlooking the importance of context. “Our results suggest that many of the studies failed to replicate because it was difficult to recreate, in another time and place, the exact same conditions as those of the original study.” This statement caused a lot of bewilderment. First, the OSC carefully tried to replicate the original studies as closely as possible. At the same time, they were sensitive to the effect of context. For example, if a replication study of an original study in the US was carried out in Germany, stimulus words were translated from English into German, because one might expect that native German speakers would not respond the same way to the original English words as native English speakers. However, switching languages means that the replication study is not identical to the original study. Maybe the effect can only be obtained with English speakers. And if the study was conducted at Harvard, maybe the effect can only be replicated with Harvard students. And if the study was conducted primarily with female students, it may not replicate with male students.

To provide evidence for his claim, Jay van Bavel obtained subjective ratings of contextual sensitivity. That is, raters guessed how sensitive the outcome of a study is to variations in context. These ratings were then used to predict the success of the 100 replication studies in the OSC project.

Jay van Bavel proudly summarized the results in the NYT article. “As we predicted, there was a correlation between these context ratings and the studies’ replication success: The findings from topics that were rated higher on contextual sensitivity were less likely to be reproduced. This held true even after we statistically adjusted for methodological factors like sample size of the study and the similarity of the replication attempt. The effects of some studies could not be reproduced, it seems, because the replication studies were not actually studying the same thing.”

The article leaves out a few important details. First, the correlation between contextual sensitivity ratings and replication success was small, r = .20. Thus, even if contextual sensitivity contributed to replication failures, it would only explain replication failures for a small percentage of studies. Second, the authors used several measures of replicability, and some of these measures failed to show the predicted relationship. Third, the statement makes the elementary mistake of confusing correlation and causality. The authors merely demonstrated that subjective ratings of contextual sensitivity predicted outcomes of replication studies. They did not show that contextual sensitivity caused replication failures. Most importantly, Jay van Bavel failed to mention that they also conducted an analysis that controlled for discipline. The Open Science Collaborative had already demonstrated that studies in cognitive psychology are more replicable (50% success rate) than studies in social psychology (an awful 25%). In an analysis that controlled for differences between disciplines, contextual sensitivity was no longer a statistically significant predictor of replication failures. This hidden fact was revealed in a commentary (or should we say correction) by Yoel Inbar. In conclusion, this attempt at propping up the image of social psychology as a respectable science with replicable results turned out to be another embarrassing example of sloppy research methodology.

3. Anti-Terrorism Manifesto by Susan Fiske

Later that year, Susan Fiske, former president of the Association for Psychological Science (APS), caused a stir by comparing critics of established psychology to terrorists (see Business Insider article). She later withdrew the comparison to terrorists in response to the criticism of her remarks on social media (APS website).

[Image: Susan Fiske]

Fiske attempted to defend established psychology by arguing that the field is self-correcting and does not require self-appointed social-media vigilantes. She claimed that these criticisms were destructive and damaging to psychology.

“Our field has always encouraged — required, really — peer critiques.”

“To be sure, constructive critics have a role, with their rebuttals and letters-to-the-editor subject to editorial oversight and peer review for tone, substance, and legitimacy.”

“One hopes that all critics aim to improve the field, not harm people. But the fact is that some inappropriate critiques are harming people. They are a far cry from temperate peer-reviewed critiques, which serve science without destroying lives.”

Many critics of established psychology did not share Fiske’s rosy and false description of the way psychology operates.  Peer-review has been shown to be a woefully unreliable process. Moreover, the key criterion for accepting a paper is that it presents flawless results that seem to support some extraordinary claims (a 5-minute online manipulation reduces university drop-out rates by 30%), no matter how these results were obtained and whether they can be replicated.

In her commentary, Fiske is silent about the replication crisis and does not reconcile her image of a critical peer-review system with the fact that only 25% of social psychological studies are replicable and some of the most celebrated findings in social psychology (e.g., elderly priming) are now in doubt.

The rise of blogs and Facebook groups that break with the rules of the establishment poses a threat to the APS establishment, whose main goal is lobbying for psychological research funding in Washington. By painting critics as terrorists, Fiske tried to dismiss criticism of established psychology without having to engage with the substantive arguments for why psychology is in crisis.

In my opinion, her attempt to do so backfired: the response to her column showed that the reform movement is gaining momentum and that few young researchers are willing to prop up a system that is more concerned about publishing articles and securing grant money than about making real progress in understanding human behavior.

II. Major Replication Failures in 2016

4. Epic Failure to Replicate Ego-Depletion Effect in a Registered Replication Report

Ego-depletion is a big theory in social psychology, and the inventor of the ego-depletion paradigm, Roy Baumeister, is arguably one of the biggest names in contemporary social psychology. In 2010, a meta-analysis seemed to confirm that ego-depletion is a highly robust and replicable phenomenon. However, this meta-analysis failed to take publication bias into account. In 2014, a new meta-analysis revealed massive evidence of publication bias. It also found that there was no statistically reliable evidence for ego-depletion after taking publication bias into account (Slate, Huffington Post).

[Image: Ego-depletion theory crumbling]

A team of researchers, including the first author of the supportive meta-analysis from 2010, conducted replication studies using the same experiment in 24 different labs. Each of these studies alone would have had a low probability of detecting a small ego-depletion effect, but the combined evidence from all 24 labs made it possible to detect an ego-depletion effect even if it were much smaller than published articles suggest. Yet, the project failed to find any evidence for an ego-depletion effect, suggesting that it is much harder to demonstrate ego-depletion effects than one would believe based on over 100 published articles with successful results.
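
A rough power calculation illustrates the logic of pooling labs. The per-lab sample size below is a made-up illustration (the post does not give the actual numbers); only the count of 24 labs comes from the report:

```python
# Toy power calculation: a small effect (d = 0.2) with a hypothetical
# 30 participants per condition in a single lab vs. the pooled sample of 24 labs.
import numpy as np
from scipy.stats import t, nct

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sample t-test (ignores the negligible lower tail)."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)   # noncentrality parameter of the t statistic
    t_crit = t.ppf(1 - alpha / 2, df)
    return nct.sf(t_crit, df, ncp)

print(two_sample_power(0.2, 30))        # single lab: roughly .11
print(two_sample_power(0.2, 24 * 30))   # pooled across 24 labs: above .95
```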

Critics of Baumeister’s research practices (Schimmack) felt vindicated by this stunning failure. However, even proponents of ego-depletion theory (Inzlicht) acknowledged that ego-depletion theory lacks a strong empirical foundation and that it is not clear what 20 years of research on ego-depletion have taught us about human self-control.

Not so, Roy Baumeister.  Like a bank that is too big to fail, Baumeister defended ego-depletion as a robust empirical finding and blamed the replication team for the negative outcome.  Although he was consulted and approved the design of the study, he later argued that the experimental task was unsuitable to induce ego-depletion. It is not hard to see the circularity in Baumeister’s argument.  If a study produces a positive result, the manipulation of ego-depletion was successful. If a study produces a negative result, the experimental manipulation failed. The theory is never being tested because it is taken for granted that the theory is true. The only empirical question is whether an experimental manipulation was successful.

Baumeister also claimed that his own lab has been able to replicate the effect many times, without explaining the strong evidence for publication bias in the ego-depletion literature and the results of a meta-analysis that showed results from his own lab are no different from results from other labs.

A related article by Baumeister in a special issue on the replication crisis in psychology was another highlight in 2016.  In this article, Baumeister introduced the concept of FLAIR.

[Image: Scientist with FLAIR]

Baumeister writes “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants. Over the years the norm crept up to about n = 20. Now it seems set to leap to n = 50 or more.” (JESP, 2016, p. 154). He misses the good old days and suggests that the old system rewarded researchers with flair. “Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.” (JESP, 2016, p. 156).

This quote explains the low replication rate in social psychology and the failure to replicate ego-depletion effects. It is simply not possible to conduct studies with n = 10 and be successful in most studies, because empirical studies in psychology are subject to sampling error. Each study with n = 10 on a new sample of participants will produce dramatically different results, because samples of n = 10 are very different from each other. This is a fundamental fact of empirical research that appears to elude one of the most successful empirical social psychologists. So, a researcher with FLAIR may set up a clever experiment with a strong manipulation (e.g., having participants smell chocolate cookies and eat radishes instead) and get a significant result. But this is not a replicable finding. For every study with flair that worked, there are numerous studies that did not work. However, researchers with flair ignore these failed studies, focus on the studies that worked, and then use those studies for publication. It can be shown statistically that they do so, as I did with Baumeister’s glucose studies (Schimmack, 2012) and Baumeister’s ego-depletion studies in general (Schimmack, 2016). So, a researcher who gets significant results with small samples (n = 10) surely has FLAIR (False, Ludicrous, And Incredible Results).
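
A small simulation makes the point concrete. The true effect size below (d = .5) is a hypothetical assumption chosen only for illustration; nothing in Baumeister's article specifies it:

```python
# Simulation sketch: how variable are study outcomes with n = 10 per cell,
# assuming a true medium-sized effect (d = 0.5)?
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
d_true, n, n_sims = 0.5, 10, 10_000
observed_d, n_significant = [], 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d_true, 1.0, n)
    _, p = ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    observed_d.append((treatment.mean() - control.mean()) / pooled_sd)
    n_significant += p < 0.05

print(np.percentile(observed_d, [2.5, 97.5]))  # roughly -0.4 to 1.4: wildly different results
print(n_significant / n_sims)                  # only about 1 in 5 studies reaches p < .05
```

Under these assumptions, roughly four out of five attempts fail, which is the gap that the file-drawer has to absorb.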

Baumeister’s article contained additional insights into the research practices that fueled a highly productive and successful career. For example, he distinguishes between researchers who report boring true positive results and interesting researchers who publish interesting false positive results. He argues that science needs both types of researchers. Unfortunately, most people assume that scientists prioritize truth, which is the main reason for subjecting theories to empirical tests. But scientists with FLAIR get positive results even when their interesting ideas are false (Bem, 2011).

Baumeister mentions psychoanalysis as an example of interesting psychology. What could be more interesting than the Freudian idea that every boy goes through a phase where he wants to kill daddy and make love to mommy? Interesting stuff, indeed, but this idea has no scientific basis. In contrast, twin studies suggest that many personality traits, values, and abilities are partially inherited. To reveal this boring fact, it was necessary to recruit large samples of thousands of twins. That is not something a psychologist with FLAIR can handle. “When I ran my own experiments as a graduate student and young professor, I struggled to stay motivated to deliver the same instructions and manipulations through four cells of n=10 each. I do not know how I would have managed to reach n=50. Patient, diligent researchers will gain, relative to others” (Baumeister, JESP, 2016, p. 156). So, we may see the demise of researchers with FLAIR, and diligent and patient researchers who listen to their data may take their place. Now there is something to look forward to in 2017.

[Image: Scientist without FLAIR]

5. No Laughing Matter: Replication Failure of Facial Feedback Paradigm

A second Registered Replication Report (RRR) delivered another blow to the establishment. This project replicated a classic study on the facial-feedback hypothesis. Like other peripheral emotion theories, facial-feedback theories assume that experiences of emotions depend (fully or partially) on bodily feedback. That is, we feel happy because we smile rather than smiling because we are happy. Numerous studies had examined the contribution of bodily feedback to emotional experience and the evidence was mixed. Moreover, studies that found effects had a major methodological problem. Simply asking participants to smile might make them think happy thoughts, which could elicit positive feelings. In the 1980s, social psychologist Fritz Strack invented a procedure that solved this problem (see Slate article). Participants are deceived into believing that they are testing a procedure that allows handicapped people to complete a questionnaire by holding a pen in their mouth. Participants who hold the pen with their lips are activating muscles that are activated during sadness. Participants who hold the pen with their teeth activate muscles that are activated during happiness. Thus, randomly assigning participants to one of these two conditions made it possible to manipulate facial muscles without making participants aware of the associated emotion. Strack and colleagues reported two experiments that showed effects of the experimental manipulation. Or did they? It depends on the statistical test being used.

[Screenshot: Slate article on the facial-feedback replication]

Experiment 1 had three conditions. The control group did the same study without manipulation of the facial muscles. The dependent variable was funniness ratings of cartoons. The mean funniness of cartoons was highest in the smile condition, followed by the control condition, with the lowest mean in the frown condition. However, a commonly used Analysis of Variance would not have produced a significant result. A two-tailed t-test also would not have produced a significant result. However, a linear contrast with a one-tailed t-test produced a just-significant result, t(89) = 1.85, p = .03. So, Fritz Strack was rather lucky to get a significant result. Sampling error could have easily changed the pattern of means slightly, and even the directional test of the linear contrast would not have been significant. At the same time, sampling error might have worked against the facial feedback hypothesis and the real effect might be stronger than this study suggests. In that case, we would expect to see stronger evidence in Study 2. However, Study 2 failed to show any effect on funniness ratings of cartoons. “As seen in Table 2, subjects’ evaluations of the cartoons were hardly affected under the different experimental conditions. The ANOVA showed no significant main effects or interactions, all ps > .20” (Strack et al., 1988). However, Study 2 also included amusement ratings, and the amusement ratings once more showed a just-significant result with a one-tailed t-test, t(75) = 1.78, p = .04. The article also provides an explanation for the just-significant result in Study 1, even though Study 1 used funniness ratings of cartoons. When participants are not asked to differentiate between their subjective feelings of amusement and the objective funniness of cartoons, subjective feelings influence ratings of funniness, but given a chance to differentiate between the two, subjective feelings no longer influence funniness ratings.
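
The reported p-values can be checked directly from the quoted test statistics (a quick sanity check using only the numbers cited above):

```python
# Recomputing the one- and two-tailed p-values from the quoted t statistics.
from scipy.stats import t

print(t.sf(1.85, 89))      # Study 1, linear contrast, one-tailed: p ≈ .03
print(2 * t.sf(1.85, 89))  # the same contrast two-tailed: p ≈ .07, not significant
print(t.sf(1.78, 75))      # Study 2, amusement ratings, one-tailed: p ≈ .04
```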

For 25 years, this article was uncritically cited as evidence for the facial feedback hypothesis, but none of the 17 labs that participated in the RRR were able to produce a significant result. More important, even an analysis with the combined power of all studies failed to detect an effect.  Some critics pointed out that this result successfully replicates the finding of the original two studies that also failed to report statistically significant results by conventional standards of a two-tailed test (or z > 1.96).

Given the shaky evidence in the original article, it is not clear why Fritz Strack volunteered his study for a replication attempt.  However, it is easier to understand his response to the results of the RRR.  He does not take the results seriously.  He rather believes his two original, marginally significant, studies than the 17 replication studies.

“Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.”  (Slate).

One of the most bizarre statements by Strack can only be interpreted as revealing a shocking lack of understanding of probability theory.

“So when Strack looks at the recent data he sees not a total failure but a set of mixed results. Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed? Maybe there’s a reason why half the labs could not elicit the effect.” (Slate).

This is like a roulette player who, after a night of gambling, sees roughly half wins and half losses and ponders why half of the attempts produced losses. Strack does not seem to realize that results of individual studies move simply by chance, just like roulette balls produce different results by chance. Some people find cartoons funnier than others, and the mean will depend on the allocation of these individuals to the different groups. This is called sampling error, and this is why we need to do statistical tests in the first place. And apparently it is possible to become a famous social psychologist without understanding the purpose of computing and reporting p-values.

And the full force of defense mechanisms is apparent in the next statement.  “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. (Slate).

No, there were not eight non-replications. There were 17!  We would expect half of the studies to match the direction of the original effect simply due to chance alone.
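
A quick binomial check shows why a 9-to-8 split across 17 labs is exactly what chance predicts when there is no effect:

```python
# Under the null hypothesis, each lab's direction is essentially a coin flip.
from scipy.stats import binom

n_labs = 17
print(binom.mean(n_labs, 0.5))   # expected number of labs in the "right" direction: 8.5
print(binom.sf(8, n_labs, 0.5))  # P(9 or more labs in the right direction) = 0.5
```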

But this is not all.  Strack even accused the replication team of “reverse p-hacking.” (Strack, 2016).  The term p-hacking was coined by Simmons et al. (2011) to describe a set of research practices that can be used to produce statistically significant results in the absence of a real effect (fabricating false positives).  Strack turned it around and suggested that the replication team used statistical tricks to make the facial feedback effect disappear.  “Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.” (p. 930).

However, the statistical anomaly that requires explanation could just be sampling error (Hilgard), and the observed pattern is actually the wrong one for a claim of reverse p-hacking. Reverse p-hacking implies that some studies did produce a significant result, but statistical tricks were used to report the result as non-significant. This would lead to a restriction in the variability of results across studies, which can be detected with the Test of Insufficient Variance (Schimmack, 2015), but there is no evidence for reverse p-hacking in the RRR.

Fritz Strack also tried to make his case on social media, but there was very little support for his view that 17 failed replication studies can be ignored (PsychMAP thread).

[Screenshot: Strack's comments in the PsychMAP Facebook group]

Strack’s desperate attempts to defend his famous original study in the light of a massive replication failure provide further evidence for the inability of the psychology establishment to face the reality that many celebrated discoveries in psychology rest on shaky evidence and a mountain of repressed failed studies.

Meanwhile, the Test of Insufficient Variance provides a simple explanation for the replication failure, namely that the original results were rather unlikely to occur in the first place. Converting the observed t-values into z-scores shows very low variability, Var(z) = 0.003. The probability of observing a variance this small or smaller in a pair of studies is only p = .04. It is just not very likely for such an improbable event to repeat itself.
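
For readers who want to check this, here is a minimal sketch of the TIVA logic applied to the two original results quoted above; it follows the steps described in the text and is not Schimmack's published code:

```python
# Test of Insufficient Variance (TIVA) for the two original facial-feedback tests:
# t(89) = 1.85 and t(75) = 1.78.
import numpy as np
from scipy.stats import t, norm, chi2

t_values, dfs = [1.85, 1.78], [89, 75]

# Convert each result to a two-tailed p-value and then to a z-score
p_two = [2 * t.sf(tv, df) for tv, df in zip(t_values, dfs)]
z = norm.isf(np.array(p_two) / 2)

var_z = np.var(z, ddof=1)                      # observed variance of z-scores (expected ~1)
k = len(z)
p_tiva = chi2.cdf(var_z * (k - 1), df=k - 1)   # chance of variance this small or smaller
print(round(var_z, 3), round(p_tiva, 2))       # ≈ 0.003 and ≈ .04, as reported above
```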

6. Insufficient Power in Power-Posing Research

When you google “power posing” the featured link shows Amy Cuddy giving a TED talk about her research. Not unlike facial feedback, power posing assumes that bodily feedback can have powerful effects.

[Image: Amy Cuddy giving her TED talk on power posing]

When you scroll down the page, you might find a link to an article by Gelman and Fung (Slate).

Gelman has been an outspoken critic of social psychology for some time.  This article is no exception. “Some of the most glamorous, popular claims in the field are nothing but tabloid fodder. The weakest work with the boldest claims often attracts the most publicity, helped by promotion from newspapers, television, websites, and best-selling books.”

[Image: Wonder Woman power pose]

They point out that a much larger study than the original study failed to replicate the original findings.

“An outside team led by Eva Ranehill attempted to replicate the original Carney, Cuddy, and Yap study using a sample population five times larger than the original group. In a paper published in 2015, the Ranehill team reported that they found no effect.”

They have little doubt that the replication study can be trusted and suggest that the original results were obtained with the help of questionable research practices.

“We know, though, that it is easy for researchers to find statistically significant comparisons even in a single, small, noisy study. Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.”

The replication study was published in 2015, so this replication failure does not really belong in a review of 2016. Indeed, the big news in 2016 was that Cuddy’s co-author Carney distanced herself from her contribution to the power posing article. Her public rejection of her own work (New Yorker Magazine) spread like wildfire through social media (Psych Methods FB Group Posts 1, 2, but see 3). Most responses were very positive. Although science is often considered a self-correcting system, individual scientists rarely correct mistakes or retract articles if they discover a mistake after publication. Carney’s statement was seen as breaking with the implicit norm of the establishment to celebrate every published article as an important discovery and to cover up mistakes even in the face of replication failures.

[Screenshot: Carney's public statement on power posing]

Not surprisingly, the proponent of power posing, Amy Cuddy, defended her claims about power posing. Her response makes many points, but there is one glaring omission. She does not mention the evidence that published results are selected to confirm theoretical claims, and she does not comment on the fact that there is no evidence for power posing after correcting for publication bias. The psychology establishment also appears to be more interested in propping up a theory that has created a lot of publicity for psychology rather than critically examining the scientific evidence for or against power posing (APS Annual Meeting, 2017, Program, Presidential Symposium).

7. Commitment Priming: Another Failed Registered Replication Report

Many research questions in psychology are difficult to study experimentally. For example, it seems difficult and unethical to study the effect of infidelity on romantic relationships by assigning one group of participants to an infidelity condition and making them engage in non-marital sex. Social psychologists have developed a solution to this problem. Rather than creating real situations, participants are primed to think about infidelity. If these thoughts change their behavior, the results are interpreted as evidence for the effect of real infidelity.

Eli Finkel and colleagues used this approach to experimentally test the effect of commitment on forgiveness. To manipulate commitment, participants in the experimental group were given some statements that were supposed to elicit commitment-related thoughts. To make sure that this manipulation worked, participants then completed a commitment measure. In the original article, the experimental manipulation had a strong effect, d = .74, which was highly significant, t(87) = 3.43, p < .001. Irene Cheung, Lorne Campbell, and Etienne P. LeBel spearheaded an initiative to replicate the experimental effect of commitment priming on forgiveness. Eli Finkel closely worked with the replication team to ensure that the replication study replicated the original study as closely as possible. Yet, the replication studies failed to demonstrate the effectiveness of the commitment manipulation. Even with the much larger sample size, there was no significant effect and the effect size was close to zero.

The authors of the replication report were surprised by the failure of the manipulation. “It is unclear why the RRR studies observed no effect of priming on subjective commitment when the original study observed a large effect. Given the straightforward nature of the priming manipulation and the consistency of the RRR results across settings, it seems unlikely that the difference resulted from extreme context sensitivity or from cohort effects (i.e., changes in the population between 2002 and 2015).” (PPS, 2016, p. 761). The author of the original article, Eli Finkel, also has no explanation for the failure of the experimental manipulation. “Why did the manipulation that successfully influenced commitment in 2002 fail to do so in the RRR? I don’t know.” (PPS, 2016, p. 765).

However, Eli Finkel also reports that he made changes to the manipulation in subsequent studies. “The RRR used the first version of a manipulation that has been refined in subsequent work. Although I believe that the original manipulation is reasonable, I no longer use it in my own work. For example, I have become concerned that the “low commitment” prime includes some potentially commitment-enhancing elements (e.g., “What is one trait that your partner will develop as he/she grows older?”). As such, my collaborators and I have replaced the original 5-item primes with refined 3-item primes (Hui, Finkel, Fitzsimons, Kumashiro, & Hofmann, 2014). I have greater confidence in this updated manipulation than in the original 2002 manipulation. Indeed, when I first learned that the 2002 study would be the target of an RRR—and before I understood precisely how the RRR mechanism works—I had assumed that it would use this updated manipulation.” (PPS, 2016, p. 766). Surprisingly, the potential problem with the original manipulation was never brought up during the planning of the replication study (FB discussion group).

[Screenshot: Facebook discussion of the commitment-priming replication]

Hui et al. (2014) also do not mention any concerns about the original manipulation. They simply wrote “Adapting procedures from previous research (Finkel et al., 2002), participants in the high commitment prime condition answered three questions designed to activate thoughts regarding dependence and commitment.” (JPSP, 2014, p. 561). The results of the manipulation check closely replicated the results of the 2002 article. “The analysis of the manipulation check showed that participants in the high commitment prime condition (M = 4.62, SD = 0.34) reported a higher level of relationship commitment than participants in the low commitment prime condition (M = 4.26, SD = 0.62), t(74) = 3.11, p < .01.” (JPSP, 2014, p. 561). The study also produced a just-significant result for a predicted effect of the manipulation on support for a partner’s goals that are incompatible with the relationship, beta = .23, t(73) = 2.01, p = .05. Such just-significant results are rare and often fail to replicate in replication studies (OSC, Science, 2016).

Altogether the results of yet another registered replication report raise major concerns about the robustness of priming as a reliable method to alter participants’ beliefs and attitudes. Selective reporting of studies that “worked” has created an illusion that priming is a very effective and reliable method to study social cognition. However, even social cognition theories suggest that priming effects should be limited to specific situations and should not have strong effects on judgments that are highly relevant and for which chronically accessible information is readily available.

8. Concluding Remarks

Looking back, 2016 has been a good year for the reform movement in psychology. High-profile replication failures have shattered the credibility of established psychology. Attempts by the establishment to discredit critics have backfired. A major problem for the establishment is that they themselves do not know how big the crisis is and which findings are solid. Consequently, there has been no major initiative by the establishment to mount replication projects that provide positive evidence for some important discoveries in psychology. Looking forward to 2017, I anticipate no major changes. Several registered replication studies are in the works, and prediction markets anticipate further failures. For example, a registered replication report of “professor priming” studies is predicted to produce a null result.

[Screenshot: prediction market forecast for the professor-priming replication report]

If you are still looking for a New Year’s resolution, you may consider signing on to Brent W. Roberts, Rolf A. Zwaan, and Lorne Campbell’s initiative to improve research practices. You may also want to become a member of the Psychological Methods Discussion Group, where you can find out in real time about major events in the world of psychological science.

Have a wonderful new year.

 

 


Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Jerry Brunner and I developed two methods to estimate the replicability of published results based on test statistics in original studies. One method, z-curve, is used to provide replicability estimates in my powergraphs.

In September, we submitted a manuscript that describes these methods to Psychological Methods, where it was rejected.

We have now revised the manuscript. The new manuscript contains a detailed discussion of various criteria for replicability, with arguments for why a significant result in an exact replication study is an important, if not the only, criterion to evaluate the outcome of replication studies.

It also makes a clear distinction between selection for significance in an original study and the file-drawer problem in a series of conceptual or exact replication studies. Our method only assumes selection for significance in original studies, but no file-drawer or questionable research practices. This idealistic assumption may explain why our model predicts a much higher success rate in the OSC reproducibility project (66%) than was actually obtained (36%). As there is ample evidence of file-drawers with non-significant conceptual replication studies, we believe that file-drawers and QRPs contribute to the low success rate in the OSC project. However, we also mention concerns about the quality of some replication studies.
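
The consequences of selection for significance can be illustrated with a toy simulation (my own sketch for intuition, not the z-curve method itself): generate studies with heterogeneous power, let only the significant ones be published, and compare the journals' success rate with the average power of the published studies, which is what exact replications could be expected to achieve.

```python
# Toy illustration of selection for significance (not the actual z-curve model).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
n_studies = 100_000
true_power = rng.uniform(0.05, 0.95, n_studies)   # heterogeneous power across studies
z_crit = norm.isf(0.025)                          # two-tailed significance criterion
ncp = z_crit + norm.ppf(true_power)               # noncentrality that yields that power
z_obs = rng.normal(ncp, 1.0)                      # observed z-scores
published = z_obs > z_crit                        # only significant results get published

print(true_power.mean())              # average power of all studies that were run (~.50)
print(published.mean())               # fraction of studies that "worked" (~.50)
print(true_power[published].mean())   # average power of the published studies (~.63):
                                      # the success rate exact replications could achieve,
                                      # far below the 100% success rate shown in journals
```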

We hope that the revised version is clearer, but fundamentally nothing has changed. Reviewers at Psychological Methods didn’t like our paper, the editor thought NHST is no longer relevant (see editorial letter and reviews), but nobody challenged our statistical method or the results of our simulation studies that validate the method. It works and it provides an estimate of replicability under very idealistic conditions, which means we can only expect a considerably lower success rate in actual replication studies as long as researchers file-drawer non-significant results.

 

How did Diederik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Diederik Stapel represents everything that has gone wrong in experimental social psychology. Until 2011, he was seen as a successful scientist who made important contributions to the literature on social priming. In the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” he presented 8 studies that showed that social comparisons can occur in response to stimuli that were presented without awareness (subliminally). The results were published in the top social psychology journal of the American Psychological Association (APA), and APA issued a press release about this work for the general public.
In 2011, an investigation into Diederik Stapel’s research practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice). In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated. All we know is that he used a number of different methods that range from making up datasets to the use of questionable research practices that increase the chance of producing a significant result. These practices are widely used and are not considered scientific fraud, although the end result is the same: published results no longer provide credible empirical evidence for the claims made in a published article.
I had two hypotheses. First, the data could be entirely made up. When researchers make up fake data they are likely to overestimate the real effect sizes and produce data that show the predicted pattern much more clearly than real data would. In this case, bias tests would not show a problem with the data.  The only evidence that the data are fake would be that the evidence is stronger than in other studies that relied on real data.
In contrast, a researcher who starts with real data and then uses questionable practices is likely to use as few dishonest practices as possible, because this makes it easier to justify the questionable decisions. For example, removing 10% of the data may seem justified, especially if some rationale for exclusion can be found. However, removing 60% of the data cannot be justified. The researcher will need to use these practices just enough to produce the desired outcome, namely a p-value below .05 (or at least very close to .05). Because further use of questionable practices is not needed and would be harder to justify, the researcher stops there rather than producing stronger evidence. As a result, we would expect a large number of just-significant results.
There are two bias tests that detect the latter form of fabricating significant results by means of questionable statistical methods: the Replicability-Index (R-Index) and the Test of Insufficient Variance (TIVA). If Stapel used questionable statistical practices to produce just-significant results, the R-Index and TIVA would show evidence of bias.
The article reported 8 studies. The table shows the key finding of each study.
Study Statistic p z OP
1 F(1,28)=4.47 0.044 2.02 0.52
2A F(1,38)=4.51 0.040 2.05 0.54
2B F(1,32)=4.20 0.049 1.97 0.50
2C F(1,38)=4.13 0.049 1.97 0.50
3 F(1,42)=4.46 0.041 2.05 0.53
4 F(2,49)=3.61 0.034 2.11 0.56
5 F(1,29)=7.04 0.013 2.49 0.70
6 F(1,55)=3.90 0.053 1.93 0.49
All results were interpreted as evidence for an effect and the p-value for Study 6 was reported as p = .05.
All p-values are below .053 but greater than .01.  This is an unlikely outcome because sampling error should produce more variability in p-values.  TIVA examines whether there is insufficient variability.  First, p-values are converted into z-scores.  The variance of z-scores due to sampling error alone is expected to be approximately 1.  However, the observed variance is only Var(z) = 0.032.  A chi-square test shows that this observed variance is unlikely to occur by chance alone,  p = .00035. We would expect such an extremely small variability or even less variability in only 1 out of 2857 sets of studies by chance alone.
The last column transforms z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size. These estimates are influenced by sampling error. To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the median. The median is 53%. It is unlikely that a set of 8 studies with a 53% chance of obtaining a significant result produced significant results in all studies. This finding shows that the reported success rate is not credible. To make matters worse, the probability of obtaining a significant result is inflated when a set of studies contains too many significant results. To correct for this bias, the R-Index computes the inflation rate. With a 53% probability of success and a 100% success rate, the inflation rate is 47%. To correct for inflation, the inflation rate is subtracted from the median observed power, which yields an R-Index of 53% – 47% = 6%. Based on this value, it is extremely unlikely that a researcher would obtain a significant result if they actually replicated the original studies exactly. The published results show that Stapel could not have produced these results without the help of questionable methods, which also means nobody else can reproduce these results.
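The computation can be reproduced from the test statistics in the table. The sketch below follows the steps described in the text (convert p-values to z-scores, compute observed power, take the median, subtract the inflation rate); it is not Schimmack's published R-Index code:

```python
# R-Index sketch for the eight Stapel results listed in the table above.
import numpy as np
from scipy.stats import f, norm

f_values = [4.47, 4.51, 4.20, 4.13, 4.46, 3.61, 7.04, 3.90]
dfs      = [(1, 28), (1, 38), (1, 32), (1, 38), (1, 42), (2, 49), (1, 29), (1, 55)]

p = np.array([f.sf(fv, d1, d2) for fv, (d1, d2) in zip(f_values, dfs)])
z = norm.isf(p / 2)                         # convert two-tailed p-values to z-scores
obs_power = norm.cdf(z - norm.isf(0.025))   # observed power: P(significant | ncp = z)

median_power = np.median(obs_power)         # median observed power
success_rate = 1.0                          # all eight results were presented as successes
inflation = success_rate - median_power     # inflation rate
r_index = median_power - inflation          # R-Index
print(median_power, inflation, r_index)     # close to the 53%, 47%, and 6% reported above
```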
In conclusion, bias tests suggest that Stapel actually collected data and failed to find supporting evidence for his hypotheses.  He then used questionable practices until the results were statistically significant.  It seems unlikely that he outright faked these data and intentionally produced a p-value of .053 and reported it as p = .05.  However, statistical analysis can only provide suggestive evidence and only Stapel knows what he did to get these results.

A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

“Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology”
Journal of Experimental Social Psychology 66 (2016) 148–152
DOI: http://dx.doi.org/10.1016/j.jesp.2016.01.005

a.k.a The Swan Song of Social Psychology During the Golden Age

Disclaimer: I wrote this piece because Jamie Pennebaker recommended writing as therapy to deal with trauma. However, in his defense, he didn’t propose publishing the therapeutic writings.

————————————————————————-

You might think an article with reproducibility in the title would have something to say about the replicability crisis in social psychology. However, this article has very little to say about the causes of the replication crisis in social psychology and possible solutions to improve replicability. Instead, it appears to be a perfect example of repressive coping to avoid the traumatic realization that decades of work were fun, yet futile.

1. Introduction

The authors start with a very sensible suggestion. “We propose that the goal of achieving sound scientific insights and useful applications will be better facilitated over the long run by promoting good scientific practice rather than by stressing the need to prevent any and all mistakes.” (p. 149). The only questions are how many mistakes we consider tolerable and how we would know what the error rates actually are. Rosenthal pointed out it could be 100%, which even the authors might consider to be a little bit too high.

2. Improving research practice

In this chapter, the authors suggest that “if there is anything on which all researchers might agree, it is the call for improving our research practices and techniques.” (p. 149).  If this were the case, we wouldn’t see articles in 2016 that make statistical mistakes that have been known for decades like pooling data from a heterogeneous set of studies or computing difference scores and using one of the variables as a predictor of the difference score.

It is also puzzling to read “the contemporary literature indicates just how central methodological innovation has been to advancing the field” (p. 149), when the key problem of low power has been known since 1962 and there is still no sign of improvement.

The authors are also not exactly in favor of adopting better methods when these methods might reveal major problems in older studies. For example, a meta-analysis in 2010 might not have examined publication bias and produced an effect size of more than half a standard deviation, while a new method that controls for publication bias finds that it is impossible to reject the null-hypothesis. No, these new methods are not welcome. “In our view, they will stifle progress and innovation if they are seen primarily through the lens of maladaptive perfectionism; namely as ways of rectifying flaws and shortcomings in prior work.” (p. 149). So, what is the solution? Let’s pretend that subliminal priming made people walk slower in 1996, but stopped working in 2011?

This ends the chapter of improving research practice.  Yes, that is the way to deal with a crisis.  When the city is bankrupt, cut back on the Christmas decorations. Problem solved.

3. How to think about replications

Let’s start with a trivial statement that is as meaningless as saying we would welcome more funding: “Replications are valuable.” (p. 149). Let’s also not mention that social psychologists have been the leaders in requesting replication studies. No single-study article shall be published in a social psychology journal. A minimum of three studies with conceptual replications of the key finding is needed to show that the results are robust and always produce significant results with p < .05 (or at least p < .10). Yes, no other science has cherished replications as much as social psychology.

And eminent social psychologists Crandall and Sherman explain why: “to be a cumulative and self-correcting enterprise, replications, be their results supportive, qualifying, or contradictory, must occur.” Indeed, but what explains the 95% success rate of published replications in social psychology? No need for self-correction, if the predictions are always confirmed.

Surprisingly, however, since 2011 a number of replication studies have been published in obscure journals that fail to replicate results. This has never happened before and raises some concerns. What is going on here? Why can these researchers not replicate the original results? The answer is clear. They are doing it wrong. “We concur with several authors (Crandall and Sherman, Stroebe) that conceptual replications offer the greatest potential to our field… Much of the current debate, however, is focused narrowly on direct or exact replications.” (p. 149). As philosophers know, you cannot step into the same river twice, and so you cannot replicate the same study again. To get a significant result, you need to do a similar, but not an identical, replication study.

Another problem with failed replication studies is that these researchers assume that they are doing an exact replication study, but do not test this assumption. “In this light, Fabrigar’s insistence that researchers take more care to demonstrate psychometric invariance is well-placed” (p. 149). Once more, the superiority of conceptual replication studies is self-evident. When you do a conceptual replication study, psychometric invariance is guaranteed and does not have to be demonstrated. Just one more reason why conceptual replication studies in social psychology journals produce a 95% success rate, whereas misguided exact replication attempts have failure rates of over 50%.

It is also important to consider the expertise of researchers.  Social psychologists often have demonstrated their expertise by publishing dozens of successful, conceptual replications.  In contrast, failed replications are often produced by novices with no track-record of ever producing a successful study.  These vast differences in previous success rate need to be taken into account in the evaluation of replication studies.  “Errors caused by low expertise or inadvertent changes are often catastrophic, in the sense of causing a study to fail completely, as Stroebe nicely illustrates.”

It would be a shame if psychology started rewarding these replication studies. Already limited research funds would be diverted to replication studies that are easy to do, yet too difficult for inexperienced researchers to do correctly, and away from senior researchers who do difficult novel studies that always work and produced groundbreaking new insights into social phenomena during the “golden age” (p. 150) of social psychology.

The authors also point out that failed studies are rarely failed studies. When these studies are properly combined with successful studies in a meta-analysis, the results nearly always show the predicted effect, and it was wrong to doubt original studies simply because replication studies failed to show the effect. “Deeper consideration of the terms “failed” and “underpowered” may reveal just how limited the field is by dichotomous thinking. “Failed” implies that a result at p = .06 is somehow inferior to one at p = .05, a conclusion that scarcely merits disputation.” (p. 150).

In conclusion, we learn nothing from replication studies. They are a waste of time and resources and can only impede further development of social psychology by means of conceptual replication studies that build on the foundations laid during the “golden age” of social psychology.

4. Differential demands of different research topics

Some studies are easier to replicate than others, and replication failures might be “limited to studies that presented methodological challenges (i.e., that had protocols that were considered difficult to carry out) and that provided opportunities for experimenter bias” (p. 150). It is therefore better not to replicate difficult studies, or to let original authors with a track record of success conduct conceptual replication studies.

Moreover, some people have argued that the high success rate of original studies is inflated by publication bias (not writing up failed studies) and the use of questionable research practices (running more participants until p < .05). To ensure that reported successes are real successes, some initiatives call for data sharing, pre-registration of data analysis plans, and a priori power analysis. Although these may appear to be reasonable suggestions, the authors disagree. “We worry that reifying any of the various proposals as a “best practice” for research integrity may marginalize researchers and research areas that study phenomena or use methods that have a harder time meeting these requirements.” (p. 150).

They appear to be concerned that researchers who do not preregister data analysis plans or do not share data may be stigmatized. “If not, such principles, no matter how well-intentioned, invite the possibility of discrimination, not only within the field but also by decision-makers who are not privy to these realities.” (p. 150).

5. Considering broader implications

These are confusing times. In the old days, the goal of research was clearly defined: conduct at least three, loosely related, successful studies and write them up with a good story. During these times, it was not acceptable to publish failed studies, in order to maintain the 95% success rate. This made it hard for researchers who did not understand the rules of publishing only significant results. “Recently, a colleague of ours relayed his frustrating experience of submitting a manuscript that included one null-result study among several studies with statistically significant findings. He was met with rejection after rejection, all the while being told that the null finding weakened the results or confused the manuscript” (p. 151).

It is not clear what researchers should be doing now. Should they now report all of their studies, the good, the bad, and the ugly, or should they continue to present only the successful studies? What if some researchers continue to publish the good old-fashioned way that evolved during the golden age of social psychology and others try to publish results more in accordance with what actually happened in their lab? “There is currently, a disconnect between what is good for scientists and what is good for science,” and nobody is going to change while researchers who report only significant results get rewarded with publications in top journals.

 

 

 

 

 

There may also be little need to make major changes. “We agree with Crandall and Sherman, and also Stroebe, that social psychology is, like all sciences, a self-correcting enterprise” (p. 151). And if social psychology is already self-correcting, it does not need new guidelines on how to do research and new replication studies. Rather than instituting new policies, it might be better to make social psychology great again. Rather than publishing means and standard deviations or test statistics that allow data detectives to check results, it might be better to report only whether a result was significant, p < .05, and because 95% of studies are significant and the others are failed studies, we might simply not report any numbers. False results will be corrected eventually because they will no longer be reported in journals, and the old results might have been true even if they fail to replicate today. The best approach is to fund researchers with a good track record of success and let them publish in the top journals.

 

Most likely, the replication crisis only exists in the imagination of overly self-critical psychologists. “Social psychologists are often reputed to be among the most severe critics of work within their own discipline” (p. 151).  A healthier attitude is to realize that “we already know a lot; with these practices, we can learn even more” (p. 151).

So, let’s get back to doing research and forget this whole thing that was briefly mentioned in the title called “concerns about reproducibility.” Who cares that only 25% of social psychology studies from 2008 could be replicated in 2014? In the meantime, thousands of new discoveries were made and it is time to make more new discoveries. “We should not get so caught up in perfectionistic concerns that they impede the rapid accumulation and dissemination of research findings” (p. 151).

There you have it folks. Don’t worry about recent failed replications. This is just a normal part of science, especially a science that studies fragile, contextually sensitive phenomena. The results from 2008 do not necessarily replicate in 2014, and the results from 2014 may not replicate in 2018. What we need is fewer replications. We need permanent research, because many effects may disappear the moment they are discovered. This is what makes social psychology so exciting. If you want to study stable phenomena that replicate decade after decade, you might as well become a personality psychologist.

 

 

 

 

A replicability analysis of “I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dijksterhuis, A. (2004). I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning. Journal of Personality and Social Psychology, 86(2), 345-355.

DOI: 10.1037/0022-3514.86.2.345

There are a lot of articles with questionable statistical results and it seems pointless to single out particular articles.  However, once in a while, an article catches my attention and I will comment on the statistical results in it.  This is one of these articles….

The format of this review highlights why articles like this passed peer-review and are cited at high frequency as if they provided empirical facts. The reason is a phenomenon called “verbal overshadowing.” In work on eye-witness testimony, participants first see the picture of a perpetrator. Before the actual line-up task, they are asked to give a verbal description of the face. The verbal description can distort the memory of the actual face and lead to a higher rate of misidentifications. Something similar happens when researchers read articles. Sometimes they only read abstracts, but even when they read the article, the words can overshadow the actual empirical results. As a result, memory is more strongly influenced by verbal descriptions than by the cold and hard statistical facts.

In the first part, I will present the results of the article verbally without numbers. In the second part, I will present only the numbers.

Part 1:

In the article “I Like Myself but I Don’t Know Why: Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning,” Ap Dijksterhuis reports the results of six studies (1-4, 5a, 5b).  All studies used a partially or fully subliminal evaluative conditioning task to influence implicit measures of self-esteem. The abstract states: “Participants were repeatedly presented with trials in which the word I was paired with positive trait terms. Relative to control conditions, this procedure enhanced implicit self-esteem.”  Study 1 used preferences for initials to measure implicit self-esteem, and “results confirmed the hypothesis that evaluative conditioning enhanced implicit self-esteem” (p. 348). Study 2 modified the control condition and showed that “participants in the conditioned self-esteem condition showed higher implicit self-esteem after the treatment than before the treatment, relative to control participants” (p. 348).  Experiment 3 changed the evaluative conditioning procedure: now both the CS and the US (positive trait terms) were presented subliminally for 17 ms.  It also used the Implicit Association Test to measure implicit self-esteem.  The results showed that “difference in response latency between blocks was much more pronounced in the conditioned self-esteem condition, indicating higher self-esteem” (p. 349).  Study 4 also showed that “participants in the conditioned self-esteem condition exhibited higher implicit self-esteem than participants in the control condition” (p. 350).  Studies 5a and 5b showed that “individuals whose self-esteem was enhanced seemed to be insensitive to personality feedback, whereas control participants whose self-esteem was not enhanced did show effects of the intelligence feedback” (p. 352).  The General Discussion summarizes the results: “In our experiments, implicit self-esteem was enhanced through subliminal evaluative conditioning. Pairing the self-depicting word I with positive trait terms consistently improved implicit self-esteem” (p. 352).  A final conclusion section points out the potential of this work for enhancing self-esteem: “It is worthwhile to explicitly mention an intriguing aspect of the present work. Implicit self-esteem can be enhanced, at least temporarily, subliminally in about 25 seconds” (p. 353).

 

Part 2:

Study  Statistic      p      z     OP (observed power)
1      F(1,76)=5.15   0.026  2.22  0.60
2      F(1,33)=4.32   0.046  2.00  0.52
3      F(1,14)=8.84   0.010  2.57  0.73
4      F(1,79)=7.45   0.008  2.66  0.76
5a     F(1,89)=4.91   0.029  2.18  0.59
5b     F(1,51)=4.74   0.034  2.12  0.56

All six studies produced statistically significant results. To achieve this outcome, two conditions have to be met: (a) the effect exists and (b) sampling error is small enough to avoid a failed study (i.e., a non-significant result even though the effect is real).  The probability of obtaining a significant result is called power. The last column shows observed power, which can be used to estimate the actual power of the six studies. Median observed power is 60%.  With 60% power, we would expect only 60% of the six studies (3.6 studies) to produce a significant result, yet all six studies are significant.  This excess of significant results shows that the article presents an overly positive picture of the robustness of the effect.  If these six studies were replicated exactly, we would not expect to obtain six significant results again.  Moreover, the inflation of significant results also inflates the power estimate. The R-Index corrects for this inflation by subtracting the inflation rate (100% observed success rate – 60% median observed power) from the power estimate.  The R-Index is .60 – .40 = .20.  Results with such a low R-Index often do not replicate in independent replication attempts.
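For readers who want to check these numbers, here is a minimal sketch in R of how observed power and the R-Index can be computed from the reported F-statistics. This is an illustration under the assumptions described above, not the original analysis code, and the variable names are mine.

# Convert the reported F(1, df2) statistics to two-sided p-values, z-scores,
# and observed power at alpha = .05, then compute the R-Index.
Fval <- c(5.15, 4.32, 8.84, 7.45, 4.91, 4.74)
df2  <- c(76, 33, 14, 79, 89, 51)
p    <- pf(Fval, 1, df2, lower.tail = FALSE)   # p-values as reported in the table
z    <- qnorm(1 - p / 2)                       # strength of evidence on the z-scale
op   <- pnorm(z - qnorm(0.975))                # observed power for a test at alpha = .05
median_op <- median(op)                        # approximately .60
inflation <- 1 - median_op                     # observed success rate (6/6) minus median power
r_index   <- median_op - inflation             # approximately .20
round(cbind(p = p, z = z, observed.power = op), 2)
round(c(median.power = median_op, inflation = inflation, R.Index = r_index), 2)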

Another method to examine the replicability of these results is to examine the variability of the z-scores (second-to-last column).  Each z-score reflects the strength of evidence against the null hypothesis. Even if the same study is replicated exactly, this measure will vary as a function of random sampling.  The expected variance is approximately 1 (the variance of the standard normal distribution).  If the observed variance is much lower, future studies can be expected to produce more variable results, and because the reported p-values are all close to .05, some of these future results are expected to be non-significant.  This bias test is called the Test of Insufficient Variance (TIVA).  The variance of the six z-scores is Var(z) = 0.07.  The probability of obtaining such restricted variance by chance is p = .003 (about 1 in 300).
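The TIVA calculation is equally easy to reproduce. Below is a minimal sketch in R, using the z-scores from the table above; under the null hypothesis that the studies are an unbiased sample, (k - 1) times the sample variance of k z-scores follows approximately a chi-square distribution with k - 1 degrees of freedom.

# Test of Insufficient Variance (TIVA) for the six reported z-scores.
z      <- c(2.22, 2.00, 2.57, 2.66, 2.18, 2.12)
k      <- length(z)
var_z  <- var(z)                                # approximately 0.07; expected value is about 1
p_tiva <- pchisq((k - 1) * var_z, df = k - 1)   # probability of such low variance by chance
round(c(variance = var_z, p = p_tiva), 3)       # p is approximately .003, i.e. about 1 in 300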

Based on these results, the statistical evidence presented in this article is questionable and does not provide support for the conclusion that subliminal evaluative conditioning can enhance implicit self-esteem.  Another problem with this conclusion is that implicit self-esteem measures have low reliability and low convergent validity.  As a result, we would not expect strong and consistent effects of any experimental manipulation on these measures.  Finally, even if a small and reliable effect could be obtained, it remains an open question whether the manipulation actually changes implicit self-esteem or merely produces a systematic bias in the measurement of implicit self-esteem.  “It is not yet known how long the effects of this manipulation last. In addition, it is not yet known whether people who could really benefit from enhanced self-esteem (i.e., people with problematically low levels of self-esteem) can benefit from subliminal conditioning techniques.” (p. 353).  Twelve years later, we may wonder whether these results have been replicated in other laboratories and whether the effects last more than a few minutes after the conditioning experiment.

If you like Part I better, feel free to boost your self-esteem here.

selfesteemboost

 

Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the Bayes Factor of 36 reported in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size.  We present an alternative Bayesian multilevel approach that avoids the pitfalls of Bayesian Evidence Synthesis and show that the original set of studies provides at best weak evidence for an effect of social norms on towel reuse.
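To illustrate the general idea of the multilevel approach (this is a sketch, not the code from the article), a Bayesian random-effects meta-analysis of study-level effect sizes can be fit with the brms package. The data frame towels and its columns yi (effect size), sei (standard error), and study (study identifier) are hypothetical names.

# Minimal sketch of a Bayesian multilevel (random-effects) meta-analysis in brms.
# Assumes a hypothetical data frame `towels` with one row per study:
# an effect size `yi`, its standard error `sei`, and a `study` identifier.
library(brms)

fit <- brm(
  yi | se(sei) ~ 1 + (1 | study),   # studies vary around a common mean effect
  data  = towels,
  prior = c(prior(normal(0, 1), class = Intercept),   # two-sided prior on the mean effect
            prior(cauchy(0, 0.3), class = sd)),       # weakly informative prior on heterogeneity
  iter = 4000, chains = 4
)
summary(fit)   # posterior for the average effect and the between-study heterogeneity

The key difference from the criticized approach is that the true effect size is not forced to be identical across a heterogeneous set of studies, and the prior on the mean effect is not restricted to one direction.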

Peer-Reviews from Psychological Methods

Times are changing. Media are flooded with fake news and journals are filled with fake novel discoveries. The only way to fight bias and fake information is full transparency and openness.
 
Jerry Brunner and I submitted a paper that examines the validity of z-curve, the method underlying powergraphs, to Psychological Methods.

As soon as we submitted it, we made the manuscript and the code available. Nobody used the opportunity to comment on the manuscript. Now we have received the official reviews.

We would like to thank the editor and reviewers for spending time and effort on reading (or at least skimming) our manuscript and writing comments.  Normally, this effort would be largely wasted because, like many other authors, we are going to ignore most of their well-meaning comments and suggestions and try to publish the manuscript mostly unchanged somewhere else. As the editor points out, we can be hopeful that our manuscript will eventually be published because 95% of written manuscripts get published eventually. So, why change anything?  However, we think the work of the editor and reviewers deserves some recognition, and some readers of our manuscript may find it valuable. Therefore, we are happy to share their comments for readers interested in replicability and our method of estimating replicability from test statistics in original articles.

 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Dear Dr. Brunner,

I have now received the reviewers’ comments on your manuscript. Based on their analysis and my own evaluation, I can no longer consider this manuscript for publication in Psychological Methods. There are two main reasons that I decided not to accept your submission. The first deals with the value of your statistical estimate of replicability. My first concern is that you define replicability specifically within the context of NHST by focusing on power and p-values. I personally have fewer problems with NHST than many methodologists, but given the fact that the literature is slowly moving away from this paradigm, I don’t think it is wise to promote a method to handle replicability that is unusable for studies that are conducted outside of it. Instead of talking about replicability as estimating the probability of getting a significant result, I think it would be better to define it in more continuous terms, focusing on how similar we can expect future estimates (in terms of effect sizes) to be to those that have been demonstrated in the prior literature. I’m not sure that I see the value of statistics that specifically incorporate the prior sample sizes into their estimates, since, as you say, these have typically been inappropriately low.

Sure, it may tell you the likelihood of getting significant results if you conducted a replication of the average study that has been done in the past. But why would you do that instead of conducting a replication that was more appropriately powered?

Reviewer 2 argues against the focus on original study/replication study distinction, which would be consistent with the idea of estimating the underlying distribution of effects, and from there selecting sample sizes that would produce studies of acceptable power. Reviewer 3 indicates that three of the statistics you discussed are specifically designed for single studies, and are no longer valid when applied to sets of studies, although this reviewer does provide information about how these can be corrected.

The second main reason, discussed by Reviewer 1, is that although your statistics may allow you to account for selection biases introduced by journals not accepting null results, they do not allow you to account for selection effects prior to submission. Although methodologists will often bring up the file drawer problem, it is much less of an issue than people believe. I read about a survey in a meta-analysis text (I unfortunately can’t remember the exact citation) that indicated that over 95% of the studies that get written up eventually get published somewhere. The journal publication bias against non-significant results is really more an issue of where articles get published, rather than if they get published. The real issue is that researchers will typically choose not to write up results that are non-significant, or will suppress non-significant findings when writing up a study with other significant findings. The latter case is even more complicated, because it is often not just a case of including or excluding significant results, but is instead a case where researchers examine the significant findings they have and then choose a narrative that makes best use of them, including non-significant findings when they are part of the story but excluding them when they are irrelevant. The presence of these author-side effects means that your statistic will almost always be overestimating the actual replicability of a literature.

The reviewers bring up a number of additional points that you should consider. Reviewer 1 notes that your discussion of the power of psychological studies is 25 years old, and therefore likely doesn’t apply. Reviewer 2 felt that your choice to represent your formulas and equations using programming code was a mistake, and suggests that you stick to standard mathematical notation when discussing equations. Reviewer 2 also felt that you characterized researcher behaviors in ways that were more negative than is appropriate or realistic, and that you should tone down your criticisms of these behaviors. As a grant-funded researcher, I can personally promise you that a great many researchers are concerned about power, since you cannot receive government funding without presenting detailed power analyses. Reviewer 2 noted a concern with the use of web links in your code, in that this could be used to identify individuals using your syntax. Although I have no suspicions that you are using this to keep track of who is reviewing your paper, you should remove those links to ensure privacy. Reviewer 1 felt that a number of your tables were not necessary, and both reviewers 2 and 3 felt that there were parts of your writing that could be notably condensed. You might consider going through the document to see if you can shorten it while maintaining your general points. Finally, reviewer 3 provides a great many specific comments that I feel would greatly enhance the validity and interpretability of your results. I would suggest that you attend closely to those suggestions before submitting to another journal.

For your guidance, I append the reviewers’ comments below and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.

Sincerely, Jamie DeCoster, PhD
Associate Editor
Psychological Methods

 

Reviewers’ comments:

Reviewer #1:

The goals of this paper are admirable and are stated clearly here: “it is desirable to have an alternative method of estimating replicability that does not require literal replication. We see this method as complementary to actual replication studies.”

However, I am bothered by an assumption of this paper, which is that each study has a power (for example, see the first two paragraphs on page 20). This bothers me for several reasons. First, any given study in psychology will often report many different p-values. Second, there is the issue of p-hacking or forking paths. The p-value, and thus the power, will depend on the researcher’s flexibility in analysis. With enough researcher degrees of freedom, power approaches 100% no matter how small the effect size is. Power in a preregistered replication is a different story. The authors write, “Selection for significance (publication bias) does not change the power values of individual studies.” But to the extent that there is selection done _within_ a study–and this is definitely happening–I don’t think that quoted sentence is correct.

So I can’t really understand the paper as it is currently written, as it’s not clear to me what they are estimating, and I am concerned that they are not accounting for the p-hacking that is standard practice in published studies.

Other comments:

The authors write, “Replication studies ensure that false positives will be promptly discovered when replication studies fail to confirm the original results.” I don’t think “ensure” is quite right, since any replication is itself random. Even if the null is true, there is a 5% chance that a replication will confirm just by chance. Also many studies have multiple outcomes, and if any appears to be confirmed, this can be taken as a success. Also, replications will not just catch false positives, they will also catch cases where the null hypothesis is false but where power is low. Replication may have the _goal_ of catching false positives, but it is not so discriminating.

The Fisher quote, “A properly designed experiment rarely fails to give …significance,” seems very strange to me. What if an experiment is perfectly designed, but the null hypothesis happens to be true? Then it should have a 95% chance of _not_ giving significance.

The authors write, “Actual replication studies are needed because they provide more information than just finding a significant result again. For example, they show that the results can be replicated over time and are not limited to a specific historic, cultural context. They also show that the description of the original study was sufficiently precise to reproduce the study in a way that it successfully replicated the original result.” These statements seem too strong to me. Successful replication is rejection of the null, and this can happen even if the original study was not described precisely, etc.

The authors write, “A common estimate of power is that average power is about 50% (Cohen 1962, Sedlmeier and Gigerenzer 1989). This means that about half of the studies in psychology have less than 50% power.” I think they are confusing the mean with the median here. Also I would guess that 50% power is an overestimate. For one thing, psychology has changed a lot since 1962 or even 1989 so I see no reason to take this 50% guess seriously.

The authors write, “We define replicability as the probability of obtaining the same result in an exact replication study with the same procedure and sample sizes.” I think that by “exact” they mean “pre-registered” but this is not clear. For example, suppose the original study was p-hacked. Then, strictly speaking, an exact replication would also be p-hacked. But I don’t think that’s what the authors mean. Also, it might be necessary to restrict the definition to pre-registered studies with a single test. Otherwise there is the problem that a paper has several tests, and any rejection will be taken as a successful replication.

I recommend that the authors get rid of tables 2-15 and instead think more carefully about what information they would like to convey to the reader here.

Reviewer #2:

This paper is largely unclear, and in the areas where it is clear enough to decipher, it is unwise and unprofessional.

This study’s main claim seems to be: “Thus, statistical estimates of replicability and the outcome of replication studies can be seen as two independent methods that are expected to produce convergent evidence of replicability.” This is incorrect. The approaches are unrelated. Replication of a scientific study is part of the scientific process, trying to find out the truth. The new study is not the judge of the original article, its replicability, or scientific contribution. It is merely another contribution to the scientific literature. The replicator and the original article are equals; one does not have status above the other. And certainly a statistical method applied to the original article has no special status unless the method, data, or theory can be shown to be an improvement on the original article.

They write, “Rather than using traditional notation from Statistics that might make it difficult for non-statisticians to understand our method, we use computer syntax as notation.” This is a disqualifying stance for publication in a serious scholarly journal, and it would an embarrassment to any journal or author to publish these results. The point of statistical notation is clarity, generality, and cross-discipline understanding. Computer syntax is specific to the language adopted, is not general, and is completely opaque to anyone who uses a different computer language. Yet everyone who understands their methods will have at least seen, and needs to understand, statistical notation. Statistical (i.e., mathematical) notation is the one general language we have that spans the field and different fields. No computer syntax does this. Proofs and other evidence are expressed in statistical notation, not computer syntax in the (now largely unused) S statistical language. Computer syntax, as used in this paper, is also ill-defined in that any quantity defined by a primitive function of the language can change any time, even after publication, if someone changes the function. In fact, the S language, used in this paper, is not equivalent to R, and so the authors are incorrect that R will be more understandable. Not including statistical notation, when the language of the paper is so unclear and self-contradictory, is an especially unfortunate decision. (As it happens I know S and R, but I find the manuscript very difficult to understand without imputing my own views about what the authors are doing. This is unacceptable. It is not even replicable.) If the authors have claims to make, they need to state them in unambiguous mathematical or statistical language and then prove their claims. They do not do any of these things.

It is untrue that “researchers ignore power”. If they do, they will rarely find anything of interest. And they certainly write about it extensively. In my experience, they obsess over power, balancing whether they will find something with the cost of doing the experiment. In fact, this paper misunderstands and misrepresents the concept: Power is not “the long-run probability of obtaining a statistically significant result.” It is the probability that a statistical test will reject a false null hypothesis, as the authors even say explicitly at times. These are very different quantities.

This paper accuses “researchers” of many other misunderstandings. Most of these are theoretically incorrect or empirically incorrect. One point of the paper seems to be “In short, our goal is to estimate average power of a set of studies with unknown population effect sizes that can assume any value, including zero.” But I don’t see why we need to know this quantity or how the authors’ methods contribute to us knowing it. The authors make many statistical claims without statistical proofs, without any clear definition of what their claims are, and without empirical evidence. They use simulation that inquires about a vanishingly small portion of the sample space to substitute for an infinite domain of continuous parameter values; they need mathematical proofs but do not even state their claims in clear ways that are amenable to proof.

No coherent definition is given of the quantity of interest. “Effect size” is not generic and hypothesis tests are not invariant to the definition, even if it is true that they are monotone transformations of each other. One effect size can be “significant” and a transformation of the effect size can be “not significant” even if calculated from the same data. This alone invalidates the authors’ central claims.

The first 11.5 pages of this paper should be summarized in one paragraph. The rest does not seem to contribute anything novel. Much of it is incorrect as well. Better to delete throat clearing and get on with the point of the paper.

I’d also like to point out that the authors have hard-coded URL links to their own web site in the replication code. The code cannot be run without making a call to the author’s web site, and recording the reviewer’s IP address in the authors’ web logs. Because this enables the authors to track who is reviewing the manuscript, it is highly inappropriate. It also makes it impossible to replicate the authors’ results. Many journals (and all federal grants) have prohibitions on this behavior.

I haven’t checked whether Psychological Methods has this rule, but the authors should know better regardless.

Reviewer 3

Review of “How replicable is psychology? A comparison of four methods of estimating replicability on the bias of test statistics in original studies”

It was my pleasure to review this manuscript. The authors compare four methods of estimating replicability. One undeniable strength of the general approach is that these measures of replicability can be computed before or without actually replicating the study/studies. As such, one can see the replicability measure of a set of statistically significant findings as an index of trust in these findings, in the sense that the measure provides an estimate of the percentage of these studies that is expected to be statistically significant when replicating them under the same conditions and same sample size (assuming the replication study and the original study assess the same true effect). As such, I see value in this approach. However, I have many comments, major and minor, which will enable the authors to improve their manuscript.

Major comments

1. Properties of index.

What I miss, and what would certainly be appreciated by the reader, is a description of properties of the replicability index. This would include that it has a minimum value equal to 0.05 (or more generally, alpha), when the set of statistically significant studies has no evidential value. Its maximum value equals 1, when the power of studies included in the set was very large. A value of .8 corresponds to the situation where statistical power of the original situation was .8, as often recommended. Finally, I would add that both sample size and true effect size affect the replicability index; a high value of say .8 can be obtained when true effect size is small in combination with a large sample size (you can consider giving a value of N, here), or with a large true effect size in combination with a small sample size (again, consider giving values).

Consider giving a story like this early, e.g. bottom of page 6.

2. Too long explanations/text

Perhaps it is a matter of taste, but sometimes I consider explanations much too long. Readers of Psychological Methods may be expected to know some basics. To give you an example, the text on page 7 in “Introduction of Statistical Methods for Power estimation” is very long. I believe its four paragraphs can be summarized into just one; particularly the first one can be summarized in one or two sentences. Similarly, the section on “Statistical Power” can be shortened considerably, imo. Other specific suggestions for shortening the text, I mention below in the “minor comments” section. Later on I’ll provide one major comment on the tables, and how to remove a few of them and how to combine several of them.

3. Wrong application of ML, p-curve, p-uniform

This is THE main comment, imo. The problem is that ML (Hedges, 1984), p-curve, p-uniform, enable the estimation of effect size based on just ONE study. Moreover,  Simonsohn (p-curve) as well as the authors of p-uniform would argue against estimating the average effect size of unrelated studies. These methods are meant to meta-analyze studies on ONE topic.

4. P-uniform and p-curve section, and ML section

This section needs a major revision. First, I would start the section with describing the logic of the method. Only statistically significant results are selected. Conditional on statistical significance, the methods are based on conditional p-values (not just p-values), and then I would provide the formula on top of page 18. Most importantly, these techniques are not constructed for estimating effect size of a bunch of unrelated studies. The methods should be applied to related studies. In your case, to each study individually. See my comments earlier.

Ln(p), which you use in your paper, is not a good idea here for two reasons: (1) It is most sensitive to heterogeneity (which is also put forward by Van Assen et al (2014), and (2) applied to single studies it estimates effect size such that the conditional p-value equals 1/e, rather than .5  (resulting in less nice properties).

The ML method, as it was described, focuses on estimating effect size using one single study (see Hedges, 1984). So I was very surprised to see it applied differently by the authors. Applying ML in the context of this paper should be the same as p-uniform and p-curve, using exactly the same conditional probability principle. So, the only difference between the three methods is the method of optimization. That is the only difference.

You develop a set-based ML approach, which needs to assume a distribution of true effect size. As said before, I leave it up to you whether you still want to include this method. For now, I have a slight preference to include the set-based approach because it (i) provides a nice reference to your set-based approach, called z-curve, and (ii) using this comparison you can “test” how robust the set-based ML approach is against a violation of the assumption of the distribution of true effect size.

Moreover, I strongly recommend showing how their estimates differ for certain studies, and include this in a table. This allows you to explain the logic of the methods very well. Here a suggestion. I would provide the estimates of four methods (…) for p-values .04, .025, .01, .001, and perhaps .0001. This will be extremely insightful. For small p-values, the three methods’ estimates will be similar to the traditional estimate. For p-values > .025, the estimate will be negative, for p = .025 the estimate will be (close to) 0. Then, you can also use these same studies and p-values to calculate the power of a replication study (R-index).

I would exclude Figure 1, and the corresponding text. Is not (no longer) necessary.

For the set-based ML approach, if you still include it, please explain how you get to the true value distribution (g(theta)).

5a. The MA set, and test statistics

Many different effect sizes and test statistics exist. Many of them can be transformed to ONE underlying parameter, with a sensible interpretation and certain statistical properties. For instance, the chi2, t, and F(1,df) can all be transformed to d or r, and their SE can be derived. In the RPP project and by John et al (2016) this is called the MA set. Other test statistics, such as F(>1, df) cannot be converted to the same metric, and no SE is defined on that metric. Therefore, the statistics F(>1,df) were excluded from the meta-analyses in the RPP  (see the supplementary materials of the RPP) and by Johnson et al (2016) and also Morey and Lakens (2016), who also re-analyzed the data of the RPP.

Fortunately, in your application you do not estimate effect size but only estimate power of a test, which only requires estimating the ncp and not effect size. So, in principle you can include the F(>1,df) statistics in your analyses, which is a definite advantage. Although I can see you can incorporate it for the ML, p-curve, p-uniform approach, I do not directly see how these F(>1,df) statistics can be used for the two set-based methods (ML and z-curve); in the set-based methods, you put all statistics on one dimension (z) using the p-values. How do you defend this?

5b. Z-curve

Some details are not clear to me, yet. How many components (called r in your text) are selected, and why? Your text states: “First, select a ncp parameter m. Then generate Z from a normal distribution with mean m.” I do not understand, since the normal distribution does not have an ncp. Is it that you nonparametrically model the distribution of observed Z, with different components?

Why do you use kernel density estimation? What is its added value? Why making it more imprecise by having this step in between? Please explain.

Except for these details, procedure and logic of z-curve are clear

6. Simulations (I): test statistics

I have no reasons, theoretical or empirical, why the analyses would provide different results for Z, t, F(1,df), F(>1,df), chi2. Therefore, I would omit all simulation results of all statistics except 1, and not talk about results of these other statistics. For instance, in the simulations section I would state that results are provided on each of these statistics but present here only the results of t, and of others in supplementary info. When applying the methods to RPP, you apply them to all statistics simultaneously, which you could mention in the text (see also comment 4 above).

7. mean or median power (important)

One of my most important points is the assessment of replicability itself. Consider a set of studies for which replicability is calculated, for each study. So, in case of M studies, there are M replicability indices. Which statistics would be most interesting to report, i.e., are most informative? Note that the distribution of power is far from symmetrical, and actually may be bimodal with modes at 0.05 and 1.  For that reason alone, I would include in any report of replicability in a field the proportion of R-indices equal to 0.05 (which amounts to the proportion of results with .025 < p < .05) and the proportion of R-indices equal to 1.00 (e.g., using two decimals, i.e. > .995). Moreover, because power values of .8 or more are recommended, I also could include the proportion of studies with power > .8.

We also would need a measure of central tendency. Because the distribution is not symmetric, and may be skewed, I recommend using the median rather than the mean. Another reason to use the median rather than the mean is because the mean does not provide usable information on whether methods are biased or not, in the simulations. For instance, if true effect size = 0, because of sampling error the observed power will exceed .05 in exactly 50% of the cases (this is the case for p-uniform; since with probability .5 the p-value will exceed .025) and fall below .05 in the other 50% of the cases. Hence, the median will be exactly equal to .05, whereas the mean will exceed .05. Similarly, if true effect size is large the mean power will be too small (distribution skewed to the left). To conclude, I strongly recommend including the median in the results of the simulation.

In a report, such as for the RPP later on in the paper, I recommend including (i) p(R=.05), (ii) p(R >= .8), (iii) p(R >= .995), (iv) median(R), (v) sd(R), (vi) the distribution of R, and (vii) mean R. You could also distinguish this for soc psy and cog psy.

8. simulations (II): selection of conditions

I believe it is unnatural to select conditions based on “mean true power” because we are most familiar with effect size and their distribution, and sample sizes and their distribution. I recommend describing these distributions, and then the implied power distribution (surely the median value as well, not or not only the mean).

9.  Omitted because it could reveal identity of reviewer

10. Presentation of results

I have comments on what you present, on how you present the results. First, what you present. For the ML and p-methods, I recommend presenting the distribution of R in each of the conditions (at least for fixed true effect size and fixed N, where results can be derived exactly relatively easy). For the set-based methods, if you focus on average R (which I do not recommend, I recommend median R), then present the RMSE. The absolute median error is minimized when you use the median. So, average-RMSE is a couple, and median-absolute error is a couple.

Now the presentation of results. Results of p-curve/p-uniform/ML are independent of the number of tests, but set-based methods (your ML variant) and z-curve are not.

Here the results I recommend presenting:

Fixed effect size, heterogeneity sample size

**For single-study methods, the probability distribution of R (figure), including mean(R), median(R), p(R=.05), p(R>= .995), sd(R). You could use simulation for approximating this distribution. Figures look like those in Figure 3, to the right.

**Median power, mean/sd as a function of K

**Bias for ML/p-curve/p-uniform amounts to the difference between median of distribution and the actual median, or the difference between the average of the distribution and the actual average. Note that this is different from set-based methods.

**For set-based methods, a table is needed (because of its dependence on k).

Results can be combined in one table (i.e., 2-3, 5-6, etc)

Significance tests comparing methods

I would exclude Table 4, Table 7, Table 10, Table 13. These significance tests do not make much sense. One method is better than another, or not – significance should not be relevant (for a very large number of iterations, a true difference will show up). You could simply describe in the text which method works best.

Heterogeneity in both sample size and effect size

You could provide similar results as for fixed effect size (but not for chi2, or other statistics). I would also use the same values of k as for the fixed effect case. For the fixed effect case you used 15, 25, 50, 100, 250. I can imagine using as values of k for both conditions k = 10, 30, 100, 400, 2,000 (or something).

Including the k = 10 case is important, because set-based methods will have more problems there, and because one paper or a meta-analysis or one author may have published just one or few statistically significant effect sizes. Note, however, that k=2,000 is only realistic when evaluating a large field.

Simulation of complex heterogeneity

Same results as for fixed effect size and heterogeneity in both sample size and effect size. Good to include a condition where the assumption of set-based ML is violated. I do not yet see why a correlation between N and ES may affect the results. Could you explain? For instance, for the ML/p-curve/p-uniform methods, all true effect sizes in combination with N result in a distribution of R for different studies; how this distribution is arrived at, is not relevant, so I do not yet see the importance of this correlation. That is, this correlation should only affect the results through the distribution of R. More reasoning should be provided, here.

Simulation of full heterogeneity

I am ambivalent about this section. If test statistic should not matter, then what is the added value of this section? Other distributions of sample size may be incorporated in previous section “complex heterogeneity”;. Other distributions of true effect may also be incorporated in the previous section. Note that Johnson et al (2016) use the RPP data to estimate that 90% of effects in psychology estimate a true zero effect. You assume only 10%.

Conservative bootstrap

Why only presenting the results of z-curve? By changing the limits of the interval, the interpretation becomes a bit awkward; what kind of interval is it now? Most importantly, coverages of .9973 or .9958 are horrible (in my opinion, these coverages are just as bad as coverages of .20). I prefer results of 95% confidence intervals, and then show their coverages in the table. Your ‘conservative’ CIs are hard to interpret. Note also that this is a paper on statistical properties of the methods, and one property is how well the methods perform w.r.t. 95% CI.

By the way, examining 95% CI of the methods is very valuable.

11. RPP

In my opinion, this section should be expanded substantially. This is where you can finally test your methodology, using real data! What I would add is the following:

**Provide the distribution of R (including all statistics mentioned previously, i.e. p(R=0.05), p(R >= .8), p(R >= .995), median(R), mean(R), sd(R)), using single-study methods

**Provide previously mentioned results for soc psy and cog psy separately

**Provide results of z-curve, and show your kernel density curve (strange that you never show this curve, if it is important in your algorithm).

What would be really great is if you predict the probability of replication success (power) using the effect size estimate derived from the original study and the N of the replication sample. You could make a graph with this power on the X-axis and the result of the replication on the Y-axis. Strong evidence in favor of your method would be if your result better predicts future replicability than any other index (see RPP for what they tried). Logistic regression seems to be the most appropriate technique for this.

Using multiple logistic regression, you can also assess if other indices have an added value above your predictions.

To conclude, for now you provide too limited results to convince readers that your approach is very useful.

Minor comments

P4 top: “heated debates” A few more sentences on this debate, including references to those debates, would be fair. I would like to mention/recommend the studies of Maxwell et al (2015) in American Psychologist, the comment on the OSF piece in Science, and its response, and the very recent piece of Valen E Johnson et al (2016).

P4, middle: consider starting a new paragraph at “Actual replication”. In the sentence after this one, you may add “or not”.

Another advantage of replication is that it may reveal heterogeneity (context dependence). Here, you may refer to the ManyLabs studies, which indeed reveal heterogeneity in about half of the replicated effects. Then, the next paragraph may start with “At the same time”. To conclude, this piece starting with “Actual replication” can be expanded a bit.

P4, bottom, “In contrast”: This and the preceding sentence are formulated as if sampling error does not exist. It is much too strong! Moreover, if the replication study had low power, sampling error is likely the reason for a statistically insignificant result. Here you can be more careful/precise. The last sentence of this paragraph is perfect.

P5, middle: consider adding more refs on estimates of power in psychology, e.g. Bakker and Wicherts 35% and that study on neuroscience with power estimates close to 20%. Last sentence of the same paragraph; assuming same true effect and same sample size.

P6, first paragraph around Rosenthal. Consider referring to the study of Johnson et al (2016), who used a Bayesian analysis to estimate how many non-significant studies remain unpublished.

P7, top: “studies have the same power (homogenous case)” and “(heterogenous case)”. This is awkward. Homogeneity and heterogeneity are generally reserved for variation in true effect size. Stick to that. Another problem here is that “heterogeneous” power can be created by “heterogeneity” in sample size and/or heterogeneity in effect size. These should be distinguished, because some methods can deal with heterogeneous power caused by heterogeneous N, but not heterogeneous true effect size. So, here, I would simply delete the texts between brackets.

P7, last sentence of first paragraph; I do not understand the sentence.

P10, “average power”. I did not understand this sentence.

P10, bottom: Why do you believe these methods to be most promising?

P11, 2nd par: Rephrase this sentence. Heterogeneity of effect size is not because of sampling variation. Later in this paragraph you also mix up heterogeneity with variation in power again. Of course, you could re-define heterogeneity, but I strongly recommend not doing so (in order not to confuse others); reserve heterogeneity to heterogeneity in true effect size.

P11, 3rd par, 1st sentence: I do not understand this sentence. But then again, this sentence may not be relevant (see major comments), because for applying p-uniform and p-curve heterogeneity of effect size is not relevant.

P11 bottom: maximum likelihood method. This sentence is not specific enough. But then again, this sentence may not be relevant (see major comments).

P12: Statistics without capital.

P12: “random sampling distribution”: delete “random”. By the way, I liked this section on Notation and statistical background.

Section “Two populations of power”. I believe this section is unnecessarily long, with a lot of text. Consider shortening. The spinning wheel analogy is ok.

P16, “close to the first”: You mean second?

P16, last paragraph, 1st sentence: English?

Principle 2: The effect on what? Delete last sentence in the principle.

P17, bottom: include the average power after selection in your example.

p-curve/p-uniform: modify, as explained in one of the major comments.

P20, last sentence: Modify the sentence – the ML approach has excellent properties asymptotically, but not when sample size is small. Now it states that it generally yields more precise estimates.

P25, last sentence of 4. Consider deleting this sentence (does not add anything useful).

P32: “We believe that a negative correlation between” some part of sentence is missing.

P38, penultimate sentence: explain what you mean by “decreasing the lower limit by .02” and “increasing the upper limit by .02”.