Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Jerry Brunner and I developed two methods to estimate the replicability of published results based on the test statistics reported in original studies.  One method, z-curve, is used to provide the replicability estimates in my powergraphs.

In September, we submitted a manuscript that describes these methods to Psychological Methods, where it was rejected.

We have now revised the manuscript. The new version contains a detailed discussion of various criteria for replicability and argues that a significant result in an exact replication study is an important, if not the only, criterion for evaluating the outcome of replication studies.

It also makes a clear distinction between selection for significance in an original study and the file-drawer problem in a series of conceptual or exact replication studies. Our methods assume only selection for significance in original studies, but no file drawers or questionable research practices.  This idealistic assumption may explain why our model predicts a much higher success rate for the OSC reproducibility project (66%) than was actually obtained (36%).  As there is ample evidence for file drawers of non-significant conceptual replication studies, we believe that file drawers and QRPs contribute to the low success rate in the OSC project. However, we also mention concerns about the quality of some replication studies.

We hope that the revised version is clearer, but fundamentally nothing has changed. Reviewers at Psychological Methods didn’t like our paper, and the editor thought NHST is no longer relevant (see editorial letter and reviews), but nobody challenged our statistical method or the results of the simulation studies that validate it. The method works, and it provides an estimate of replicability under very idealistic conditions, which means we can only expect a considerably lower success rate in actual replication studies as long as researchers file-drawer non-significant results.

Comments are welcome because we do not expect a straight acceptance from Perspectives on Psychological Science (LOL).

 

Murderer back on the crime scene

How Did Diederik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Diederik Stapel represents everything that has gone wrong in social psychology.  Until 2011, he was seen as a successful scientist who made important contributions to the literature on social priming.  In the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” he presented 8 studies that showed that social comparisons can occur in response to stimuli that were presented without awareness (subliminally).  The results appeared in the top journal of social psychology, published by the American Psychological Association (APA), and the APA issued a press release about this work for the general public.
In 2011, an investigation into Diederik Stapel’s research practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice).  In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated.  All we know is that he used a number of different methods that range from making up entire datasets to the use of questionable research practices that increase the chance of producing a significant result.  These practices are widely used and are not considered scientific fraud, although the end result is the same: published results no longer provide credible empirical evidence for the claims made in a published article.
I had two hypotheses. First, the data could be entirely made up. When researchers fabricate data, they are likely to overestimate the real effect sizes and produce data that show the predicted pattern much more clearly than real data would. In this case, bias tests would not show a problem with the data.  The only evidence that the data are fake would be that the evidence is stronger than in comparable studies that relied on real data.
 
In contrast, a researcher who starts with real data and then uses questionable practices is likely to use as few dishonest practices as possible, because this makes it easier to justify the questionable decisions.  For example, removing 10% of the data may seem justified, especially if some rationale for exclusion can be found.  However, removing 60% of the data cannot be justified.  The researcher needs to use these practices just enough to produce the desired outcome, namely a p-value below .05 (or at least very close to .05).  Because further use of questionable practices is not needed and is harder to justify, the researcher will stop there rather than produce stronger evidence.  As a result, we would expect a large number of just significant results.
There are two bias tests that detect the latter form of fabricating significant results by means of questionable statistical methods: the Replicability-Index (R-Index) and the Test of Insufficient Variance (TIVA).   If Stapel used questionable statistical practices to produce just significant results, R-Index and TIVA would show evidence of bias.
The article reported 8 studies. The table shows the key finding of each study.
Study Statistic p z OP
1 F(1,28)=4.47 0.044 2.02 0.52
2A F(1,38)=4.51 0.040 2.05 0.54
2B F(1,32)=4.20 0.049 1.97 0.50
2C F(1,38)=4.13 0.049 1.97 0.50
3 F(1,42)=4.46 0.041 2.05 0.53
4 F(2,49)=3.61 0.034 2.11 0.56
5 F(1,29)=7.04 0.013 2.49 0.70
6 F(1,55)=3.90 0.053 1.93 0.49
All results were interpreted as evidence for an effect, and the p-value for Study 6 (p = .053) was reported as p = .05.
All p-values are below .053 but greater than .01.  This is an unlikely outcome because sampling error should produce more variability in p-values.  TIVA examines whether there is insufficient variability.  First, p-values are converted into z-scores.  The variance of the z-scores due to sampling error alone is expected to be approximately 1.  However, the observed variance is only Var(z) = 0.032.  A chi-square test shows that this observed variance is unlikely to occur by chance alone, p = .000035. We would expect such extremely low variability, or even less, in only 1 out of 28,458 sets of studies by chance alone.
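For readers who want to check these numbers, here is a minimal R sketch of the TIVA computation (my own illustration, not code from the original analysis). It converts the reported F-tests into two-tailed p-values and z-scores and then tests the observed variance of the z-scores against the expected value of 1 with a chi-square test; with rounding, the results should come out close to the values reported above.

# Reported F-tests from the eight studies in the table above
F.values <- c(4.47, 4.51, 4.20, 4.13, 4.46, 3.61, 7.04, 3.90)
df1      <- c(1, 1, 1, 1, 1, 2, 1, 1)
df2      <- c(28, 38, 32, 38, 42, 49, 29, 55)

p <- pf(F.values, df1, df2, lower.tail = FALSE)  # two-tailed p-values
z <- qnorm(1 - p / 2)                            # convert p-values to z-scores

# TIVA: under sampling error alone, Var(z) is expected to be about 1.
k      <- length(z)
var.z  <- var(z)                                   # observed variance, ~ 0.03
p.tiva <- pchisq((k - 1) * var.z / 1, df = k - 1)  # left-tail probability, ~ .00004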
 
The last column transforms the z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size.  These estimates are influenced by sampling error.  To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the median. The median is 53%.  It is unlikely that a set of 8 studies with a 53% chance of obtaining a significant result produced significant results in all 8 studies.  This finding shows that the reported success rate is not credible. To make matters worse, the estimated probability of obtaining a significant result is itself inflated when a set of studies contains too many significant results.  To correct for this bias, the R-Index computes the inflation rate.  With a 53% probability of success and a 100% success rate, the inflation rate is 47%.  To correct for inflation, the inflation rate is subtracted from the median observed power, which yields an R-Index of 53% – 47% = 6%.  Based on this value, it is extremely unlikely that a researcher would obtain a significant result if they actually replicated the original studies exactly.  The published results show that Stapel could not have produced them without the help of questionable methods, which also means nobody else can reproduce these results.
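As a complement, here is a minimal R sketch of the R-Index calculation just described (again my own illustration, using the z-scores from the table and a two-tailed alpha of .05).

z  <- c(2.02, 2.05, 1.97, 1.97, 2.05, 2.11, 2.49, 1.93)  # z-scores from the table
op <- pnorm(z - qnorm(.975))       # observed power of each study at alpha = .05

median.op    <- median(op)         # ~ .53
success.rate <- 1                  # all 8 results were presented as successes
inflation    <- success.rate - median.op   # ~ .47
r.index      <- median.op - inflation      # ~ .06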
In conclusion, bias tests suggest that Stapel actually collected data and failed to find supporting evidence for his hypotheses.  He then used questionable practices until the results were statistically significant.  It seems unlikely that he outright faked these data and intentionally produced a p-value of .053 and reported it as p = .05.  However, statistical analysis can only provide suggestive evidence and only Stapel knows what he did to get these results.

A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

“Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology”
Journal of Experimental Social Psychology 66 (2016) 148–152
DOI: http://dx.doi.org/10.1016/j.jesp.2016.01.005

a.k.a The Swan Song of Social Psychology During the Golden Age

Disclaimer: I wrote this piece because Jamie Pennebaker recommended writing as therapy to deal with trauma.  However, in his defense, he didn’t propose publishing the therapeutic writings.

————————————————————————-

You might think an article with reproducibility in the title would have something to say about the replicability crisis in social psychology.  However, this article has very little to say about the causes of the replication crisis in social psychology and possible solutions to improve replicability. Instead, it appears to be a perfect example of repressive coping to avoid the traumatic realization that decades of work were fun, yet futile.

1. Introduction

The authors start with a very sensible suggestion. “We propose that the goal of achieving sound scientific insights and useful applications will be better facilitated over the long run by promoting good scientific practice rather than by stressing the need to prevent any and all mistakes.”  (p. 149).  The only questions are how many mistakes we consider tolerable and what the actual error rates are, which we do not know. Rosenthal pointed out that the error rate could be 100%, which even the authors might consider a little too high.

2. Improving research practice

In this section, the authors suggest that “if there is anything on which all researchers might agree, it is the call for improving our research practices and techniques.” (p. 149).  If this were the case, we wouldn’t see articles in 2016 that make statistical mistakes that have been known for decades, such as pooling data from a heterogeneous set of studies or computing difference scores and then using one of the component variables as a predictor of the difference score.

It is also puzzling to read “the contemporary literature indicates just how central methodological innovation has been to advancing the field” (p. 149), when the key problem of low power has been known since 1962 and there is still no sign of improvement.

The authors are also not exactly in favor of adopting better methods when these methods might reveal major problems in older studies.  For example, a meta-analysis in 2010 might not have examined publication bias and produced an effect size of more than half a standard deviation, whereas a new method that controls for publication bias finds that it is impossible to reject the null hypothesis. No, these new methods are not welcome. “In our view, they will stifle progress and innovation if they are seen primarily through the lens of maladaptive perfectionism; namely as ways of rectifying flaws and shortcomings in prior work.”  (p. 149).  So, what is the solution? Let’s pretend that subliminal priming made people walk slower in 1996, but stopped working in 2011?

This ends the section on improving research practice.  Yes, that is the way to deal with a crisis.  When the city is bankrupt, cut back on the Christmas decorations. Problem solved.

3. How to think about replications

Let’s start with a trivial statement that is as meaningless as saying we would welcome more funding:  “Replications are valuable.” (p. 149).  Let’s also not mention that social psychologists have been the leaders in requesting replication studies. No single-study article shall be published in a social psychology journal. A minimum of three studies with conceptual replications of the key finding is needed to show that the results are robust and always produce significant results with p < .05 (or at least p < .10).  Yes, no other science has cherished replications as much as social psychology.

And eminent social psychologists Crandall and Sherman explain why: “to be a cumulative and self-correcting enterprise, replications, be their results supportive, qualifying, or contradictory, must occur.”  Indeed, but what explains the 95% success rate of published replications in social psychology?  There is no need for self-correction if the predictions are always confirmed.

Surprisingly, however, since 2011 a number of replication studies have been published in obscure journals that fail to replicate original results.  This has never happened before and raises some concerns. What is going on here?  Why can these researchers not replicate the original results?  The answer is clear. They are doing it wrong.  “We concur with several authors (Crandall and Sherman, Stroebe) that conceptual replications offer the greatest potential to our field…  Much of the current debate, however, is focused narrowly on direct or exact replications.” (p. 149). As philosophers know, you cannot step into the same river twice, and so you cannot replicate the same study again.  To get a significant result, you need to do a similar, but not an identical, replication study.

Another problem with failed replication studies is that these researchers assume that they are doing an exact replication study, but do not test this assumption. “In this light, Fabrigar’s insistence that researchers take more care to demonstrate psychometric invariance is well-placed” (p. 149).  Once more, the superiority of conceptual replication studies is self-evident. When you do a conceptual replication study, psychometric invariance is guaranteed and does not have to be demonstrated. Just one more reason why conceptual replication studies in social psychology journals produce a 95% success rate, whereas misguided exact replication attempts have failure rates of over 50%.

It is also important to consider the expertise of researchers.  Social psychologists have often demonstrated their expertise by publishing dozens of successful conceptual replications.  In contrast, failed replications are often produced by novices with no track record of ever producing a successful study.  These vast differences in previous success rates need to be taken into account in the evaluation of replication studies.  “Errors caused by low expertise or inadvertent changes are often catastrophic, in the sense of causing a study to fail completely, as Stroebe nicely illustrates.”

It would be a shame if psychology started rewarding these replication studies.  Already limited research funds would be diverted away from senior researchers who do difficult novel studies that always work and that produced groundbreaking new insights into social phenomena during the “golden age” (p. 150) of social psychology, and toward replication studies that are easy to do, yet too difficult for inexperienced researchers to do correctly.

The authors also point out that failed studies are rarely failed studies. When these studies are properly combined with successful studies in a meta-analysis, the results nearly always show the predicted effect, and it was therefore wrong to doubt original studies simply because replication studies failed to show the effect. “Deeper consideration of the terms “failed” and “underpowered” may reveal just how limited the field is by dichotomous thinking. “Failed” implies that a result at p = .06 is somehow inferior to one at p = .05, a conclusion that scarcely merits disputation.” (p. 150).

In conclusion, we learn nothing from replication studies. They are a waste of time and resources and can only impede the further development of social psychology, which proceeds by means of conceptual replication studies that build on the foundations laid during the “golden age” of social psychology.

4. Differential demands of different research topics

Some studies are easier to replicate than others, and replication failures might be “limited to studies that presented methodological challenges (i.e., that had protocols that were considered difficult to carry out) and that provided opportunities for experimenter bias” (p. 150).  It is therefore better not to replicate difficult studies, or to let original authors with a track record of success conduct the conceptual replication studies.

Moreover, some people have argued that the high success rate of original studies is inflated by publication bias (not writing up failed studies) and the use of questionable research practices (running more participants until p < .05).  To ensure that reported successes are real successes, some initiatives call for data sharing, pre-registration of data analysis plans, and a priori power analysis.  Although these may appear to be reasonable suggestions, the authors disagree.  “We worry that reifying any of the various proposals as a “best practice” for research integrity may marginalize researchers and research areas that study phenomena or use methods that have a harder time meeting these requirements.” (p. 150).

The authors appear to be concerned that researchers who do not preregister data analysis plans or do not share data may be stigmatized. “If not, such principles, no matter how well-intentioned, invite the possibility of discrimination, not only within the field but also by decision-makers who are not privy to these realities.”  (p. 150).

5. Considering broader implications

These are confusing times.  In the old days, the goal of research was clearly defined: conduct at least three loosely related, successful studies and write them up with a good story. During those times, it was not acceptable to publish failed studies, in order to maintain the 95% success rate. This made it hard for researchers who did not understand the rules of publishing only significant results. “Recently, a colleague of ours relayed his frustrating experience of submitting a manuscript that included one null-result study among several studies with statistically significant findings. He was met with rejection after rejection, all the while being told that the null finding weakened the results or confused the manuscript” (p. 151).

It is not clear what researchers should be doing now. Should they report all of their studies, the good, the bad, and the ugly, or should they continue to present only the successful ones?   What if some researchers continue to publish in the good old-fashioned way that evolved during the golden age of social psychology, while others try to publish results more in accordance with what actually happened in their lab?  “There is currently, a disconnect between what is good for scientists and what is good for science”, and nobody is going to change while researchers who report only significant results get rewarded with publications in top journals.


There may also be little need to make major changes. “We agree with Crandall and Sherman, and also Stroebe, that social psychology is, like all sciences, a self-correcting enterprise” (p. 151).   And if social psychology is already self-correcting, it does not need new guidelines on how to do research or new replication studies. Rather than instituting new policies, it might be better to make social psychology great again. Rather than publishing means and standard deviations or test statistics that allow data detectives to check results, it might be better to report only whether a result was significant, p < .05, and because 95% of studies are significant and the rest are failed studies, we might simply not report any numbers at all.  False results will be corrected eventually because they will no longer be reported in journals, and the old results might have been true even if they fail to replicate today.   The best approach is to fund researchers with a good track record of success and let them publish in the top journals.

 

Most likely, the replication crisis only exists in the imagination of overly self-critical psychologists. “Social psychologists are often reputed to be among the most severe critics of work within their own discipline” (p. 151).  A healthier attitude is to realize that “we already know a lot; with these practices, we can learn even more” (p. 151).

So, let’s get back to doing research and forget this whole thing that was briefly mentioned in the title called “concerns about reproducibility.”  Who cares that only 25% of social psychology studies from 2008 could be replicated in 2014?  In the meantime, thousands of new discoveries were made, and it is time to make more new discoveries. “We should not get so caught up in perfectionistic concerns that they impede the rapid accumulation and dissemination of research findings” (p. 151).

There you have it, folks.  Don’t worry about recent failed replications. This is just a normal part of science, especially a science that studies fragile, contextually sensitive phenomena. The results from 2008 do not necessarily replicate in 2014, and the results from 2014 may not replicate in 2018.  What we need is fewer replications. We need permanent research because many effects may disappear the moment they are discovered. This is what makes social psychology so exciting.  If you want to study stable phenomena that replicate decade after decade, you might as well become a personality psychologist.


A replicability analysis of “I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dijksterhuis, A. (2004). I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning. Journal of Personality and Social Psychology, 86(2), 345-355.

DOI: 10.1037/0022-3514.86.2.345

There are a lot of articles with questionable statistical results and it seems pointless to single out particular articles.  However, once in a while, an article catches my attention and I will comment on the statistical results in it.  This is one of these articles….

The format of this review highlights why articles like this one passed peer review and are cited at high frequency as if they provided empirical facts.  The reason is a phenomenon called “verbal overshadowing.”   In work on eyewitness testimony, participants first see the picture of a perpetrator. Before the actual line-up task, they are asked to give a verbal description of the perpetrator.  The verbal description can distort the memory of the actual face and lead to a higher rate of misidentifications.  Something similar happens when researchers read articles. Sometimes they only read abstracts, but even when they read the article, the words can overshadow the actual empirical results. As a result, memory is more strongly influenced by the verbal descriptions than by the cold and hard statistical facts.

In the first part, I will present the results of the article verbally without numbers. In the second part, I will present only the numbers.

Part 1:

In the article “I Like Myself but I Don’t Know Why: Enhancing Implicit Self-Esteem by Subliminal Evaluative Conditioning” Ap Dijksterhuis reports the results of six studies (1-4, 5a, 5b).  All studies used a partially or fully subliminal evaluative conditioning task to influence implicit measures of self-esteem. The abstract states: “Participants were repeatedly presented with trials in which the word I was paired with positive trait terms. Relative to control conditions, this procedure enhanced implicit self-esteem.”  Study 1 used preferences for initials to measure implicit self-esteem, and “results confirmed the hypothesis that evaluative conditioning enhanced implicit self-esteem.” (p. 348). Study 2 modified the control condition and showed that “participants in the conditioned self-esteem condition showed higher implicit self-esteem after the treatment than before the treatment, relative to control participants” (p. 348).  Experiment 3 changed the evaluative conditioning procedure. Now, both the CS and the US (positive trait terms) were presented subliminally for 17 ms.  It also used the Implicit Association Test to measure implicit self-esteem.  The results showed that the “difference in response latency between blocks was much more pronounced in the conditioned self-esteem condition, indicating higher self-esteem” (p. 349).  Study 4 also showed that “participants in the conditioned self-esteem condition exhibited higher implicit self-esteem than participants in the control condition” (p. 350).  Studies 5a and 5b showed that “individuals whose self-esteem was enhanced seemed to be insensitive to personality feedback, whereas control participants whose self-esteem was not enhanced did show effects of the intelligence feedback.” (p. 352).  The General Discussion section summarizes the results: “In our experiments, implicit self-esteem was enhanced through subliminal evaluative conditioning. Pairing the self-depicting word I with positive trait terms consistently improved implicit self-esteem.” (p. 352).  A final conclusion section points out the potential of this work for enhancing self-esteem: “It is worthwhile to explicitly mention an intriguing aspect of the present work. Implicit self-esteem can be enhanced, at least temporarily, subliminally in about 25 seconds.” (p. 353).

 

Part 2:

Study Statistic p z OP
1 F(1,76)=5.15 0.026 2.22 0.60
2 F(1,33)=4.32 0.046 2.00 0.52
3 F(1,14)=8.84 0.010 2.57 0.73
4 F(1,79)=7.45 0.008 2.66 0.76
5a F(1,89)=4.91 0.029 2.18 0.59
5b F(1,51)=4.74 0.034 2.12 0.56

All six studies produced statistically significant results. To achieve this outcome, two conditions have to be met: (a) the effect exists and (b) sampling error is small enough to avoid a failed study (i.e., a non-significant result even though the effect is real).   The probability of obtaining a significant result is called power. The last column shows observed power. Observed power can be used to estimate the actual power of the six studies. Median observed power is 60%.  With 60% power, we would expect only 60% of the 6 studies (3.6 studies) to produce a significant result, yet all six studies show a significant result.  The excess of significant results shows that the results in this article present an overly positive picture of the robustness of the effect.  If these six studies were replicated exactly, we would not expect to obtain six significant results again.  Moreover, the inflation of significant results also leads to an inflated power estimate. The R-Index corrects for this inflation by subtracting the inflation rate (100% observed success rate – 60% median observed power) from the power estimate.  The R-Index is .60 – .40 = .20.  Results with such a low R-Index often do not replicate in independent replication attempts.

Another method to examine the replicability of these results is to examine the variability of the z-scores (second-to-last column).  Each z-score reflects the strength of evidence against the null hypothesis. Even if the same study is replicated exactly, this measure will vary as a function of random sampling.  The expected variance is approximately 1 (the variance of a standard normal distribution).  The low observed variance suggests that future studies will produce more variable results, and with p-values close to .05 this means that some future studies are expected to produce non-significant results.  This bias test is called the Test of Insufficient Variance (TIVA).  The variance of the z-scores is Var(z) = 0.07.  The probability of obtaining such restricted variance by chance is p = .003 (about 1 in 300).
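For completeness, the same computations can be applied to the six studies above; a compact sketch (again my own code, with z-scores taken from the table and a two-tailed alpha of .05):

z  <- c(2.22, 2.00, 2.57, 2.66, 2.18, 2.12)   # z-scores from the table
op <- pnorm(z - qnorm(.975))                  # observed power at alpha = .05

r.index <- median(op) - (1 - median(op))      # ~ .60 - .40 = .20
var.z   <- var(z)                             # ~ 0.07
p.tiva  <- pchisq((length(z) - 1) * var.z, df = length(z) - 1)  # ~ .003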

Based on these results, the statistical evidence presented in this article is questionable and does not provide support for the conclusion that subliminal evaluative conditioning can enhance implicit self-esteem.  Another problem with this conclusion is that implicit self-esteem measures have low reliability and low convergent validity.  As a result, we would not expect strong and consistent effects of any experimental manipulation on these measures.  Finally, even if a small and reliable effect could be obtained, it remains an open question whether this effect reflects a change in implicit self-esteem or whether the manipulation produces a systematic bias in the measurement of implicit self-esteem.  “It is not yet known how long the effects of this manipulation last. In addition, it is not yet known whether people who could really benefit from enhanced self-esteem (i.e., people with problematically low levels of self-esteem) can benefit from subliminal conditioning techniques.” (p. 353).  Twelve years later, we may wonder whether these results have been replicated in other laboratories and whether these effects last more than a few minutes after the conditioning experiment.

If you like Part I better, feel free to boost your self-esteem here.


 

Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size.  We present an alternative Bayesian multilevel approach that avoids the pitfalls of Bayesian Evidence Synthesis and show that the original set of studies provides at best weak evidence for an effect of social norms on the reuse of towels.
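For readers who want a concrete picture of what such a Bayesian multilevel model looks like, here is a minimal sketch using Bürkner’s brms package. The data frame and its numbers are toy placeholders, not the towel-reuse data analyzed in the paper, and the priors are only examples of a two-sided prior on the average effect and a weakly informative prior on between-study heterogeneity; the published re-analysis may differ in its details.

library(brms)

# Toy data for illustration only -- NOT the data from the original studies.
set.seed(1)
towels <- data.frame(
  study = factor(1:7),
  d     = rnorm(7, 0.2, 0.2),   # placeholder effect size estimates
  se_d  = runif(7, 0.10, 0.30)  # placeholder standard errors
)

# Random-effects meta-analysis: each study has its own true effect drawn from a
# normal distribution with mean mu and between-study SD tau, rather than one
# fixed effect pooled across a heterogeneous set of studies.
fit <- brm(
  d | se(se_d) ~ 1 + (1 | study),
  data  = towels,
  prior = c(
    prior(normal(0, 1),   class = Intercept),  # two-sided prior on mu
    prior(cauchy(0, 0.3), class = sd)          # weakly informative prior on tau
  )
)
summary(fit)  # posterior for the average effect (mu) and heterogeneity (tau)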


Peer-Reviews from Psychological Methods

Times are changing. Media are flooded with fake news and journals are filled with fake novel discoveries. The only way to fight bias and fake information is full transparency and openness.
 
Jerry Brunner and I submitted a paper that examines the validity of z-curve, the method underlying powergraphs, to Psychological Methods.

As soon as we submitted it, we made the manuscript and the code available online. Nobody used the opportunity to comment on the manuscript. Now we have received the official reviews.

We would like to thank the editor and reviewers for spending time and effort on reading (or at least skimming) our manuscript and writing comments.  Normally, this effort would be largely wasted because, like many other authors, we are going to ignore most of their well-meaning comments and suggestions and try to publish the manuscript mostly unchanged somewhere else. As the editor pointed out, we are hopeful that our manuscript will eventually be published because 95% of written manuscripts eventually get published. So, why change anything?  However, we think the work of the editor and reviewers deserves some recognition, and some readers of our manuscript may find their comments valuable. Therefore, we are happy to share them for readers interested in replicability and our method of estimating replicability from test statistics in original articles.

 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Dear Dr. Brunner,

I have now received the reviewers’ comments on your manuscript. Based on their analysis and my own evaluation, I can no longer consider this manuscript for publication in Psychological Methods. There are two main reasons that I decided not to accept your submission. The first deals with the value of your statistical estimate of replicability. My first concern is that you define replicability specifically within the context of NHST by focusing on power and p-values. I personally have fewer problems with NHST than many methodologists, but given the fact that the literature is slowly moving away from this paradigm, I don’t think it is wise to promote a method to handle replicability that is unusable for studies that are conducted outside of it. Instead of talking about replicability as estimating the probability of getting a significant result, I think it would be better to define it in more continuous terms, focusing on how similar we can expect future estimates (in terms of effect sizes) to be to those that have been demonstrated in the prior literature. I’m not sure that I see the value of statistics that specifically incorporate the prior sample sizes into their estimates, since, as you say, these have typically been inappropriately low.

Sure, it may tell you the likelihood of getting significant results if you conducted a replication of the average study that has been done in the past. But why would you do that instead of conducting a replication that was more appropriately powered?

Reviewer 2 argues against the focus on original study/replication study distinction, which would be consistent with the idea of estimating the underlying distribution of effects, and from there selecting sample sizes that would produce studies of acceptable power. Reviewer 3 indicates that three of the statistics you discussed are specifically designed for single studies, and are no longer valid when applied to sets of studies, although this reviewer does provide information about how these can be corrected.

The second main reason, discussed by Reviewer 1, is that although your statistics may allow you to account for selection biases introduced by journals not accepting null results, they do not allow you to account for selection effects prior to submission. Although methodologists will often bring up the file drawer problem, it is much less of an issue than people believe. I read about a survey in a meta-analysis text (I unfortunately can’t remember the exact citation) that indicated that over 95% of the studies that get written up eventually get published somewhere. The journal publication bias against non-significant results is really more an issue of where articles get published, rather than if they get published. The real issue is that researchers will typically choose not to write up results that are non-significant, or will suppress non-significant findings when writing up a study with other significant findings. The latter case is even more complicated, because it is often not just a case of including or excluding significant results, but is instead a case where researchers examine the significant findings they have and then choose a narrative that makes best use of them, including non-significant findings when they are part of the story but excluding them when they are irrelevant. The presence of these author-side effects means that your statistic will almost always be overestimating the actual replicability of a literature.

The reviewers bring up a number of additional points that you should consider. Reviewer 1 notes that your discussion of the power of psychological studies is 25 years old, and therefore likely doesn’t apply. Reviewer 2 felt that your choice to represent your formulas and equations using programming code was a mistake, and suggests that you stick to standard mathematical notation when discussing equations. Reviewer 2 also felt that you characterized researcher behaviors in ways that were more negative than is appropriate or realistic, and that you should tone down your criticisms of these behaviors. As a grant-funded researcher, I can personally promise you that a great many researchers are concerned about power, since you cannot receive government funding without presenting detailed power analyses. Reviewer 2 noted a concern with the use of web links in your code, in that this could be used to identify individuals using your syntax. Although I have no suspicions that you are using this to keep track of who is reviewing your paper, you should remove those links to ensure privacy. Reviewer 1 felt that a number of your tables were not necessary, and both reviewers 2 and 3 felt that there were parts of your writing that could be notably condensed. You might consider going through the document to see if you can shorten it while maintaining your general points. Finally, reviewer 3 provides a great many specific comments that I feel would greatly enhance the validity and interpretability of your results. I would suggest that you attend closely to those suggestions before submitting to another journal.

For your guidance, I append the reviewers’ comments below and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.

Sincerely, Jamie DeCoster, PhD
Associate Editor
Psychological Methods

 

Reviewers’ comments:

Reviewer #1:

The goals of this paper are admirable and are stated clearly here: “it is desirable to have an alternative method of estimating replicability that does not require literal replication. We see this method as complementary to actual replication studies.”

However, I am bothered by an assumption of this paper, which is that each study has a power (for example, see the first two paragraphs on page 20). This bothers me for several reasons. First, any given study in psychology will often report many different p-values. Second, there is the issue of p-hacking or forking paths. The p-value, and thus the power, will depend on the researcher’s flexibility in analysis. With enough researcher degrees of freedom, power approaches 100% no matter how small the effect size is. Power in a preregistered replication is a different story. The authors write, “Selection for significance (publication bias) does not change the power values of individual studies.” But to the extent that there is selection done _within_ a study–and this is definitely happening–I don’t think that quoted sentence is correct.

So I can’t really understand the paper as it is currently written, as it’s not clear to me what they are estimating, and I am concerned that they are not accounting for the p-hacking that is standard practice in published studies.

Other comments:

The authors write, “Replication studies ensure that false positives will be promptly discovered when replication studies fail to confirm the original results.” I don’t think “ensure” is quite right, since any replication is itself random. Even if the null is true, there is a 5% chance that a replication will confirm just by chance. Also many studies have multiple outcomes, and if any appears to be confirmed, this can be taken as a success. Also, replications will not just catch false positives, they will also catch cases where the null hypothesis is false but where power is low. Replication may have the _goal_ of catching false positives, but it is not so discriminating.

The Fisher quote, “A properly designed experiment rarely fails to give …significance,” seems very strange to me. What if an experiment is perfectly designed, but the null hypothesis happens to be true? Then it should have a 95% chance of _not_ giving significance.

The authors write, “Actual replication studies are needed because they provide more information than just finding a significant result again. For example, they show that the results can be replicated over time and are not limited to a specific historic, cultural context. They also show that the description of the original study was sufficiently precise to reproduce the study in a way that it successfully replicated the original result.” These statements seem too strong to me. Successful replication is rejection of the null, and this can happen even if the original study was not described precisely, etc.

The authors write, “A common estimate of power is that average power is about 50% (Cohen 1962, Sedlmeier and Gigerenzer 1989). This means that about half of the studies in psychology have less than 50% power.” I think they are confusing the mean with the median here. Also I would guess that 50% power is an overestimate. For one thing, psychology has changed a lot since 1962 or even 1989 so I see no reason to take this 50% guess seriously.

The authors write, “We define replicability as the probability of obtaining the same result in an exact replication study with the same procedure and sample sizes.” I think that by “exact” they mean “pre-registered” but this is not clear. For example, suppose the original study was p-hacked. Then, strictly speaking, an exact replication would also be p-hacked. But I don’t think that’s what the authors mean. Also, it might be necessary to restrict the definition to pre-registered studies with a single test. Otherwise there is the problem that a paper has several tests, and any rejection will be taken as a successful replication.

I recommend that the authors get rid of tables 2-15 and instead think more carefully about what information they would like to convey to the reader here.

Reviewer #2:

This paper is largely unclear, and in the areas where it is clear enough to decipher, it is unwise and unprofessional.

This study’s main claim seems to be: “Thus, statistical estimates of replicability and the outcome of replication studies can be seen as two independent methods that are expected to produce convergent evidence of replicability.” This is incorrect. The approaches are unrelated. Replication of a scientific study is part of the scientific process, trying to find out the truth. The new study is not the judge of the original article, its replicability, or scientific contribution. It is merely another contribution to the scientific literature. The replicator and the original article are equals; one does not have status above the other. And certainly a statistical method applied to the original article has no special status unless the method, data, or theory can be shown to be an improvement on the original article.

They write, “Rather than using traditional notation from Statistics that might make it difficult for non-statisticians to understand our method, we use computer syntax as notation.” This is a disqualifying stance for publication in a serious scholarly journal, and it would an embarrassment to any journal or author to publish these results. The point of statistical notation is clarity, generality, and cross-discipline understanding. Computer syntax is specific to the language adopted, is not general, and is completely opaque to anyone who uses a different computer language. Yet everyone who understands their methods will have at least seen, and needs to understand, statistical notation. Statistical (i.e., mathematical) notation is the one general language we have that spans the field and different fields. No computer syntax does this. Proofs and other evidence are expressed in statistical notation, not computer syntax in the (now largely unused) S statistical language. Computer syntax, as used in this paper, is also ill-defined in that any quantity defined by a primitive function of the language can change any time, even after publication, if someone changes the function. In fact, the S language, used in this paper, is not equivalent to R, and so the authors are incorrect that R will be more understandable. Not including statistical notation, when the language of the paper is so unclear and self-contradictory, is an especially unfortunate decision. (As it happens I know S and R, but I find the manuscript very difficult to understand without imputing my own views about what the authors are doing. This is unacceptable. It is not even replicable.) If the authors have claims to make, they need to state them in unambiguous mathematical or statistical language and then prove their claims. They do not do any of these things.

It is untrue that “researchers ignore power”. If they do, they will rarely find anything of interest. And they certainly write about it extensively. In my experience, they obsess over power, balancing whether they will find something with the cost of doing the experiment. In fact, this paper misunderstands and misrepresents the concept: Power is not “the long-run probability of obtaining a statistically significant result.” It is the probability that a statistical test will reject a false null hypothesis, as the authors even say explicitly at times. These are very different quantities.

This paper accuses “researchers” of many other misunderstandings. Most of these are theoretically incorrect or empirically incorrect. One point of the paper seems to be “In short, our goal is to estimate average power of a set of studies with unknown population effect sizes that can assume any value, including zero.” But I don’t see why we need to know this quantity or how the authors’ methods contribute to us knowing it. The authors make many statistical claims without statistical proofs, without any clear definition of what their claims are, and without empirical evidence. They use simulation that inquires about a vanishingly small portion of the sample space to substitute for an infinite domain of continuous parameter values; they need mathematical proofs but do not even state their claims in clear ways that are amenable to proof.

No coherent definition is given of the quantity of interest. “Effect size” is not generic and hypothesis tests are not invariant to the definition, even if it is true that they are monotone transformations of each other. One effect size can be “significant” and a transformation of the effect size can be “not significant” even if calculated from the same data. This alone invalidates the authors’ central claims.

The first 11.5 pages of this paper should be summarized in one paragraph. The rest does not seem to contribute anything novel. Much of it is incorrect as well. Better to delete throat clearing and get on with the point of the paper.

I’d also like to point out that the authors have hard-coded URL links to their own web site in the replication code. The code cannot be run without making a call to the author’s web site, and recording the reviewer’s IP address in the authors’ web logs. Because this enables the authors to track who is reviewing the manuscript, it is highly inappropriate. It also makes it impossible to replicate the authors results. Many journals (and all federal grants) have prohibitions on this behavior.

I haven’t checked whether Psychological Methods has this rule, but the authors should know better regardless.

Reviewer 3

Review of “How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies”

It was my pleasure to review this manuscript. The authors compare four methods of estimating replicability. One undeniable strength of the general approach is that these measures of replicability can be computed before or without actually replicating the study/studies. As such, one can see the replicability measure of a set of statistically significant findings as an index of trust in these findings, in the sense that the measure provides an estimate of the percentage of these studies that is expected to be statistically significant when replicating them under the same conditions and same sample size (assuming the replication study and the original study assess the same true effect). As such, I see value in this approach. However, I have many comments, major and minor, which will enable the authors to improve their manuscript.

Major comments

1. Properties of index.

What I miss, and what would certainly be appreciated by the reader, is a description of properties of the replicability index. This would include that it has a minimum value equal to 0.05 (or more generally, alpha), when the set of statistically significant studies has no evidential value. Its maximum value equals 1, when the power of studies included in the set was very large. A value of .8 corresponds to the situation where statistical power of the original situation was .8, as often recommended. Finally, I would add that both sample size and true effect size affect the replicability index; a high value of say .8 can be obtained when true effect size is small in combination with a large sample size (you can consider giving a value of N, here), or with a large true effect size in combination with a small sample size (again, consider giving values).

Consider giving a story like this early, e.g. bottom of page 6.

2. Too long explanations/text

Perhaps it is a matter of taste, but sometimes I consider explanations much too long. Readers of Psychological Methods may be expected to know some basics. To give you an example, the text on page 7 in “Introduction of Statistical Methods for Power estimation” is very long. I believe its four paragraphs can be summarized into just one; particularly the first one can be summarized in one or two sentences. Similarly, the section on “Statistical Power” can be shortened considerably, imo. Other specific suggestions for shortening the text, I mention below in the “minor comments” section. Later on I’ll provide one major comment on the tables, and how to remove a few of them and how to combine several of them.

3. Wrong application of ML, p-curve, p-uniform

This is THE main comment, imo. The problem is that ML (Hedges, 1984), p-curve, p-uniform, enable the estimation of effect size based on just ONE study. Moreover,  Simonsohn (p-curve) as well as the authors of p-uniform would argue against estimating the average effect size of unrelated studies. These methods are meant to meta-analyze studies on ONE topic.

4. P-uniform and p-curve section, and ML section

This section needs a major revision. First, I would start the section with describing the logic of the method. Only statistically significant results are selected. Conditional on statistical significance, the methods are based on conditional p-values (not just p-values), and then I would provide the formula on top of page 18. Most importantly, these techniques are not constructed for estimating effect size of a bunch of unrelated studies. The methods should be applied to related studies. In your case, to each study individually. See my comments earlier.

Ln(p), which you use in your paper, is not a good idea here for two reasons: (1) it is most sensitive to heterogeneity (which is also put forward by Van Assen et al (2014)), and (2) applied to single studies it estimates effect size such that the conditional p-value equals 1/e, rather than .5 (resulting in less nice properties).

The ML method, as it was described, focuses on estimating effect size using one single study (see Hedges, 1984). So I was very surprised to see it applied differently by the authors. Applying ML in the context of this paper should be the same as p-uniform and p-curve, using exactly the same conditional probability principle. So, the only difference between the three methods is the method of optimization. That is the only difference.

You develop a set-based ML approach, which needs to assume a distribution of true effect size. As said before, I leave it up to you whether you still want to include this method. For now, I have a slight preference to include the set-based approach because it (i) provides a nice reference to your set-based approach, called z-curve, and (ii) using this comparison you can “test” how robust the set-based ML approach is against a violation of the assumption of the distribution of true effect size.

Moreover, I strongly recommend showing how their estimates differ for certain studies, and include this in a table. This allows you to explain the logic of the methods very well. Here a suggestion. I would provide the estimates of four methods (…) for p-values .04, .025, .01, .001, and perhaps .0001. This will be extremely insightful. For small p-values, the three methods’ estimates will be similar to the traditional estimate. For p-values > .025, the estimate will be negative, for p = .025 the estimate will be (close to) 0. Then, you can also use these same studies and p-values to calculate the power of a replication study (R-index).

I would exclude Figure 1, and the corresponding text. Is not (no longer) necessary.

For the set-based ML approach, if you still include it, please explain how you get to the true value distribution (g(theta)).

5a. The MA set, and test statistics

Many different effect sizes and test statistics exist. Many of them can be transformed to ONE underlying parameter, with a sensible interpretation and certain statistical properties. For instance, the chi2, t, and F(1,df) can all be transformed to d or r, and their SE can be derived. In the RPP project and by Johnson et al (2016) this is called the MA set. Other test statistics, such as F(>1, df) cannot be converted to the same metric, and no SE is defined on that metric. Therefore, the statistics F(>1,df) were excluded from the meta-analyses in the RPP (see the supplementary materials of the RPP) and by Johnson et al (2016) and also Morey and Lakens (2016), who also re-analyzed the data of the RPP.

Fortunately, in your application you do not estimate effect size but only estimate power of a test, which only requires estimating the ncp and not effect size. So, in principle you can include the F(>1,df) statistics in your analyses, which is a definite advantage. Although I can see you can incorporate it for the ML, p-curve, p-uniform approach, I do not directly see how these F(>1,df) statistics can be used for the two set-based methods (ML and z-curve); in the set-based methods, you put all statistics on one dimension (z) using the p-values. How do you defend this?

5b. Z-curve

Some details are not clear to me, yet. How many components (called r in your text) are selected, and why? Your text states: “First, select a ncp parameter m. Then generate Z from a normal distribution with mean m.” I do not understand, since the normal distribution does not have an ncp. Is it that you nonparametrically model the distribution of observed Z, with different components?

Why do you use kernel density estimation? What is its added value? Why make it more imprecise by having this step in between? Please explain.

Except for these details, the procedure and logic of z-curve are clear.

6. Simulations (I): test statistics

I have no reasons, theoretical or empirical, why the analyses would provide different results for Z, t, F(1,df), F(>1,df), chi2. Therefore, I would omit all simulation results of all statistics except 1, and not talk about results of these other statistics. For instance, in the simulations section I would state that results are provided on each of these statistics but present here only the results of t, and of others in supplementary info. When applying the methods to RPP, you apply them to all statistics simultaneously, which you could mention in the text (see also comment 4 above).

7. mean or median power (important)

One of my most important points is the assessment of replicability itself. Consider a set of studies for which replicability is calculated, for each study. So, in case of M studies, there are M replicability indices. Which statistics would be most interesting to report, i.e., are most informative? Note that the distribution of power is far from symmetrical, and actually may be bimodal with modes at 0.05 and 1.  For that reason alone, I would include in any report of replicability in a field the proportion of R-indices equal to 0.05 (which amounts to the proportion of results with .025 < p < .05) and the proportion of R-indices equal to 1.00 (e.g., using two decimals, i.e. > .995). Moreover, because power values of .8 or more are recommended, I would also include the proportion of studies with power > .8.

We also would need a measure of central tendency. Because the distribution is not symmetric, and may be skewed, I recommend using the median rather than the mean. Another reason to use the median rather than the mean is because the mean does not provide useable information on whether methods are biased or not, in the simulations. For instance, if true effect size = 0, because of sampling error the observed power will exceed .05 in exactly 50% of the cases (this is the case for p-uniform; since with probability .5 the p-value will exceed .025) and will be smaller than .05 in the other 50% of the cases. Hence, the median will be exactly equal to .05, whereas the mean will exceed .05. Similarly, if true effect size is large the mean power will be too small (distribution skewed to the left). To conclude, I strongly recommend including the median in the results of the simulation.

In a report, such as for the RPP later on in the paper, I recommend including (i) p(R=.05), (ii) p(R >= .8), (iii) p(R >= .995), (iv) median(R), (v) sd(R), (vi) distribution of R, (vii) mean R. You could also distinguish this for soc psy and cog psy.

8. simulations (II): selection of conditions

I believe it is unnatural to select conditions based on “mean true power” because we are most familiar with effect size and their distribution, and sample sizes and their distribution. I recommend describing these distributions, and then the implied power distribution (surely the median value as well, not or not only the mean).

9. Omitted because it could reveal the identity of the reviewer

10. Presentation of results

I have comments both on what you present and on how you present the results. First, what you present. For the ML and p-methods, I recommend presenting the distribution of R in each of the conditions (at least for fixed true effect size and fixed N, where results can be derived exactly relatively easily). For the set-based methods, if you focus on average R (which I do not recommend; I recommend median R), then present the RMSE. The median absolute error is minimized when you use the median. So, average and RMSE form a pair, and median and median absolute error form a pair.
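A minimal sketch of the pairing I mean (hypothetical numbers): report the mean together with the RMSE, or the median together with the median absolute error.

import numpy as np
def rmse(estimates, truth):
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - truth) ** 2)))
def median_abs_error(estimates, truth):
    estimates = np.asarray(estimates, dtype=float)
    return float(np.median(np.abs(estimates - truth)))
estimates = [0.42, 0.55, 0.61, 0.47, 0.80]   # made-up R estimates from one simulation condition
truth = 0.50                                 # true (average or median) power in that condition
print(rmse(estimates, truth), median_abs_error(estimates, truth))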

Now the presentation of results. The results of p-curve/p-uniform/ML are independent of the number of tests, but those of the set-based methods (your ML variant) and z-curve are not.

Here are the results I recommend presenting:

Fixed effect size, heterogeneous sample size

**For single-study methods, the probability distribution of R (figure), including mean(R), median(R), p(R = .05), p(R >= .995), and sd(R). You could use simulation to approximate this distribution. The figures would look like those in Figure 3, to the right.

**Median power, and mean/sd, as a function of k

**Bias for ML/p-curve/p-uniform amounts to the difference between the median of the distribution and the actual median, or the difference between the average of the distribution and the actual average. Note that this differs from the set-based methods.

**For set-based methods, a table is needed (because of the dependence on k).

Results can be combined in one table (i.e., Tables 2-3, 5-6, etc.).

Significance tests comparing methods

I would exclude Table 4, Table 7, Table 10, and Table 13. These significance tests do not make much sense. One method is better than another, or it is not; significance should not be relevant (with a very large number of iterations, any true difference will show up). You could simply describe in the text which method works best.

Heterogeneity in both sample size and effect size

You could provide similar results as for the fixed effect size case (but not for chi2 or the other statistics). I would also use the same values of k as for the fixed-effect case. For the fixed-effect case you used 15, 25, 50, 100, and 250; I can imagine using k = 10, 30, 100, 400, and 2,000 (or something similar) for both conditions.

Including the k = 10 case is important, because set-based methods will have more problems there, and because one paper, one meta-analysis, or one author may have published just one or a few statistically significant effect sizes. Note, however, that k = 2,000 is only realistic when evaluating a large field.

Simulation of complex heterogeneity

Provide the same results as for fixed effect size and for heterogeneity in both sample size and effect size. It is good to include a condition where the assumption of set-based ML is violated. I do not yet see why a correlation between N and ES would affect the results. Could you explain? For instance, for the ML/p-curve/p-uniform methods, all true effect sizes in combination with N result in a distribution of R across studies; how this distribution is arrived at is not relevant, so I do not yet see the importance of this correlation. That is, the correlation should only affect the results through the distribution of R. More reasoning should be provided here.

Simulation of full heterogeneity

I am ambivalent about this section. If the test statistic does not matter, then what is the added value of this section? Other distributions of sample size could be incorporated in the previous section ("complex heterogeneity"), and other distributions of true effect size could be incorporated there as well. Note that Johnson et al. (2016) use the RPP data to estimate that 90% of effects in psychology reflect a true zero effect; you assume only 10%.

Conservative bootstrap

Why present only the results of z-curve? By changing the limits of the interval, the interpretation becomes a bit awkward; what kind of interval is it now? Most importantly, coverages of .9973 or .9958 are horrible (in my opinion, these coverages are just as bad as coverages of .20). I prefer results for 95% confidence intervals, with their coverages shown in the table. Your 'conservative' CIs are hard to interpret. Note also that this is a paper on the statistical properties of the methods, and one such property is how well the methods perform with respect to the 95% CI.

By the way, examining the 95% CIs of the methods is very valuable.
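For instance, coverage could be tabulated with something like the following sketch (my own illustration with hypothetical interval bounds):

import numpy as np
def coverage(lower, upper, truth):
    # Proportion of simulation runs whose interval contains the true value.
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((lower <= truth) & (truth <= upper)))
# A well-calibrated nominal 95% interval should give coverage close to .95,
# not .9958 or .9973 (and not .20 either).
print(coverage([0.40, 0.35, 0.52], [0.60, 0.58, 0.71], truth=0.50))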

11. RPP

In my opinion, this section should be expanded substantially. This is where you can finally test your methodology, using real data! What I would add is the following:

**Provide the distribution of R (including all statistics mentioned previously, i.e., p(R = .05), p(R >= .8), p(R >= .995), median(R), mean(R), sd(R)), using single-study methods.

**Provide the previously mentioned results for soc psy and cog psy separately.

**Provide the results of z-curve, and show your kernel density curve (it is strange that you never show this curve, if it is important in your algorithm).

What would be really great is if you predicted the probability of replication success (power) using the effect size estimate from the original study (derived from a single study) and the N of the replication sample. You could make a graph with this power on the X-axis and the result of the replication on the Y-axis. Strong evidence in favor of your method would be if your result predicts future replicability better than any other index (see RPP for what they tried). Logistic regression seems to be the most appropriate technique for this.

Using multiple logistic regression, you can also assess whether other indices have added value beyond your predictions.
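A sketch of both analyses (made-up data and hypothetical variable names; statsmodels is just one way to fit the logistic regressions):

import numpy as np
import statsmodels.api as sm
# Made-up stand-ins for the RPP studies: the power implied by the original
# effect size estimate and the replication N, another index, and the outcome.
predicted_power = np.array([0.15, 0.80, 0.35, 0.92, 0.55, 0.25, 0.70, 0.60])
other_index     = np.array([0.30, 0.75, 0.40, 0.85, 0.50, 0.20, 0.65, 0.55])
replicated      = np.array([0, 1, 0, 1, 1, 0, 1, 0])   # 1 = significant replication
# Simple logistic regression: does predicted power predict replication success?
fit1 = sm.Logit(replicated, sm.add_constant(predicted_power)).fit(disp=False)
# Multiple logistic regression: does the other index add anything beyond that?
X2 = sm.add_constant(np.column_stack([predicted_power, other_index]))
fit2 = sm.Logit(replicated, X2).fit(disp=False)
print(fit1.params, fit2.params)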

To conclude, for now the results you provide are too limited to convince readers that your approach is very useful.

Minor comments

P4 top: "heated debates". A few more sentences on this debate, including references to those debates, would be fair. I would like to mention/recommend the study of Maxwell et al. (2015) in American Psychologist, the comment on the OSF piece in Science and its response, and the very recent piece by Valen E. Johnson et al. (2016).

P4, middle: consider starting a new paragraph at "Actual replication". In the sentence after this one, you may add "or not".

Another advantage of replication is that it may reveal heterogeneity (context dependence). Here, you may refer to the ManyLabs studies, which indeed reveal heterogeneity in about half of the replicated effects. Then, the next paragraph may start with "At the same time". To conclude, the piece starting with "Actual replication" could be expanded a bit.

P4, bottom, "In contrast": This sentence and the preceding one are formulated as if sampling error does not exist. That is much too strong! Moreover, if the replication study had low power, sampling error is a likely reason for a statistically non-significant result. Here you can be more careful/precise. The last sentence of this paragraph is perfect.

P5, middle: consider adding more references on estimates of power in psychology, e.g., Bakker and Wicherts's estimate of 35% and the study on neuroscience with power estimates close to 20%. Last sentence of the same paragraph: this assumes the same true effect and the same sample size.

P6, first paragraph, around Rosenthal: consider referring to the study of Johnson et al. (2016), who used a Bayesian analysis to estimate how many non-significant studies remain unpublished.

P7, top: "studies have the same power (homogeneous case)" ... "(heterogeneous case)". This is awkward. Homogeneity and heterogeneity are generally reserved for variation in true effect size; stick to that. Another problem here is that "heterogeneous" power can be created by heterogeneity in sample size and/or heterogeneity in effect size. These should be distinguished, because some methods can deal with heterogeneous power caused by heterogeneous N, but not with heterogeneous true effect size. So, here, I would simply delete the text between brackets.

P7, last sentence of the first paragraph: I do not understand this sentence.

P10, “average power”. I did not understand this sentence.

P10, bottom: Why do you believe these methods to be most promising?

P11, 2nd paragraph: rephrase this sentence. Heterogeneity of effect size is not due to sampling variation. Later in this paragraph you also mix up heterogeneity with variation in power again. Of course, you could redefine heterogeneity, but I strongly recommend not doing so (in order not to confuse others); reserve heterogeneity for heterogeneity in true effect size.

P11, 3rd paragraph, 1st sentence: I do not understand this sentence. But then again, it may not be relevant (see major comments), because heterogeneity of effect size is not relevant for applying p-uniform and p-curve.

P11 bottom: maximum likelihood method. This sentence is not specific enough. But then again, this sentence may not be relevant (see major comments).

P12: "Statistics" should be written without a capital.

P12: "random sampling distribution": delete "random". By the way, I liked this section on Notation and statistical background.

Section "Two populations of power": I believe this section is unnecessarily long, with a lot of text. Consider shortening it. The spinning wheel analogy is OK.

P16, "close to the first": Do you mean the second?

P16, last paragraph, 1st sentence: English?

Principle 2: The effect on what? Delete the last sentence in the principle.

P17, bottom: include the average power after selection in your example.

p-curve/p-uniform: modify, as explained in one of the major comments.

P20, last sentence: modify the sentence. The ML approach has excellent properties asymptotically, but not when the sample size is small. As it stands, the sentence states that it generally yields more precise estimates.

P25, last sentence of 4: consider deleting this sentence (it does not add anything useful).

P32: "We believe that a negative correlation between": part of this sentence appears to be missing.

P38, penultimate sentence: explain what you mean by "decreasing the lower limit by .02" and "increasing the upper limit by .02".


How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Manuscript under review, copyright belongs to Jerry Brunner and Ulrich Schimmack


Jerry Brunner and Ulrich Schimmack
University of Toronto @ Mississauga

Abstract
In the past five years, the replicability of original findings published in psychology journals has been questioned. We show that replicability can be estimated by computing the average power of studies. We then present four methods that can be used to estimate the average power for a set of studies that were selected for significance: p-curve, p-uniform, maximum likelihood, and z-curve. We present the results of large-scale simulation studies with both homogeneous and heterogeneous effect sizes. All methods work well with homogeneous effect sizes, but only maximum likelihood and z-curve produce accurate estimates with heterogeneous effect sizes. All methods overestimate replicability when applied to the Open Science Collaboration reproducibility project, and we discuss possible reasons for this. Based on the simulation studies, we recommend z-curve as a valid method to estimate replicability. We also validated a conservative bootstrap confidence interval that makes it possible to use z-curve with small sets of studies.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, P-curve, P-uniform, Z-curve, Effect size, Replicability, Simulation.

Link to manuscript:  http://www.utstat.utoronto.ca/~brunner/zcurve2016/HowReplicable.pdf

Link to website with technical supplement:
http://www.utstat.utoronto.ca/~brunner/zcurve2016/