All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

P-REP (2005-2009): Reexamining the experiment to replace p-values with the probability of replicating an effect

In 2005, Psychological Science published an article titled “An Alternative to Null-Hypothesis Significance Tests” by Peter R. Killeen. The article proposed to replace p-values and significance testing with a new statistic: the probability of replicating an effect (p-rep). The article generated a lot of excitement, and for a period from 2006 to 2009, Psychological Science encouraged reporting p-rep. After some statistical criticism and after a new editor took over Psychological Science, interest in p-rep declined (see Figure).

It is ironic that only a few years later, psychological science would encounter a replication crisis, in which several famous experiments did not replicate. Despite much discussion about the replicability of psychological science in recent years, Killeen’s attempt to predict replication outcomes has hardly been mentioned. This blog post reexamines p-rep in the context of the current replication crisis.

The abstract clearly defines p-rep as an estimate of “the probability of replicating an effect” (p. 345), which is the core meaning of replicability. Factories have high replicability (6 sigma) and produce virtually identical products that work with high probability. However, in empirical research it is not so easy to define what it means to get the same result. So, the first step in estimating replicability is to define the result of a study that a replication study aims to replicate.

“Traditionally, replication has been viewed as a second successful attainment of a significant effect” (Killeen, 2005, p. 349). Viewed from this perspective, p-rep would estimate the probability of obtaining a significant result (p < alpha) after observing a significant result in an original study.

Killeen proposes to change the criterion to the sign of the observed effect size. This implies that p-rep can only be applied to directional hypotheses (e.g., it does not apply to tests of explained variance). The criterion for a successful replication then becomes observing an effect size with the same sign as the original study.

Although this may appear like a radical change from null-hypothesis significance testing, this is not the case.  We can translate the sign criterion into an alpha level of 50% in a one-tailed t-test.  For a one-tailed t-test, negative effect sizes have p-values ranging from 1 to .50 and positive effect sizes have p-values ranging from .50 to 0.  So, a successful outcome is associated with a p-value below .50 (p < .50).

If we observe a positive effect size in the original study, we can compute the power of obtaining a positive result in a replication study with a post-hoc power analysis, where we enter information about the standardized effect size, the sample size, and alpha = .50, one-tailed.

Using R syntax this can be achieved with the formula:

pt(obs.es/se, N-2)

with obs.es being the observed standardized effect size (Cohen’s d), N = total sample size, and se = sampling error = 2/sqrt(N).

The similarity to p-rep is apparent when we look at the formula for p-rep.

pnorm(obs.es/se/sqrt(2))

There are two differences. First, p-rep uses the standard normal distribution to estimate power. This is a simplification that ignores the degrees of freedom. The more accurate formula for power uses the non-central t-distribution, which takes the degrees of freedom (N-2) into account. However, even with modest sample sizes of N = 40, this simplification has negligible effects on power estimates.

The second difference is that p-rep reduces the non-centrality parameter (effect size/sampling error) by a factor of square-root 2.  Without going into the complex reasoning behind this adjustment, the end-result of the adjustment is that p-rep will be lower than the standard power estimate.

Using Killeen’s example on page 347 with d = .5 and N = 20, p-rep = .785.  In contrast, the power estimate with alpha = .50 is .861.
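A minimal R sketch of this comparison, using Killeen's example values and the variable names from the formulas above:

obs.es <- .5              # observed standardized effect size (Cohen's d)
N      <- 20              # total sample size
se     <- 2 / sqrt(N)     # sampling error of d
ncp    <- obs.es / se     # non-centrality parameter

pnorm(ncp / sqrt(2))      # p-rep, ~.785
pt(ncp, N - 2)            # observed power with alpha = .50 (one-tailed), ~.861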

The comparison of p-rep with standard power analysis brings up an interesting and unexplored question. “Does p-rep really predict the probability of replication?”  (p. 348).  Killeen (2005) uses meta-analyses to answer this question.  In one example, he found that 70% of studies showed a negative relation between heart rate and aggressive behaviors.  The median value of p-rep over those studies was 71%.  Two other examples are provided.

A better way to evaluate estimates of replicability is to conduct simulation studies where the true answer is known. For example, a simulation study can simulate 100,000 exact replications of Killeen’s example with d = .5 and N = 20, and we can observe how many studies show a positive observed effect size. In a single run of this simulation, 86,842 studies (86.8%) showed a positive sign. The median p-rep (.788) underestimates this actual success rate, whereas median observed power more closely predicts the observed success rate (.861).
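A minimal simulation sketch of this check, using the same normal approximation to the sampling distribution of d that underlies the formulas above (the seed is arbitrary):

set.seed(1)
d  <- .5
N  <- 20
se <- 2 / sqrt(N)
obs.d <- rnorm(1e5, mean = d, sd = se)   # observed effect sizes in 100,000 replications
mean(obs.d > 0)                          # share of positive signs, ~.87
median(pnorm((obs.d / se) / sqrt(2)))    # median p-rep, ~.79
median(pt(obs.d / se, N - 2))            # median observed power with alpha = .50, ~.86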

This is not surprising.  Power analysis is designed to predict the long-term success rate given a population effect size, a criterion value, and sampling error.  The adjustment made by Killeen is unnecessary and leads to the wrong prediction.

P-rep applied to Single Studies

It is also peculiar to use meta-analyses to test the performance of p-rep because a meta-analysis implies that many studies have been conducted, whereas the goal of p-rep was to predict the outcome of a single replication study from the outcome of an original study.

This primary aim also explains the adjustment to the non-centrality parameter, which was based on the idea of adding the sampling variances of the original and the replication study. Finally, Killeen clearly states that the goal of p-rep is to ignore population effect sizes and to define replicability as “an effect of the same sign as that found in the original experiment” (p. 346). This is very different from power analysis, which estimates the probability of obtaining an effect with the same sign as the population effect size.

We can evaluate p-rep as a predictor of obtaining effect sizes with the same direction in two studies with another simulation study.  Assume that the effect size is d = .20 and the total sample size is also small (N = 20).  The median p-rep estimate is 62%.

The 2 x 2 table shows how often the effect sizes of the original study and the replication study match.

Original \ Replication    Negative    Positive
Negative                  11%         22%
Positive                  22%         45%

The table shows that the original and replication study match only 45% of the time when the sign also matches the population effect size. Another 11% of matches occur when both the original and the replication study show the wrong sign, even though future replication studies would be more likely to show the opposite sign. Although these cases meet the definition of replicability with the sign of the original study as criterion, it seems questionable to count a pair of studies that both show the wrong result as a successful replication. Furthermore, the median p-rep estimate of 62% is inconsistent with both the correctly matched cases (45%) and the total number of matched cases (45% + 11% = 56%). In conclusion, it is neither sensible to define replicability as consistency between pairs of exact replication studies, nor does p-rep estimate this probability very well.
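The 2 x 2 table can be reproduced with a short simulation sketch, again using the normal approximation to the sampling distribution of d (the seed is arbitrary):

set.seed(1)
d  <- .20
N  <- 20
se <- 2 / sqrt(N)
orig <- rnorm(1e5, d, se)   # observed effect sizes in original studies
rep  <- rnorm(1e5, d, se)   # observed effect sizes in replication studies
round(prop.table(table(original = orig > 0, replication = rep > 0)), 2)
median(pnorm((orig / se) / sqrt(2)))   # median p-rep, ~.62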

Can we fix it?

The previous examination of p-rep showed that it is essentially an observed power estimate with alpha = 50% and an attenuated non-centrality parameter.  Does this mean we can fix p-rep and turn it into a meaningful statistic?  In other words, is it meaningful to compute the probability that future replication studies will reveal the direction of the population effect size by computing power with alpha = 50%?

For example, a researcher finds an effect size of d = .4 with a total sample size of N = 100. Using a standard t-test, the researcher can report the traditional p-value, p = .048. The table below shows how often the signs of the original and the replication study match in a simulation of this scenario.

Original \ Replication    Negative    Positive
Negative                  0%          2%
Positive                  2%          96%

The simulation results show that most pairs of studies show consistent signs that also match the sign of the population effect size. Median observed power, the new p-rep, is 98%. So, is a high p-rep value a good indicator that future studies will also produce a positive sign?

The main problem with observed power analysis is that it relies on the observed effect size as an estimate of the population effect size.  However, in small samples, the difference between observed effect sizes and population effect sizes can be large, which leads to very variable estimates of p-rep. One way to alert readers to the variability in replicability estimates is to provide a confidence interval around the estimate.  As p-rep is a function of the observed effect size, this is easily achieved by converting the lower and upper limit of the confidence interval around the effect size into a confidence interval for p-rep.  With d = .4 and N = 100 (sampling error = 2/sqrt(100) = .20), the confidence interval of effect sizes ranges from d = .008 to d = .792.  The corresponding p-rep values are 52% to 100%.
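A sketch of this conversion in R, using a normal-approximation confidence interval for the observed effect size (the exact limits differ slightly if a t-based interval is used):

obs.es <- .4
N      <- 100
se     <- 2 / sqrt(N)                            # .20
ci.d   <- obs.es + c(-1, 1) * qnorm(.975) * se   # ~.01 to ~.79
pnorm((ci.d / se) / sqrt(2))                     # p-rep interval, ~.51 to ~1.00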

Importantly, a value of 50% is the lower bound for p-rep and corresponds to determining the direction of the effect by a coin toss.  In other words, the point estimate of replicability can be highly misleading because the observed effect size may be considerably lower than the population effect size.   This means that reporting the point-estimate of p-rep can give false assurance about replicability, while the confidence interval shows that there is tremendous uncertainty around this estimate.

Understanding Replication Failures

Killeen (2005) pointed out that it can be difficult to understand replication failures using the traditional criterion of obtaining a significant result in the replication study.  For example, the original study may have reported a significant result with p = .04 and the replication study produced a non-significant p-value of p = .06.  According to the criterion of obtaining a significant result in the replication study, this outcome is a disappointing failure.  Of course, there is no meaningful difference between p = .04 and p = .06. It just so happens that they are on opposite sides of an arbitrary criterion value.

Killeen suggests that we can avoid this problem by reporting p-rep. However, p-rep just changes the arbitrary criterion value from p = .05 to d = 0. It is still possible that a replication study will fail because the effect sizes do not match. For example, the effect size in an original study may be d = .05, while the effect size in the replication study is d = -.05. In small samples, this is not a meaningful difference in effect sizes, but the outcome constitutes a replication failure.

There is simply no way around making mistakes in inferential statistics. We can only try to minimize them, for example by reducing sampling error. By setting alpha to 50%, we are reducing type-II errors (failing to support a correct hypothesis) at the expense of increasing the risk of a type-I error (accepting a hypothesis that is false), but errors will be made.

P-rep and Publication Bias

Killeen (2005) points out another limitation of p-rep.  “One might, of course, be misled by a value of prep that itself cannot be replicated. This can be caused by publication bias against small or negative effects.” (p. 350).  Here we see the real problem of raising alpha to 50%.  If there is no effect (d = 0), one out of two studies will produce a positive result that can be published.  If 100 researchers test an interesting hypothesis in their lab, but only positive results will be published, approximately 50 articles will support a false conclusion, while 50 other articles that showed the opposite result will be hidden in file drawers.  A stricter alpha criterion is needed to minimize the rate of false inferences, especially when publication bias is present.

A counter-argument could be that researchers who find a negative result can also publish their results, because positive and negative results are equally publishable. However, this would imply that journals are filled with inconsistent results and that research areas with small effects and small samples would publish nearly equal numbers of studies with positive and negative results. Each article would draw a conclusion based on the results of a single study and try to explain inconsistent results with potential moderator variables. By imposing a stricter criterion for sufficient evidence, published results are more consistent and more likely to reflect a true finding. This is especially true if studies have sufficient power to reduce the risk of type-II errors and if journals do not selectively report studies with positive results.

Does this mean estimating replicability is a bad idea?

Although Killeen’s (2005) main goal was to predict the outcome of a single replication study, he did explore how well median replicability estimates predicted the outcome of meta-analyses. As aggregation across studies reduces sampling error, replicability estimates based on sets of studies can be useful to predict actual success rates in studies (Sterling et al., 1995). The comparison of median observed power with actual success rates can be used to reveal publication bias (Schimmack, 2012), and median observed power is a valid predictor of future study outcomes in the absence of publication bias and for homogeneous sets of studies. More advanced methods even make it possible to estimate replicability when publication bias is present and when the set of studies is heterogeneous (Brunner & Schimmack, 2016). So, while p-rep has a number of shortcomings, the idea of estimating replicability deserves further attention.

Conclusion

The rise and fall of p-rep in the first decade of the 2000s tells an interesting story about psychological science. In hindsight, the popularity of p-rep is consistent with a field that focused more on discoveries than on error control. Ideally, every study, no matter how small, would be sufficient to support inferences about human behavior. The criterion to produce a p-value below .05 was deemed an “unfortunate historical commitment to significance testing” (p. 346), when psychologists were only interested in the direction of the observed effect size in their sample. Apparently, there was no need to examine whether the observed effect size in a small sample was consistent with a population effect size or whether the sign would replicate in a series of studies.

Although p-rep never replaced p-values (most published p-rep values convert into p-values below .05), the general principles of significance testing were ignored. Instead of increasing alpha, researchers found ways to lower p-values to meet the alpha = .05 criterion. A decade later, the consequences of this attitude towards significance testing are apparent.  Many published findings do not hold up when they are subjected to an actual replication attempt by researchers who are willing to report successes and failures.

In this emerging new era, it is important to teach a new generation of psychologists how to navigate the inescapable problem of inferential statistics: you will make errors. Either you falsely claim a discovery of an effect or you fail to provide sufficient evidence for an effect that does exist.  Errors are part of science. How many and what type of errors will be made depends on how scientists conduct their studies.


What would Cohen say? A comment on p < .005

Most psychologists are trained in Fisherian statistics, which has become known as Null-Hypothesis Significance Testing (NHST).  NHST compares an observed effect size against a hypothetical effect size. The hypothetical effect size is typically zero; that is, the hypothesis is that there is no effect.  The deviation of the observed effect size from zero relative to the amount of sampling error provides a test statistic (test statistic = effect size / sampling error).  The test statistic can then be compared to a criterion value. The criterion value is typically chosen so that only 5% of test statistics would exceed the criterion value by chance alone.  If the test statistic exceeds this value, the null-hypothesis is rejected in favor of the inference that an effect greater than zero was present.

One major problem of NHST is that non-significant results are not considered. To address this limitation, Neyman and Pearson extended Fisherian statistics and introduced the concepts of type-I (alpha) and type-II (beta) errors. A type-I error occurs when researchers falsely reject a true null-hypothesis; that is, they infer from a significant result that an effect was present when there is actually no effect. The type-I error rate is fixed by the criterion for significance, which is typically p < .05. This means that, in the long run, a set of studies cannot produce more than 5% false-positive results. The maximum of 5% false-positive results would only be observed if all studies tested effects that do not exist. In this case, we would expect 5% significant results and 95% non-significant results.

The important contribution by Neyman and Pearson was to consider the complementary type-II error.  A type-II error occurs when an effect is present, but a study produces a non-significant result.  In this case, researchers fail to detect a true effect.  The type-II error rate depends on the size of the effect and the amount of sampling error.  If effect sizes are small and sampling error is large, test statistics will often be too small to exceed the criterion value.

Neyman-Pearson statistics was popularized in psychology by Jacob Cohen. In 1962, Cohen examined effect sizes and sample sizes (as a proxy for sampling error) in the Journal of Abnormal and Social Psychology and concluded that there is a high risk of type-II errors because sample sizes are too small to detect even moderate effect sizes and inadequate to detect small effect sizes. Over the following decades, methodologists repeatedly pointed out that psychologists often conduct studies with a high risk of failure; that is, of failing to provide empirical evidence for real effects (Sedlmeier & Gigerenzer, 1989).

The concern about type-II errors has been largely ignored by empirical psychologists.  One possible reason is that journals had no problem filling volumes with significant results, while rejecting 80% of submissions that also presented significant results.  Apparently, type-II errors were much less common than methodologists feared.

However, in 2011 it became apparent that the high success rate in journals was illusory. Published results were not representative of studies that were conducted. Instead, researchers used questionable research practices or simply did not report studies with non-significant results.  In other words, the type-II error rate was as high as methodologists suspected, but selection of significant results created the impression that nearly all studies were successful in producing significant results.  The influential “False Positive Psychology” article suggested that it is very easy to produce significant results without an actual effect.  This led to the fear that many published results in psychology may be false positive results.

Doubt about the replicability and credibility of published results has led to numerous recommendations for the improvement of psychological science. One of the most obvious recommendations is to ensure that published results are representative of the studies that are actually being conducted. Given the high type-II error rates, this would mean that journals would be filled with many non-significant and inconclusive results. This is not a very attractive solution because it is not clear what the scientific community can learn from an inconclusive result. A better solution would be to increase the statistical power of studies. Statistical power is the complement of the type-II error rate (power = 1 – beta). As power increases, studies with a true effect have a higher chance of producing a true positive result (e.g., a drug is shown to be an effective treatment for a disease). Numerous articles have suggested that researchers should increase power to increase replicability and credibility of published results (e.g., Schimmack, 2012).

In a recent article, a team of 72 authors proposed another solution. They recommended that psychologists should reduce the probability of a type-I error from 5% (1 out of 20 studies) to 0.5% (1 out of 200 studies).  This recommendation is based on the belief that the replication crisis in psychology reflects a large number of type-I errors.  By reducing the alpha criterion, the rate of type-I errors will be reduced from a maximum of 10 out of 200 studies to 1 out of 200 studies.

I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.  Keeping resources and sampling error constant, reducing the type-I error rate increases the type-II error rate. This is undesirable because the actual type-II error is already large.

For example, a between-subject comparison of two means with a standardized effect size of d = .4 and a sample size of N = 100 (n = 50 per cell) has a 50% risk of a type-II error. The risk of a type-II error rises to 80% if alpha is reduced to .005. It makes no sense to conduct a study with an 80% chance of failure (Tversky & Kahneman, 1971). Thus, the call for a lower alpha implies that researchers will have to invest more resources to discover true positive results. Many researchers may simply lack the resources to meet this stringent significance criterion.
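These two type-II error risks can be checked with R's power.t.test function; a sketch assuming a two-sided, two-sample t-test with sd = 1:

power.t.test(n = 50, delta = .4, sd = 1, sig.level = .05)$power    # ~.51, so beta ~ .49
power.t.test(n = 50, delta = .4, sd = 1, sig.level = .005)$power   # ~.20, so beta ~ .80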

My suggestion is exactly opposite to the recommendation of a more stringent criterion. The main cause of selection bias in journals is that even the existing criterion of p < .05 is too stringent and leads to a high percentage of non-significant results (type-II errors) that cannot be published. This has produced the replication crisis, with large file-drawers of studies with p-values greater than .05, the use of questionable research practices, and publications of inflated effect sizes that cannot be replicated.

To avoid this problem, researchers should use a significance criterion that balances the risk of a type-I and a type-II error. For example, with an expected effect size of d = .4 and N = 100, researchers could use p < .20 as the criterion for significance, which reduces the risk of a type-II error to about 20%. In this case, type-I and type-II errors are approximately balanced. If the study produces a p-value of, say, .15, researchers can publish the result with the conclusion that the study provided evidence for the effect. At the same time, readers are warned that they should not interpret this result as strong evidence for the effect because there is a 20% probability of a type-I error.

Given this positive result, researchers can then follow up their initial study with a larger replication study that allows for stricter type-I error control while holding power constant. With d = .4, they now need N = 200 participants to have 80% power with alpha = .05. Even if the second study does not produce a significant result (the probability that two studies with 80% power are both significant is only 64%; Schimmack, 2012), researchers can combine the results of both studies, and with N = 300, the combined studies have 80% power with alpha = .01.

The advantage of starting with smaller studies and a higher alpha criterion is that researchers are able to test risky hypotheses with a smaller amount of resources. In the example, the first study used “only” 100 participants. In contrast, the proposal to require p < .005 as evidence for an original, risky study implies that researchers need to invest a lot of resources in a risky study that may provide inconclusive results if it fails to produce a significant result. A power analysis shows that a sample size of N = 338 participants is needed to have 80% power for an effect size of d = .4 with p < .005 as the criterion for significance.
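Both sample-size figures (N ~ 200 at alpha = .05 and N ~ 338 at alpha = .005) can be checked with power.t.test; a sketch assuming a two-sided, two-sample t-test with sd = 1 (results are per cell and need to be doubled for the total N):

power.t.test(delta = .4, sd = 1, power = .80, sig.level = .05)$n    # ~99 per cell,  N ~ 200
power.t.test(delta = .4, sd = 1, power = .80, sig.level = .005)$n   # ~169 per cell, N ~ 338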

Rather than investing 300 participants into a risky study that may produce a non-significant and uninteresting result (eating green jelly beans does not cure cancer), researchers may be better able and willing to start with 100 participants and to follow up an encouraging result with a larger follow-up study. The evidential value that arises from one study with 300 participants or two studies with 100 and 200 participants is the same, but requiring p < .005 from the start discourages risky studies and puts even more pressure on researchers to produce significant results when all of their resources are used for a single study. In contrast, a more liberal alpha criterion for initial studies reduces the need for questionable research practices and reduces the risk of type-II errors.

In conclusion, it is time to learn Neyman-Pearson statistics and to remember Cohen’s important contribution that many studies in psychology are underpowered. Low power produces inconclusive results that are not worth publishing. A study with low power is like a high-jumper who sets the bar too high and fails every time; we learn nothing about the jumper’s ability. Scientists may learn from high-jump contests, where jumpers start with lower, realistic heights and then raise the bar when they succeed. In the same manner, researchers should conduct pilot studies or risky exploratory studies with small samples and a high type-I error probability and lower the alpha criterion gradually if the results are encouraging, while maintaining a reasonably low type-II error rate.

Evidently, a significant result with alpha = .20 does not provide conclusive evidence for an effect. However, the arbitrary p < .005 criterion also falls short of demonstrating conclusively that an effect exists. Journals publish thousands of results a year, and some of these results may be false positives even if the error rate is set at 1 out of 200. Thus, p < .005 is neither defensible as a criterion for a first exploratory study nor conclusive evidence for an effect. A better criterion for conclusive evidence is that an effect can be replicated across different laboratories with a type-I error probability of less than 1 out of a billion (6 sigma). This is by no means an unrealistic target. To achieve this criterion with an effect size of d = .4, a sample size of N = 1,000 is needed. The combined evidence of 5 labs with N = 200 per lab would be sufficient to produce conclusive evidence for an effect, but only if there is no selection bias. Thus, the best way to increase the credibility of psychological science is to conduct studies with high power and to minimize selection bias.

This is what I believe Cohen would have said, but even if I am wrong about this, I think it follows from his futile efforts to teach psychologists about type-II errors and statistical power.

How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

Over the past five years, psychological science has been in a crisis of confidence.  For decades, psychologists have assumed that published significant results provide strong evidence for theoretically derived predictions, especially when authors presented multiple studies with internal replications within a single article (Schimmack, 2012). However, even multiple significant results provide little empirical evidence, when journals only publish significant results (Sterling, 1959; Sterling et al., 1995).  When published results are selected for significance, statistical significance loses its ability to distinguish replicable effects from results that are difficult to replicate or results that are type-I errors (i.e., the theoretical prediction was false).

The crisis of confidence led to several initiatives to conduct independent replications. The most informative replication initiative was conducted by the Open Science Collaborative (Science, 2015).  It replicated close to 100 significant results published in three high-ranked psychology journals.  Only 36% of the replication studies replicated a statistically significant result.  The replication success rate varied by journal.  The journal “Psychological Science” achieved a success rate of 42%.

The low success rate raises concerns about the empirical foundations of psychology as a science.  Without further information, a success rate of 42% implies that it is unclear which published results provide credible evidence for a theory and which findings may not replicate.  It is impossible to conduct actual replication studies for all published studies.  Thus, it is highly desirable to identify replicable findings in the existing literature.

One solution is to estimate replicability for sets of studies based on the published test statistics (e.g., F-statistic, t-values, etc.).  Schimmack and Brunner (2016) developed a statistical method, Powergraphs, that estimates the average replicability of a set of significant results.  This method has been used to estimate replicability of psychology journals using automatic extraction of test statistics (2016 Replicability Rankings, Schimmack, 2017).  The results for Psychological Science produced estimates in the range from 55% to 63% for the years 2010-2016 with an average of 59%.   This is notably higher than the success rate for the actual replication studies, which only produced 42% successful replications.

There are two explanations for this discrepancy.  First, actual replication studies are not exact replication studies and differences between the original and the replication studies may explain some replication failures.  Second, the automatic extraction method may overestimate replicability because it may include non-focal statistical tests. For example, significance tests of manipulation checks can be highly replicable, but do not speak to the replicability of theoretically important predictions.

To address the concern about automatic extraction of test statistics, I estimated the replicability of focal hypothesis tests in Psychological Science using hand-coded test statistics from three independent datasets.

Study 1

For Study 1, I hand-coded focal hypothesis tests of all studies in the 2008 Psychological Science articles that were used for the OSC reproducibility project (Science, 2015).

[Figure OSC.PS: powergraph of hand-coded focal hypothesis tests from the 2008 Psychological Science articles used in the OSC reproducibility project]

The powergraphs show the well-known effect of publication bias in that most published focal hypothesis tests report a significant result (p < .05, two-tailed, z > 1.96) or at least a marginally significant result (p < .10, two-tailed or p < .05, one-tailed, z > 1.65). Powergraphs estimate the average power of studies with significant results on the basis of the density distribution of significant z-scores. Average power is an estimate of replicability for a set of exact replication studies. The left graph uses all significant results. The right graph uses only z-scores greater than 2.4, because questionable research practices may produce many just-significant results and lead to biased estimates of replicability. However, both estimation methods produce similar estimates of replicability (57% and 61%). Given the small number of test statistics, the 95% CI is relatively wide (left graph: 44% to 73%). These results are compatible with the low success rate for actual replication studies (42%) and the estimate based on automated extraction (59%).

Study 2

The second dataset was provided by Motyl et al. (JPSP, in press), who coded a large number of articles from social psychology journals and Psychological Science. Importantly, they coded a representative sample of Psychological Science studies from the years 2003, 2004, 2013, and 2014; that is, they did not only code social psychology articles published in Psychological Science. The dataset included 281 test statistics from Psychological Science.

[Figure PS.Motyl: powergraph of focal hypothesis tests from Psychological Science coded by Motyl et al.]

The powergraph looks similar to the powergraph in Study 1.  More important, the replicability estimates are also similar (57% & 52%).  The 95%CI for Study 1 (44% to 73%) and Study 2 (left graph: 49% to 65%) overlap considerably.  Thus, two independent coding schemes and different sets of studies (2008 vs. 2003-2004/2013/2014) produce very similar results.

Study 3

Study 3 was carried out in collaboration with Sivaani Sivaselvachandran, who hand-coded articles from Psychological Science published in 2016. The replicability rankings showed a slight positive trend based on automatically extracted test statistics. The goal of this study was to examine whether hand-coding would also show an increase in replicability. An increase was expected based on an editorial by D. Stephen Lindsay, the incoming editor in 2015, who aimed to increase the replicability of results published in Psychological Science by introducing badges for open data and preregistered hypotheses. However, the results failed to show a notable increase in average replicability.

[Figure PS.2016: powergraph of hand-coded focal hypothesis tests from 2016 Psychological Science articles]

The replicability estimate was similar to those in the first two studies (59% and 59%). The 95% CI ranged from 49% to 70%. These wide confidence intervals make it difficult to detect small improvements, but the histogram shows that just-significant results (z = 2 to 2.2) are still the most prevalent results reported in Psychological Science and that the non-significant results that would be expected are still not reported.

Combined Analysis 

Given the similar results in all three studies, it made sense to pool the data to obtain the most precise estimate of the replicability of results published in Psychological Science. With 479 significant test statistics, replicability was estimated at 58% with a 95% CI ranging from 51% to 64%. This result is in line with the estimate based on automated extraction of test statistics (59%). The reason for the close match between hand-coded and automated results could be that Psychological Science publishes short articles and authors may report mostly focal results because space does not allow for extensive reporting of other statistics. The hand-coded data confirm that replicability in Psychological Science is likely to be above 50%.

[Figure PS.combined: powergraph of the combined hand-coded test statistics from all three studies]

It is important to realize that the 58% estimate is an average. Powergraphs also show average replicability for segments of z-scores. Here we see that replicability for just-significant results (z < 2.5, roughly p > .01) is only 35%. Even for z-scores between 2.5 and 3.0 (roughly p > .001), replicability is only 47%. Once z-scores are greater than 3, average replicability is above 50%, and for z-scores greater than 4, replicability is greater than 80%. For any single study, p-values can vary greatly due to sampling error, but in general a published result with a p-value < .001 is much more likely to replicate than a p-value > .01 (see also OSC, Science, 2015).

Conclusion

This blog post used hand-coding of test statistics published in Psychological Science, the flagship journal of the Association for Psychological Science, to estimate the replicability of published results. Three datasets produced convergent evidence that the average replicability of exact replication studies is 58% +/- 7%. This result is consistent with estimates based on automatic extraction of test statistics. It is considerably higher than the success rate of actual replication studies in the OSC reproducibility project (42%). One possible reason for this discrepancy is that actual replication studies are never exact replication studies, which makes it more difficult to obtain statistical significance when the original studies are selected for significance. For example, the original study may have had an outlier in the experimental group that helped to produce a significant result. Not removing this outlier is not considered a questionable research practice, but an exact replication study will not reproduce the same outlier and may fail to reproduce a just-significant result. More broadly, any deviation from the assumptions underlying the computation of test statistics will increase the bias that is introduced by selecting significant results. Thus, the 58% estimate is an optimistic estimate of the maximum replicability under ideal conditions.

At the same time, it is important to point out that 58% replicability for Psychological Science does not mean psychological science is rotten to the core (Motyl et al., in press) or that most reported results are false (Ioannidis, 2005).  Even results that did not replicate in actual replication studies are not necessarily false positive results.  It is possible that more powerful studies would produce a significant result, but with a smaller effect size estimate.

Hopefully, these analyses will spur further efforts to increase replicability of published results in Psychological Science and in other journals.  We are already near the middle of 2017 and can look forward to the 2017 results.


How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press). 

Forthcoming article: 
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)

Brief Introduction

Since JPSP published incredible evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned. There is talk of a crisis of confidence, a replication crisis, or a credibility crisis. However, hard data on the credibility of empirical findings published in social psychology journals are scarce.

There have been two approaches to examining the credibility of social psychology. One approach relies on replication studies, in which authors attempt to replicate original studies as closely as possible. The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015), which replicated one study from each of 100 articles; 54 articles were classified as social psychology. For original articles that reported a significant result, only a quarter replicated a significant result in the replication studies. This estimate of replicability suggests that researchers conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate. One concern about the OSC results is that replicating original studies can be difficult. For example, a bilingual study in California may not produce the same results as a bilingual study in Canada. It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of the original studies.

A second approach is to estimate the replicability of published results using statistical methods. The advantage of this approach is that the replicability estimates are predictions for exact replication studies, because the original studies themselves provide the data for the estimates. This is the approach used by Motyl et al.

The authors sampled 30% of articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) from four major social psychology journals (JPSP, PSPB, JESP, and PS).  For each study, coders identified one focal hypothesis and recorded the statistical result.  The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs.  Only 19 statistics were z-tests.   The authors applied various statistical tests to the data that test for the presence of publication bias or whether the studies have evidential value (i.e., reject the null-hypothesis that all published results are false positives).  For the purpose of estimating replicability, the most important statistic is the R-Index.

The R-Index has two components. First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly). Second, it computes the percentage of studies with a significant result. In an unbiased set of studies, median observed power and the percentage of significant results should match. Publication bias and questionable research practices produce more significant results than predicted by median observed power. The discrepancy is called the inflation rate. The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present. The R-Index is not a replicability estimate. That is, an R-Index of 30% does not mean that 30% of studies will produce a significant result. However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80. An exception is an R-Index of 50, which is equivalent to a replicability estimate of 50%. If the R-Index is below 50, one would expect more replication failures than successes.
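Based on this description, a minimal R sketch of the R-Index computation might look as follows (my own illustration, not the authors' code; it assumes a vector of absolute z-scores and alpha = .05, two-tailed):

r.index <- function(z, alpha = .05) {
  crit      <- qnorm(1 - alpha / 2)   # critical z-value, 1.96 for alpha = .05
  obs.power <- pnorm(abs(z) - crit)   # observed power of each study
  success   <- mean(abs(z) > crit)    # percentage of significant results
  mop       <- median(obs.power)      # median observed power
  inflation <- success - mop          # inflation rate
  mop - inflation                     # R-Index
}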

Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”

I think this interpretation of the R-Index results is too harsh.  I consider an R-Index below 50 an F (fail).  An R-Index in the 50s is a D, and an R-Index in the 60s is a C.  An R-Index greater than 80 is considered an A.  So, clearly there is a replication crisis, but social psychology is not rotten to the core.

The R-Index is a simple tool, but it is not designed to estimate replicability. Jerry Brunner and I developed a method that can estimate replicability, called z-curve. All test statistics are converted into absolute z-scores, and a kernel density distribution is fitted to the histogram of z-scores. Then a mixture model of normal distributions is fitted to the density distribution, and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate; that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly. Motyl et al. did not have access to this statistical tool. They kindly shared their data, and I was able to estimate replicability with z-curve. For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163). The Figure shows two results. The left panel uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right panel uses only z-scores greater than 2.4, because just-significant results may be compromised by questionable research practices that may bias estimates.
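The first step of this approach, converting reported test statistics into absolute z-scores via their p-values, can be sketched as follows (my illustration of the conversion step only, not the authors' z-curve code):

t.to.z <- function(t, df) {
  p <- 2 * pt(-abs(t), df)                 # two-tailed p-value of the t-test
  qnorm(1 - p / 2)                         # absolute z-score with the same p-value
}
f.to.z <- function(f, df1, df2) {
  p <- pf(f, df1, df2, lower.tail = FALSE) # p-value of the F-test
  qnorm(1 - p / 2)
}
t.to.z(2.88, 107)                          # e.g., t(107) = 2.88 corresponds to |z| ~ 2.82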

[Figure Motyl.2d0.2d4: z-curve estimates for Motyl et al.’s data using all z-scores greater than 2.0 (left) and greater than 2.4 (right)]

The key finding is the replicability estimate. Both estimations produce similar results (48% vs. 49%). Even with over 1,000 observations there is uncertainty in these estimates, and the 95% CI ranges from 45% to 54% when all significant results are used. Based on this finding, it is predicted that about half of these results would produce a significant result again in a replication study.

However, it is important to note that there is considerable heterogeneity in replicability across studies. As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate. This is shown with average power estimates for bands of z-scores at the bottom of the figure. In the left figure, z-scores between 2 and 2.5 (~ .01 < p < .05) have a replicability of only 31%, and even z-scores between 2.5 and 3 have a replicability below 50%. It requires z-scores greater than 4 to reach a replicability of 80% or more. Similar results are obtained for actual replication studies in the OSC reproducibility project. Thus, researchers should take the strength of evidence of a particular study into account. Studies with p-values in the .01 to .05 range are unlikely to replicate without boosting sample sizes. Studies with p-values less than .001 are likely to replicate even with the same sample size.

Independent Replication Study 

Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project. For this purpose, I coded all studies in the OSC reproducibility project. The actual replication project often picked only one study from articles with multiple studies. The 54 social psychology articles reported 173 studies. The focal hypothesis test of each study was used to compute absolute z-scores that were analyzed with z-curve.

[Figure OSC.soc: z-curve estimates for the social psychology studies in the OSC reproducibility project]

The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%).  The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap.  Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.

Automated Extraction Estimates

Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases. Often studies report more than one hypothesis test, and it is not trivial to pick one of the tests for further analysis. An alternative approach is to automatically extract all test statistics from articles. This also makes it possible to base estimates on a much larger sample of test results. The downside of automated extraction is that articles also report statistical analyses for trivial or non-critical tests (e.g., manipulation checks). The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability. I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings). Further analyses replicated the result from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals. The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging trend in 2016. This estimate is about 10 percentage points above the estimates based on hand-coded focal hypothesis tests in the two datasets above. This discrepancy could be due to the inclusion of non-focal and trivial statistical tests in the automated analysis. However, a 10-point difference is not a dramatic difference. Neither 50% nor 60% replicability justifies claims that social psychology is rotten to the core, nor do these estimates meet the expectation that researchers should plan studies with 80% power to detect a predicted effect.

[Figure replicability-cog-vs-soc: automated replicability estimates for cognitive versus social psychology journals, 2010-2016]

Moderator Analyses

Motyl et al. (in press) did extensive coding of the studies.  This makes it possible to examine potential moderators (predictors) of higher or lower replicability.  As noted earlier, the strength of evidence is an important predictor.  Studies with higher z-scores (smaller p-values) are, on average, more replicable.  The strength of evidence is a direct function of statistical power.  Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.

It is well known that larger samples have less sampling error.  Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3).  I also examined the R-Index for different ranges of sample sizes.  The R-Index was the lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52) and further for sample sizes between 200 and 1,000 (R-Index = 69).  Interestingly, the R-Index for small samples with N < 40 was 70.  This is explained by the fact that research designs also influence replicability and that small samples often use more powerful within-subject designs.

A moderator analysis with design as moderator confirms this. The R-Index for between-subject designs is the lowest (R-Index = 48), followed by mixed designs (R-Index = 61) and within-subject designs (R-Index = 75). This pattern is also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.

Another possibility is that articles with more studies contain smaller and less replicable studies. However, the number of studies in an article was not a notable moderator: 1 study, R-Index = 53; 2 studies, R-Index = 51; 3 studies, R-Index = 60; 4 studies, R-Index = 52; 5 studies, R-Index = 53.

Conclusion 

Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals.  Their article complements results from the OSC reproducibility project that used actual replications, but a much smaller number of studies.  The two approaches produce different results.  Actual replication studies produced only 25% successful replications.  Statistical estimates of replicability are around 50%.   Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting the differences.  However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly.  This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result.  Some non-random, but also not reproducible, factor may have helped to produce a significant result in this study.  Statistical models assume that we can redo a study exactly and may therefore overestimate the success rate for actual replication studies.  Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly.  This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.


Hidden Figures: Replication Failures in the Stereotype Threat Literature

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016).  The replication crisis is often considered a new phenomenon, but failed replications are not entirely new.  Sometimes these studies have simply been ignored.  These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies. The chance of reporting a false positive result in a multiple-study article is low because the risk of a false positive decreases exponentially with the number of significant studies (Schimmack, 2012). However, the low risk of a false positive is illusory if authors only publish studies that worked. In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP article by Bem (2011). As a result, publication bias undermines the value of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959), and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests.  This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).” “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.” “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype threat on test performance in a real-world testing situation. These studies had large samples and were not limited to students at Stanford. One study was reported in a College Board Report (Stricker and Ward, 1998). Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004). This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context. One group of participants was asked about their gender or ethnicity before the test; the other group did not receive these questions. As noted in the abstract, neither the inquiry about race nor the inquiry about gender had a significant effect on test performance. In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests - the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests - in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 21, 1997), in contrast to the substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquiries about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat. However, if this were a real moderator, it would be important to document this fact, and it is not clear why this finding obtained in an earlier study by Steele remained unpublished. Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation. A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity. Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate). In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance do not influence performance. This finding is not widely publicized, but it is important for a better understanding of performance differences in real-world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study.  However, exact replication studies are difficult and costly.  An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black.  The study also had a 2 x 3 design, which leaves fewer than 20 participants per condition.   The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1, 107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19).   However, a specific contrast showed a significant difference between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82.  The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800).  In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. The sample consisted of 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1, 35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic condition, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other.  The consistent finding across the two studies that supports stereotype threat theory is that merely changing the description of an assessment task changes Black participants’ performance, as revealed by significant differences between the diagnostic and non-diagnostic conditions in both studies.  The problem is that both studies had small numbers of Black participants, and small samples have low power to produce significant results. As a result, it is unlikely that a pair of studies like these would produce significant results in both studies.

Observed power in the two studies is .81 and .62, with a median observed power of .71. Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate. Moreover, when such inflation is evident, median observed power is itself inflated. To correct for this inflation, the Replicability-Index (R-Index) subtracts the inflation from median observed power, which yields an R-Index of 42.  Any value below 50 is considered unacceptably low, and I give it a letter grade F, just as students at American universities receive an F for exams with less than 50% correct answers.  This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies. It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black participants’ performance on academic tests.
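For readers who want to check these numbers, the calculation can be reproduced with a few lines of base R. This is a minimal sketch: observed power is approximated on the z-scale with alpha = .05 (two-tailed), ignoring the negligible opposite tail.

# t-values and degrees of freedom of the critical contrasts in Studies 1 and 2
t.values <- c(2.88, 2.38)
df       <- c(107, 35)
p.values <- 2 * pt(t.values, df, lower.tail = FALSE)  # .005 and .023
z.scores <- qnorm(1 - p.values / 2)                   # 2.82 and 2.28
obs.pow  <- pnorm(z.scores - qnorm(.975))             # .81 and .62

success.rate <- 1                           # 2 out of 2 significant results
median.pow   <- median(obs.pow)             # .71
inflation    <- success.rate - median.pow   # .29
r.index      <- median.pow - inflation      # .42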

The Test of Insufficient Variance (TIVA) provides another way to examine published results.  Test statistics like t-values vary considerably from study to study, even if the exact same study is conducted twice (or if one large sample is randomly split into two sub-samples).  When test statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1.  If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample. Often the selection is a result of publication bias.  Applying TIVA to the pair of studies yields a variance of Var(z) = 0.15.  As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices.  Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.
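A minimal R sketch of this test compares the variance of the two z-scores against the expected variance of 1 with a left-tailed chi-square test:

# Test of Insufficient Variance for the z-scores from Studies 1 and 2
z <- c(2.82, 2.28)
k <- length(z)
var.z  <- var(z)                               # 0.15
p.tiva <- pchisq(var.z * (k - 1), df = k - 1)  # .30, left-tailed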

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3.  This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task.  The sample size of 68 participants (35 Black, 33 White) implies that only 11 or 12 participants were assigned to each of the six cells of the 2 (race) x 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2, 61) = 3.30, p = .044.  In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53, p < .001.

Study 4.  This study used an inquiry about race to induce stereotype threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic).  The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to each of the four conditions.  The degrees of freedom indicate that cell frequencies were even lower. The study produced a significant interaction effect, F(1, 39) = 7.82, p = .008.  The study also produced a significant contrast between Black participants in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not based on a performance measure.  If stereotype threat mediates the effect of task characteristics on performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance.  The key aim of stereotype threat theory is to explain differences in performance.  With a focus on performance outcomes, it is possible to compute the R-Index and TIVA for Studies 1, 2, and 4.  All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study      Test Statistic    p-value   z-score   obs.pow
Study 1    t(107) = 2.88     0.005     2.82      0.81
Study 2    t(35) = 2.38      0.023     2.28      0.62
Study 4    t(39) = 2.43      0.020     2.33      0.64

Median observed power is 64% and the R-Index is well below 50 (64 - 36 = 28, an F).  The variance of the z-scores is Var(z) = 0.09, p = .086.  These results cast doubt on the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.
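The same calculations as above, applied to the three performance contrasts in Table 1 (again a minimal sketch in base R):

t.values <- c(2.88, 2.38, 2.43)   # Studies 1, 2, and 4
df       <- c(107, 35, 39)
p.values <- 2 * pt(t.values, df, lower.tail = FALSE)
z.scores <- qnorm(1 - p.values / 2)         # 2.82, 2.28, 2.33
obs.pow  <- pnorm(z.scores - qnorm(.975))   # .81, .62, .64

r.index <- median(obs.pow) - (1 - median(obs.pow))  # .29 unrounded; 28 with the rounded values above
k       <- length(z.scores)
var.z   <- var(z.scores)                            # 0.09
p.tiva  <- pchisq(var.z * (k - 1), df = k - 1)      # .086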

Conclusion

Racial stereotypes and racial disparities are an important social issue.  Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies.  In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data.  Just as it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth.  Unfortunately, social psychologists have not followed the example of the astronomers, and the numbers do not add up.

The three African American women featured in this year’s movie “Hidden Figures”***, Katherine Johnson, Dorothy Vaughan, and Mary Jackson, might not approve of the casual way social psychologists use numbers in their research, especially the widespread practice of hiding numbers that do not match expectations.  No science that wants to make a real-world contribution can condone this practice.  It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype-threatening situations.  Yet they prevailed, and they paved the way for future generations of stereotyped groups.  Understanding racial and gender bias and performance differences remains an important issue, and that is why it is important to shed light on hidden numbers and put simplistic theories under the microscope. Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities.  It is time for social psychologists to live up to their promises by tackling real-world issues with research designs that have real-world significance and that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.

 

Personalized Adjustment of p-values for publication bias

The logic of null-hypothesis significance testing (NHST) is straightforward (Schimmack, 2017). The observed signal in a study is compared against the noise in the data due to sampling variation.  This signal-to-noise ratio is used to compute a probability, the p-value.  If this p-value is below a threshold, typically p < .05, it is assumed that the observed signal is not just noise, and the null-hypothesis is rejected in favor of the hypothesis that the observed signal reflects a true effect.
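As a minimal illustration with made-up data (a hypothetical two-group comparison, not any of the studies discussed in this post), the signal-to-noise ratio and the resulting p-value can be computed in R as follows:

set.seed(1)
group1 <- rnorm(20, mean = 0.5)   # hypothetical treatment group
group2 <- rnorm(20, mean = 0.0)   # hypothetical control group

signal     <- mean(group1) - mean(group2)       # observed mean difference
pooled.var <- (var(group1) + var(group2)) / 2   # pooled variance (equal n)
noise      <- sqrt(pooled.var * (1/20 + 1/20))  # standard error of the difference
t.value    <- signal / noise                    # signal-to-noise ratio
p.value    <- 2 * pt(abs(t.value), df = 38, lower.tail = FALSE)
# identical to t.test(group1, group2, var.equal = TRUE)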

NHST aims to keep the probability of a false positive discovery at a desirable rate. With p < .05, no more than 5% of all statistical tests can produce a false positive result.  In other words, as long as all results are reported, the long-run rate of false positive discoveries cannot exceed 5%.

The problem with the application of NHST in practice is that not all statistical results are reported. As a result, the rate of false positive discoveries among published results can be much higher than 5% (Sterling, 1959; Sterling et al., 1995), and statistical significance no longer provides meaningful information about the probability of false positive results.
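A small simulation sketch illustrates this point. The numbers are hypothetical: 4,000 tests of true null hypotheses and 1,000 tests of real effects studied with roughly 30% power.

set.seed(123)
z.null   <- rnorm(4000, mean = 0)     # effects that do not exist
z.effect <- rnorm(1000, mean = 1.44)  # real effects, about 30% power at alpha = .05
sig.null   <- abs(z.null)   > 1.96
sig.effect <- abs(z.effect) > 1.96

mean(sig.null)   # about .05: alpha limits false positives among ALL tests of true nulls
# If only significant results are published, the share of false positives
# among the published "discoveries" is much higher than 5%:
sum(sig.null) / (sum(sig.null) + sum(sig.effect))   # roughly .40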

In order to produce meaningful statistical results, it would be necessary to know how many statistical tests were actually performed to produce the published significant results. This set of studies includes studies with non-significant results that remained unpublished, which is often called researchers’ file drawer (Rosenthal, 1979).  Schimmack and Brunner (2016) developed a statistical method that estimates the size of researchers’ file drawer.  This makes it possible to correct reported p-values for publication bias so that p-values resume their proper function of providing statistical evidence about the probability of observing a false positive result.

The correction process is first illustrated with a powergraph for statistical results reported in 103 journals in the year 2016 (see the 2016 Replicability Rankings for more details).  Each test statistic is converted into an absolute z-score.  Absolute z-scores quantify the signal-to-noise ratio in a study.  Z-scores can be compared against the standard normal distribution that is expected for studies without an effect (the null-hypothesis).  A z-score of 1.96 (see the red dashed vertical line in the graph) corresponds to the typical p < .05 (two-tailed) criterion.  The graph below shows that 63% of reported test statistics were statistically significant by this criterion.

[Figure: Powergraph for all test statistics reported in 103 journals in 2016, combined]
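The conversion from reported test statistics to absolute z-scores can be sketched in R (using the Study 1 contrast from the analysis above as an example):

# A two-tailed p-value is converted into an absolute z-score
p.to.z <- function(p) qnorm(1 - p / 2)
p.to.z(.05)    # 1.96, the red dashed line in the powergraph
p.to.z(.005)   # 2.81

# For a t-value, the two-tailed p-value is computed first
t.to.z <- function(t, df) p.to.z(2 * pt(abs(t), df, lower.tail = FALSE))
t.to.z(2.88, 107)   # 2.82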

Powergraphs use a statistical method, z-curve (Schimmack & Brunner, 2016), to model the distribution of statistically significant z-scores (z-scores > 1.96).  Based on the model results, z-curve estimates how many non-significant results one would expect. This expected distribution is shown as the grey curve in the figure, which overlaps with the green and black curves. It is clearly visible that the estimated number of non-significant results is much larger than the actually reported number of non-significant results (the blue bars for z-scores between 0 and 1.96).  This difference shows the size of the file drawer.

Powergraphs provide important information about the average power of studies in psychology.  Power is the average probability of obtaining a statistically significant result in the set of all statistical tests that were conducted, including those in the file drawer.  The estimated power is 39%.  This estimate is consistent with other estimates of power (Cohen, 1962; Sedlmeier & Gigerenzer, 1989) and below the acceptable minimum of 50% (Tversky & Kahneman, 1971).

Powergraphs also provide important information about the replicability of significant results. A published significant result is used to support the claim of a discovery. However, even a true discovery may not be replicable if the original study had low statistical power. In this case, it is likely that a replication study produces a false negative result; it fails to affirm the presence of an effect with p < .05, even though an effect actually exists. The powergraph estimate of replicability is 70%.  That is, a randomly drawn significant effect published in 2016 has only a 70% chance of producing a significant result again in an exact replication study.

Importantly, replicability is not uniform across all significant results. Replicability increases with the signal-to-noise ratio (Open Science Collaboration, 2015). In 2017, powergraphs were enhanced by providing information about replicability for different levels of strength of evidence. In the graph below, z-scores between 0 and 6 are divided into 12 categories with a width of 0.5 standard deviations (0-0.5, 0.5-1, ..., 5.5-6). For significant results, the displayed values are the average replicability for z-scores in the specified range.

The graph shows a replicability estimate of 46% for z-scores between 2 and 2.5. Thus, a z-score greater than 2.5 is needed to meet the minimum standard of 50% replicability.  More important, these power values can be converted into p-values because power and p-values are monotonically related (Hoenig & Heisey, 2001).  If p < .05 is the significance criterion, 50% power corresponds to a p-value of .05.  This also means that all z-scores below 2.5 correspond to p-values greater than .05 once the influence of publication bias is taken into account.  A z-score of 2.6 roughly corresponds to a p-value of .01.  Thus, a simple heuristic for readers of psychology journals is to treat only p-values below .01 as significant if they want to maintain the nominal error rate of 5%.
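The translation between z-scores, two-tailed p-values, and power used here can be verified with a few lines of R (again approximating power on the z-scale and ignoring the negligible opposite tail):

z.to.p <- function(z) 2 * pnorm(abs(z), lower.tail = FALSE)
z.to.p(2.5)   # about .012
z.to.p(2.6)   # about .009, i.e. roughly .01

# A study whose true (noncentral) z-score equals the critical value of 1.96
# has exactly 50% power:
pnorm(1.96 - qnorm(.975))   # 0.50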

One problem with a general adjustment is that file drawers differ across journals and authors.  An adjustment based on the average publication bias across journals would penalize authors who invest resources into well-designed studies with high power, and it would fail to adjust fully for publication bias for authors who conduct many underpowered studies that capitalize on chance to produce significant results. It is widely recognized that scientific markets reward quantity of publications over quality.  A personalized adjustment can solve this problem because authors with large file drawers receive a bigger adjustment, and many of their nominally significant results will no longer be significant after the adjustment for publication bias has been made.

I illustrate this with two real-world examples. The first example shows the powergraph of Marcel Zeelenberg.  The left powergraph shows a model that assumes no file drawer. The model fits the actual distribution of z-scores rather well. However, the graph shows a small bump of just-significant results (z = 2 to 2.2) that is not explained by the model. This bump could reflect the use of questionable research practices (QRPs), but it is relatively small (as we will see shortly).  The graph on the right side uses only statistically significant results. This is important because only these results were published to claim a discovery. We see that the small bump has a strong effect on the estimate of the file drawer. It would require a large set of non-significant results to produce this bump; it is more likely that QRPs were used to produce it. However, the bump is small, and overall replicability is higher than the average for all journals.  We also see that z-scores between 2 and 2.5 have an average replicability estimate of 52%. This means no adjustment is needed, and p-values reported by Marcel Zeelenberg can be interpreted without adjustment. Over the 15-year period covered by the analysis, Marcel Zeelenberg reported 537 significant results, and we can conclude from this analysis that no more than 5% (27) of these results are false positives.

[Figure: Powergraphs for Marcel Zeelenberg]

 

A different picture emerges for the powergraph based on Ayelet Fishbach’s statistical results. The left graph shows a big bump of just-significant results that is not explained by a model without publication bias.  The right graph shows that the replicability estimate is much lower than for Marcel Zeelenberg and for the analysis of all journals in 2016.

[Figure: Powergraphs for Ayelet Fishbach]

The average replicability estimate for z-values between 2 and 2.5 is only 33%.  This means that researchers would be unlikely to obtain a significant result if they attempted an exact replication study of one of these findings.  More important, it means that p-values adjusted for publication bias are well above .05.  Even z-scores in the 2.5 to 3 band average a replicability estimate of only 46%. This means that only z-scores greater than 3 remain significant after the correction for publication bias is applied.

Non-Significance Does Not Mean Null-Effect 

It is important to realize that a non-significant result does not mean that there is no effect. It simply means that the signal-to-noise ratio is too weak to infer that an effect was present.  It is entirely possible that Ayelet Fishbach made theoretically correct predictions. However, to provide evidence for her hypotheses, she conducted studies with a high failure rate, and many of these studies failed to support her hypotheses. These failures were not reported, but they have to be taken into account in the assessment of the risk of a false discovery.  A p-value of .05 is only meaningful in the context of the number of attempts that have been made.  Nominally, a p-value of .03 may appear to be the same across statistical analyses, but the real evidential value of a p-value is not equivalent.  Using powergraphs to equate evidential value, a p-value of .05 published by Marcel Zeelenberg is equivalent to a p-value of .005 (z = 2.8) published by Ayelet Fishbach.

The Influence of Questionable Research Practices 

Powergraphs assume that an excessive number of significant results is caused by publication bias. However, questionable research practices also contribute to the reporting of mostly successful results.  Replicability estimates and the p-value adjustment for publication bias may themselves be biased by the use of QRPs.  Unfortunately, this effect is difficult to predict because different QRPs have different effects on replicability estimates; some QRPs will lead to an overcorrection.  Although this creates uncertainty about the right amount of adjustment, a stronger adjustment may have the advantage that it could deter researchers from using QRPs because it would undermine the credibility of their published results.

Conclusion 

Over the past five years, psychologists have contemplated ways to improve the credibility and replicability of published results.  So far, these ideas have yet to show a notable effect on replicability (Schimmack, 2017).  One reason is that the incentive structure rewards the number of publications, and replicability is not considered in the review process. Reviewers and editors treat all p-values as equal when they are not.  The ability to adjust p-values based on the true evidential value that they provide may help to change this.  Journals may lose their impact once readers adjust p-values and realize that many nominally significant results are actually not statistically significant after taking publication bias into account.