Category Archives: Replicability

How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

Over the past five years, psychological science has been in a crisis of confidence.  For decades, psychologists have assumed that published significant results provide strong evidence for theoretically derived predictions, especially when authors presented multiple studies with internal replications within a single article (Schimmack, 2012). However, even multiple significant results provide little empirical evidence, when journals only publish significant results (Sterling, 1959; Sterling et al., 1995).  When published results are selected for significance, statistical significance loses its ability to distinguish replicable effects from results that are difficult to replicate or results that are type-I errors (i.e., the theoretical prediction was false).

The crisis of confidence led to several initiatives to conduct independent replications. The most informative replication initiative was conducted by the Open Science Collaborative (Science, 2015).  It replicated close to 100 significant results published in three high-ranked psychology journals.  Only 36% of the replication studies replicated a statistically significant result.  The replication success rate varied by journal.  The journal “Psychological Science” achieved a success rate of 42%.

The low success rate raises concerns about the empirical foundations of psychology as a science.  Without further information, a success rate of 42% implies that it is unclear which published results provide credible evidence for a theory and which findings may not replicate.  It is impossible to conduct actual replication studies for all published studies.  Thus, it is highly desirable to identify replicable findings in the existing literature.

One solution is to estimate replicability for sets of studies based on the published test statistics (e.g., F-statistic, t-values, etc.).  Schimmack and Brunner (2016) developed a statistical method, Powergraphs, that estimates the average replicability of a set of significant results.  This method has been used to estimate replicability of psychology journals using automatic extraction of test statistics (2016 Replicability Rankings, Schimmack, 2017).  The results for Psychological Science produced estimates in the range from 55% to 63% for the years 2010-2016 with an average of 59%.   This is notably higher than the success rate for the actual replication studies, which only produced 42% successful replications.

There are two explanations for this discrepancy.  First, actual replication studies are not exact replication studies and differences between the original and the replication studies may explain some replication failures.  Second, the automatic extraction method may overestimate replicability because it may include non-focal statistical tests. For example, significance tests of manipulation checks can be highly replicable, but do not speak to the replicability of theoretically important predictions.

To address the concern about automatic extraction of test statistics, I estimated replicability of focal hypothesis tests in Psychological Science with hand-coded, focal hypothesis tests.  I used three independent data sets.

Study 1

For Study 1, I hand-coded focal hypothesis tests of all studies in the 2008 Psychological Science articles that were used for the OSC reproducibility project (Science, 2015).

OSC.PS

The powergraphs show the well-known effect of publication bias in that most published focal hypothesis tests report a significant result (p < .05, two-tailed, z > 1.96) or at least a marginally significant result (p < .10, two-tailed or p < .05, one-tailed, z > 1.65). Powergraphs estimate the average power of studies with significant results on the basis of the density distribution of significant z-scores.  Average power is an estimate of replicabilty for a set of exact replication studies.  The left graph uses all significant results. The right graph uses only z-scores greater than 2.4 because questionable research practices may produce many just-significant results and lead to biased estimates of replicability. However, both estimation methods produce similar estimates of replicability (57% & 61%).  Given the small number of statistics the 95%CI is relatively wide (left graph: 44% to 73%).  These results are compatible with the low actual success rate for actual replication studies (42%) and the estimate based on automated extraction (59%).

Study 2

The second dataset was provided by Motyl et al. (JPSP, in press), who coded a large number of articles from social psychology journals and psychological science. Importantly, they coded a representative sample of Psychological Science studies from the years 2003, 2004, 2013, and 2014. That is, they did not only code social psychology articles published in Psychological Science.  The dataset included 281 test statistics from Psychological Science.

PS.Motyl

The powergraph looks similar to the powergraph in Study 1.  More important, the replicability estimates are also similar (57% & 52%).  The 95%CI for Study 1 (44% to 73%) and Study 2 (left graph: 49% to 65%) overlap considerably.  Thus, two independent coding schemes and different sets of studies (2008 vs. 2003-2004/2013/2014) produce very similar results.

Study 3

Study 3 was carried out in collaboration with Sivaani Sivaselvachandran, who hand-coded articles from Psychological Science published in 2016.  The replicability rankings showed a slight positive trend based on automatically extracted test statistics.  The goal of this study was to examine whether hand-coding would also show an increase in replicability.  An increase was expected based on an editorial by D. Stephen Linday, incoming editor in 2015, who aimed to increase replicability of results published in Psychological Science by introducing badges for open data and preregistered hypotheses. However, the results failed to show a notable increase in average replicability.

PS.2016

The replicability estimate was similar to those in the first two studies (59% & 59%).  The 95%CI ranged from 49% to 70%. These wide confidence intervals make it difficult to notice small improvements, but the histogram shows that just significant results (z = 2 to 2.2) are still the most prevalent results reported in Psychological Science and that non-significant results that are to be expected are not reported.

Combined Analysis 

Given the similar results in all three studies, it made sense to pool the data to obtain the most precise estimate of replicability of results published in Psychological Science. With 479 significant test statistics, replicability was estimated at 58% with a 95%CI ranging from 51% to 64%.  This result is in line with the estimated based on automated extraction of test statistics (59%).  The reason for the close match between hand-coded and automated results could be that Psych Science publishes short articles and authors may report mostly focal results because space does not allow for extensive reporting of other statistics.  The hand-coded data confirm that replicabilty in Psychological Science is likely to be above 50%.

PS.combined

It is important to realize that the 58% estimate is an average.  Powergraphs also show average replicability for segments of z-scores. Here we see that replicabilty for just-significant results (z < 2.5 ~ p > .01) is only 35%. Even for z-score between 2.5 and 3.0 (~ p > .001) is only 47%.  Once z-scores are greater than 3, average replicabilty is above 50% and with z-scores greater than 4, replicability is greater than 80%.  For any single study, p-values can vary greatly due to sampling error, but in general a published result with a p-value < .001 is much more likely to replicate than a p-value > .01 (see also OSC, Science, 2015).

Conclusion

This blog-post used hand-coding of test-statistics published in Psychological Science, the flagship journal of the Association for Psychological Science, to estimate replicabilty of published results.  Three dataset produced convergent evidence that the average replicabilty of exact replication studies is 58% +/- 7%.  This result is consistent with estimates based on automatic extraction of test statistics.  It is considerably higher than the success rate of actual replication studies in the OSC reproducibility project (42%). One possible reason for this discrepancy is that actual replication studies are never exact replication studies, which makes it more difficult to obtain statistical significance if the original studies are selected for significance. For example, the original study may have had an outlier in the experimental group that helped to produce a significant result. Not removing this outlier is not considered a questionable research practice, but an exact replication study will not reproduce the same outlier and may fail to reproduce a just-significant result.  More broadly, any deviation from the assumptions underlying the computation of test statistics will increase the bias that is introduced by selecting significant results.  Thus, the 58% estimate is an optimistic estimate of the maximum replicability under ideal conditions.

At the same time, it is important to point out that 58% replicability for Psychological Science does not mean psychological science is rotten to the core (Motyl et al., in press) or that most reported results are false (Ioannidis, 2005).  Even results that did not replicate in actual replication studies are not necessarily false positive results.  It is possible that more powerful studies would produce a significant result, but with a smaller effect size estimate.

Hopefully, these analyses will spur further efforts to increase replicability of published results in Psychological Science and in other journals.  We are already near the middle of 2017 and can look forward to the 2017 results.

 

 

 

How replicable are statistically significant results in social psychology? A replication and extension of Motyl et al. (in press). 

Forthcoming article: 
Motyl, M., Demos, A. P., Carsel, T. S., Hanson, B. E., Melton, Z. J., Mueller, A. B., Prims, J., Sun, J., Washburn, A. N., Wong, K., Yantis, C. A., & Skitka, L. J. (in press). The state of social and personality science: Rotten to the core, not so bad, getting better, or getting worse? Journal of Personality and Social Psychology. (preprint)

Brief Introduction

Since JPSP published incredbile evidence for mental time travel (Bem, 2011), the credibility of social psychological research has been questioned.  There is talk of a crisis of confidence, a replication crisis, or a credibility crisis.  However, hard data on the credibility of empirical findings published in social psychology journals are scarce.

There have been two approaches to examine the credibility of social psychology.  One approach relies on replication studies.  Authors attempt to replicate original studies as closely as possible.  The most ambitious replication project was carried out by the Open Science Collaboration (Science, 2015) that replicated 1 study from 100 articles; 54 articles were classified as social psychology.   For original articles that reported a significant result, only a quarter replicated a significant result in the replication studies.  This estimate of replicability suggests that researches conduct many more studies than are published and that effect sizes in published articles are inflated by sampling error, which makes them difficult to replicate. One concern about the OSC results is that replicating original studies can be difficult.  For example, a bilingual study in California may not produce the same results as a bilingual study in Canada.  It is therefore possible that the poor outcome is partially due to problems of reproducing the exact conditions of original studies.

A second approach is to estimate replicability of published results using statistical methods.  The advantage of this approach is that replicabiliy estimates are predictions for exact replication studies of the original studies because the original studies provide the data for the replicability estimates.   This is the approach used by Motyl et al.

The authors sampled 30% of articles published in 2003-2004 (pre-crisis) and 2013-2014 (post-crisis) from four major social psychology journals (JPSP, PSPB, JESP, and PS).  For each study, coders identified one focal hypothesis and recorded the statistical result.  The bulk of the statistics were t-values from t-tests or regression analyses and F-tests from ANOVAs.  Only 19 statistics were z-tests.   The authors applied various statistical tests to the data that test for the presence of publication bias or whether the studies have evidential value (i.e., reject the null-hypothesis that all published results are false positives).  For the purpose of estimating replicability, the most important statistic is the R-Index.

The R-Index has two components.  First, it uses the median observed power of studies as an estimate of replicability (i.e., the percentage of studies that should produce a significant result if all studies were replicated exactly).  Second, it computes the percentage of studies with a significant result.  In an unbiased set of studies, median observed power and percentage of significant results should match.  Publication bias and questionable research practices will produce more significant results than predicted by median observed power.  The discrepancy is called the inflation rate.  The R-Index subtracts the inflation rate from median observed power because median observed power is an inflated estimate of replicability when bias is present.  The R-Index is not a replicability estimate.  That is, an R-Index of 30% does not mean that 30% of studies will produce a significant result.  However, a set of studies with an R-Index of 30 will have fewer successful replications than a set of studies with an R-Index of 80.  An exception is an R-Index of 50, which is equivalent with a replicability estimate of 50%.  If the R-Index is below 50, one would expect more replication failures than successes.

Motyl et al. computed the R-Index separately for the 2003/2004 and the 2013/2014 results and found “the R-index decreased numerically, but not statistically over time, from .62 [CI95% = .54, .68] in 2003-2004 to .52 [CI95% = .47, .56] in 2013-2014. This metric suggests that the field is not getting better and that it may consistently be rotten to the core.”

I think this interpretation of the R-Index results is too harsh.  I consider an R-Index below 50 an F (fail).  An R-Index in the 50s is a D, and an R-Index in the 60s is a C.  An R-Index greater than 80 is considered an A.  So, clearly there is a replication crisis, but social psychology is not rotten to the core.

The R-Index is a simple tool, but it is not designed to estimate replicability.  Jerry Brunner and I developed a method that can estimate replicability, called z-curve.  All test-statistics are converted into absolute z-scores and a kernel density distribution is fitted to the histogram of z-scores.  Then a mixture model of normal distributions is fitted to the density distribution and the means of the normal distributions are converted into power values. The weights of the components are used to compute the weighted average power. When this method is applied only to significant results, the weighted average power is the replicability estimate;  that is, the percentage of significant results that one would expect if the set of significant studies were replicated exactly.   Motyl et al. did not have access to this statistical tool.  They kindly shared their data and I was able to estimate replicability with z-curve.  For this analysis, I used all t-tests, F-tests, and z-tests (k = 1,163).   The Figure shows two results.  The left figure uses all z-scores greater than 2 for estimation (all values on the right side of the vertical blue line). The right figure uses only z-scores greater than 2.4.  The reason is that just-significant results may be compromised by questionable research methods that may bias estimates.

Motyl.2d0.2d4

The key finding is the replicability estimate.  Both estimations produce similar results (48% vs. 49%).  Even with over 1,000 observations there is uncertainty in these estimates and the 95%CI can range from 45 to 54% using all significant results.   Based on this finding, it is predicted that about half of these results would produce a significant result again in a replication study.

However, it is important to note that there is considerable heterogeneity in replicability across studies.  As z-scores increase, the strength of evidence becomes stronger, and results are more likely to replicate.  This is shown with average power estimates for bands of z-scores at the bottom of the figure.   In the left figure,  z-scores between 2 and 2.5 (~ .01 < p < .05) have only a replicability of 31%, and even z-scores between 2.5 and 3 have a replicability below 50%.  It requires z-scores greater than 4 to reach a replicability of 80% or more.   Similar results are obtained for actual replication studies in the OSC reproducibilty project.  Thus, researchers should take the strength of evidence of a particular study into account.  Studies with p-values in the .01 to .05 range are unlikely to replicate without boosting sample sizes.  Studies with p-values less than .001 are likely to replicate even with the same sample size.

Independent Replication Study 

Schimmack and Brunner (2016) applied z-curve to the original studies in the OSC reproducibility project.  For this purpose, I coded all studies in the OSC reproducibility project.  The actual replication project often picked one study from articles with multiple studies.  54 social psychology articles reported 173 studies.   The focal hypothesis test of each study was used to compute absolute z-scores that were analyzed with z-curve.

OSC.soc

The two estimation methods (using z > 2.0 or z > 2.4) produced very similar replicability estimates (53% vs. 52%).  The estimates are only slightly higher than those for Motyl et al.’s data (48% & 49%) and the confidence intervals overlap.  Thus, this independent replication study closely replicates the estimates obtained with Motyl et al.’s data.

Automated Extraction Estimates

Hand-coding of focal hypothesis tests is labor intensive and subject to coding biases. Often studies report more than one hypothesis test and it is not trivial to pick one of the tests for further analysis.  An alternative approach is to automatically extract all test statistics from articles.  This makes it also possible to base estimates on a much larger sample of test results.  The downside of automated extraction is that articles also report statistical analysis for trivial or non-critical tests (e.g., manipulation checks).  The extraction of non-significant results is irrelevant because they are not used by z-curve to estimate replicability.  I have reported the results of this method for various social psychology journals covering the years from 2010 to 2016 and posted powergraphs for all journals and years (2016 Replicability Rankings).   Further analyses replicated the results from the OSC reproducibility project that results published in cognitive journals are more replicable than those published in social journals.  The Figure below shows that the average replicability estimate for social psychology is 61%, with an encouraging trend in 2016.  This estimate is about 10% above the estimates based on hand-coded focal hypothesis tests in the two datasets above.  This discrepancy can be due to the inclusion of less original and trivial statistical tests in the automated analysis.  However, a 10% difference is not a dramatic difference.  Neither 50% nor 60% replicability justify claims that social psychology is rotten to the core, nor do they meet the expectation that researchers should plan studies with 80% power to detect a predicted effect.

replicability-cog-vs-soc

Moderator Analyses

Motyl et al. (in press) did extensive coding of the studies.  This makes it possible to examine potential moderators (predictors) of higher or lower replicability.  As noted earlier, the strength of evidence is an important predictor.  Studies with higher z-scores (smaller p-values) are, on average, more replicable.  The strength of evidence is a direct function of statistical power.  Thus, studies with larger population effect sizes and smaller sampling error are more likely to replicate.

It is well known that larger samples have less sampling error.  Not surprisingly, there is a correlation between sample size and the absolute z-scores (r = .3).  I also examined the R-Index for different ranges of sample sizes.  The R-Index was the lowest for sample sizes between N = 40 and 80 (R-Index = 43), increased for N = 80 to 200 (R-Index = 52) and further for sample sizes between 200 and 1,000 (R-Index = 69).  Interestingly, the R-Index for small samples with N < 40 was 70.  This is explained by the fact that research designs also influence replicability and that small samples often use more powerful within-subject designs.

A moderator analysis with design as moderator confirms this.  The R-Indices for between-subject designs is the lowest (R-Index = 48) followed by mixed designs (R-Index = 61) and then within-subject designs (R-Index = 75).  This pattern is also found in the OSC reproducibility project and partially accounts for the higher replicability of cognitive studies, which often employ within-subject designs.

Another possibility is that articles with more studies package smaller and less replicable studies.  However,  number of studies in an article was not a notable moderator:  1 study R-Index = 53, 2 studies R-Index = 51, 3 studies R-Index = 60, 4 studies R-Index = 52, 5 studies R-Index = 53.

Conclusion 

Motyl et al. (in press) coded a large and representative sample of results published in social psychology journals.  Their article complements results from the OSC reproducibility project that used actual replications, but a much smaller number of studies.  The two approaches produce different results.  Actual replication studies produced only 25% successful replications.  Statistical estimates of replicability are around 50%.   Due to the small number of actual replications in the OSC reproducibility project, it is important to be cautious in interpreting the differences.  However, one plausible explanation for lower success rates in actual replication studies is that it is practically impossible to redo a study exactly.  This may even be true when researchers conduct three similar studies in their own lab and only one of these studies produces a significant result.  Some non-random, but also not reproducible, factor may have helped to produce a significant result in this study.  Statistical models assume that we can redo a study exactly and may therefore overestimate the success rate for actual replication studies.  Thus, the 50% estimate is an optimistic estimate for the unlikely scenario that a study can be replicated exactly.  This means that even though optimists may see the 50% estimate as “the glass half full,” social psychologists need to increase statistical power and pay more attention to the strength of evidence of published results to build a robust and credible science of social behavior.

 

 

Hidden Figures: Replication Failures in the Stereotype Threat Literature

In the past five years, it has become apparent that many classic and important findings in social psychology fail to replicate (Schimmack, 2016).  The replication crisis is often considered a new phenomenon, but failed replications are not entirely new.  Sometimes these studies have simply been ignored.  These studies deserve more attention and need to be reevaluated in the context of the replication crisis in social psychology.

In the past, failed replications were often dismissed because seminal articles were assumed to provide robust empirical support for a phenomenon, especially if an article presented multiple studies. The chance of reporting a false positive results in a multiple study article is low because the risk of a false positive decreases exponentially (Schimmack, 2012). However, the low risk of a false positive is illusory if authors only publish studies that worked. In this case, even false positives can be supported by significant results in multiple studies, as demonstrated in the infamous ESP study by Bem (2011).  As a result, publication bias undermines the reporting of statistical significance as diagnostic information about the risk of false positives (Sterling, 1959) and many important theories in social psychology rest on shaky empirical foundations that need to be reexamined.

Research on stereotype threat and women’s performance on math tests is one example where publication bias undermines the findings in a seminal study that produced a large literature of studies on gender differences in math performance. After correcting for publication bias, this literature shows very little evidence that stereotype threat has a notable and practically significant effect on women’s math performance (Flore & Wicherts, 2014).

Another important line of research has examined the contribution of stereotype threat to differences between racial groups on academic performance tests.  This blog post examines the strength of the empirical evidence for stereotype threat effects in the seminal article by Steele and Aronson (1995). This article is currently the 12th most cited article in the top journal for social psychology, Journal of Personality and Social Psychology (2,278 citations so far).

According to the abstract, “stereotype threat is being at risk of confirming, as self-characteristic, a negative stereotype about one’s group.” Studies 1 and 2 showed that “reflecting the pressure of this vulnerability, Blacks underperformed in relation to Whites in the ability-diagnostic condition but not in the nondiagnostic condition (with Scholastic Aptitude Tests controlled).”  “Study 3 validated that ability-diagnosticity cognitively activated the racial stereotype in these participants and motivated them not to conform to it, or to be judged by it.”  “Study 4 showed that mere salience of the stereotype could impair Blacks’ performance even when the test was not
ability diagnostic.”

The results of Study 4 motivated Stricker and colleagues to examine the influence of stereotype-treat on test performance in a real-world testing situation.  These studies had large samples and were not limited to students at Stanford. One study was reported in a College Board Report (Stricker and Ward, 1998).   Another two studies were published in the Journal of Applied Social Psychology (Stricker & Ward, 2004).  This article received only 52 citations, although it reported two studies with an experimental manipulation of stereotype threat in a real assessment context.  One group of participants were asked about their gender or ethnicity before the text, the other group did not receive these questions.  As noted in the abstract, neither the inquiry about race, nor about gender, had a significant effect on test performance. In short, this study failed to replicate Study 4 of the classic and widely cited article by Steele and Aronson.

Stricker and Ward’s Abstract
Steele and Aronson (1995) found that the performance of Black research participants on
ability test items portrayed as a problem-solving task, in laboratory experiments, was affected adversely when they were asked about their ethnicity. This outcome was attributed to stereotype threat: Performance was disrupted by participants’ concerns about fulfilling the negative stereotype concerning Black people’s intellectual ability. The present field experiments extended that research to other ethnic groups and to males and females taking operational tests. The experiments evaluated the effects of inquiring about ethnicity and gender on the performance of students taking 2 standardized tests-the Advanced Placement Calculus AB Examination, and the Computerized Placement Tests-in actual test administrations. This inquiry did not have any effects on the test performance of Black, female, or other subgroups of students that were both statistically and practically significant.

The article also mentions a personal communication with Steele, in which Steele mentions an unpublished study that also failed to demonstrate the effect under similar conditions.

“In fact, Steele found in an unpublished pilot study that inquiring about ethnicity did not affect Black participants’ performance when the task was described as diagnostic of their ability (C. M. Steele, personal communication, May 2 1, 1997), in contrast to the
substantial effect of inquiring when the task was described as nondiagnostic.”

A substantive interpretation of this finding is that inquires about race or gender do not produce stereotype threat effects when a test is diagnostic because a diagnostic test already activates stereotype threat.  However, if this were a real moderator, it would be important to document this fact and it is not clear why this finding obtained in an earlier study by Steele remained unpublished. Moreover, it is premature to interpret the significant result in the published study with a non-diagnostic task and the non-significant result in an unpublished study with a diagnostic task as evidence that diagnosticity moderates the effect of the stereotype-threat manipulation. A proper test of this moderator hypothesis would require the demonstration of a three-way interaction between race, inquiry about race, and diagnosticity. Absent this evidence, it remains possible that diagnosticity is not a moderator and that the published result is a false positive (or a positive result with an inflated effect size estimate). In contrast, there appears to be consistent evidence that inquiries about race or gender before a real assessment of academic performance does not influence performance. This finding is not widely publicized, but is important for a better understanding of performance differences in real world settings.

The best way to examine the replicability of Steele and Aronson’s seminal finding with non-diagnostic tasks would be to conduct an exact replication study.  However, exact replication studies are difficult and costly.  An alternative is to examine the robustness of the published results by taking a closer look at the strength of the statistical results reported by Steele and Aronson, using modern statistical tests of publication bias and statistical power like the R-Index (Schimmack, 2014) and the Test of Insufficient Variance (TIVA, Schimmack, 2014).

Replicability Analysis of Steele and Aronson’s four studies

Study 1. The first study had a relatively large sample of N = 114 participants, but it is not clear how many of the participants were White or Black.  The study also had a 2 x 3 design, which leaves less than 20 participants per condition.   The study produced a significant main effect of condition, F(2, 107) = 4.74, and race, F(1,107) = 5.22, but the critical condition x race interaction was not significant (reported as p > .19).   However, a specific contrast showed significant differences between Black participants in the diagnostic condition and the non-diagnostic condition, t(107) = 2.88, p = .005, z = 2.82.  The authors concluded “in sum, then, the hypothesis was supported by the pattern of contrasts, but when tested over the whole design, reached only marginal significance” (p. 800).  In other words, Study 1 provided only weak support for the stereotype threat hypothesis.

Study 2. Study 2 eliminated one of the three experimental conditions. Participants were 20 Black and 20 White participants. This means there were only 10 participants in each condition of a 2 x 2 design. The degrees of freedom further indicate that the actual sample size was only 38 participants. Given the weak evidence in Study 1, there is no justification for a reduction in the number of participants per cell, although the difficulty of recruiting Black participants at Stanford may explain this inadequate sample size. Nevertheless, the study showed a significant interaction between race and test description, F(1,35) = 8.07, p = .007. The study also replicated the contrast from Study 1 that Black participants in the diagnostic condition performed significantly worse than Black participants in the non-diagnostic group, t(35) = 2.38, p = .023, z = 2.28.

Studies 1 and 2 are close replications of each other.  The consistent finding across the two studies that supports stereotype-treat theory is the finding that merely changing the description of an assessment task changes Black participants performance, as revealed by significant differences between the diagnostic and non-diagnostic condition in both studies.  The problem is that both studies had small numbers of Black participants and that small samples have low power to produce significant results. As a result, it is unlikely that a pair of studies would produce significant results in both studies.

Observed power  in the two studies is .81 and .62 with median observed power of .71. Thus, the actual success rate of 100% (2 out of 2 significant results) is 29 percentage points higher than the expected success rate. Moreover, when inflation is evident, median observed power is also inflated. To correct for this inflation, the Replicability-Index (R-Index) subtracts inflation from median observed power, which yields an R-Index of 42.  Any value below 50 is considered unacceptably low and I give it a letter grade F, just like students at American Universities receive an F for exams with less than 50% correct answers.  This does not mean that stereotype threat is not a valid theory or that there was no real effect in this pair of studies. It simply means that the evidence in this highly cited article is insufficient to make strong claims about the causes of Black’s performance on academic tests.

The Test of Insufficient Variance (TIVA) provides another way to examine published results.  Test statistics like t-values vary considerably from study to study even if the exact same study is conducted twice (or if one larger sample is randomly split into two sub-samples).  When test-statistics are converted into z-scores, sampling error (the random variability from sample to sample) follows approximately a standard normal distribution with a variance of 1.  If the variance is considerably smaller than 1, it suggests that the reported results represent a selected sample. Often the selection is a result of publication bias.  Applying TIVA to the pair of studies, yields a variance of Var(z) = 0.15.  As there are only two studies, it is possible that this outcome occurred by chance, p = .300, and it does not imply intentional selection for significance or other questionable research practices.  Nevertheless, it suggests that future replication studies will be more variable and produce some non-significant results.

In conclusion, the evidence presented in the first two studies is weaker than we might assume if we focused only on the fact that both studies produced significant contrasts. Given publication bias, the fact that both studies reported significant results provides no empirical evidence because virtually all published studies report significant results. The R-Index quantifies the strength of evidence for an effect while taking the influence of publication bias into account and it shows that the two studies with small samples provide only weak evidence for an effect.

Study 3.  This study did not examine performance. The aim was to demonstrate activation of stereotype threat with a sentence completion task.  The sample size of 68 participants  (35 Black, 33 White) implied that only 11 or 12 participants were assigned to one of the six cells in a 2 (race) by 3 (task description) design. The study produced main effects for race and condition, but most importantly it produced a significant interaction effect, F(2,61) = 3.30, p = .044.  In addition, Black participants in the diagnostic condition had more stereotype-related associations than Black participants in the non-diagnostic condition, t(61) = 3.53,

Study 4.  This study used inquiry about race to induce stereotype-threat. Importantly, the task was described as non-diagnostic (as noted earlier, a similar study produced no significant results when the task was described as diagnostic).  The design was a 2 x 2 design with 47 participants, which means only 11 or 12 participants were allocated to the four conditions.  The degrees of freedom indicated that cell frequencies were even lower. The study produced a significant interaction effect, F(1,39) = 7.82, p = .008.  The study also produced a significant contrast between Blacks in the race-prime condition and the no-prime condition, t(39) = 2.43, p = .020.

The contrast effect in Study 3 is strong, but it is not a performance measure.  If stereotype threat mediates the effect of task characteristics and performance, we would expect a stronger effect on the measure of the mediator than on the actual outcome of interest, task performance.  The key aim of stereotype threat theory is to explain differences in performance.  With a focus on performance outcomes, it is possible to examine the R-Index and TIVA of Studies 1, 2, and 4.  All three studies reported significant contrasts between Black students randomly assigned to two groups that were expected to show performance differences (Table 1).

Table 1

Study Test Statistic p-value z-score obs.pow
Study 1 t(107) = 2.88 0.005 2.82 0.81
Study 2 t(35)=2.38 0.023 2.28 0.62
Study 4 t(39) = 2.43 0.020 2.33 0.64

Median observed power is 64 and the R-Index is well below 50, 64 – 36 = 28 (F).  The variance in z-scores is Var(z) = 0.09, p = .086.  These results cast doubt about the replicability of the performance effects reported in Steele and Aronson’s seminal stereotype threat article.

Conclusion

Racial stereotypes and racial disparities are an important social issue.  Social psychology aims and promises to contribute to the understanding of this issue by conducting objective, scientific studies that can inform our understanding of these issues.  In order to live up to these expectations, social psychology has to follow the rules of science and listen to the data.  Just like it is important to get the numbers right to send men and women into space (and bring them back), it is important to get the numbers right when we use science to understand women and men on earth.  Unfortunately, social psychologists have not followed the examples of astronomers and the numbers do not add up.

The three African American women, features in this years movie “Hidden Figures”***,  Katherine Johnson, Dorothy Vaughan, and Mary Jackson might not approve of the casual way social psychologists use numbers in their research, especially the wide-spread practice of hiding numbers that do not match expectations.  No science that wants to make a real-world contribution can condone this practice.  It is also not acceptable to simply ignore published results from well-conducted studies with large samples that challenge a prominent theory.

Surely, the movie Hidden Figures dramatized some of the experiences of Black women at NASA, but there is little doubt that Katherine Johnson, Dorothy Vaughan, and Mary Jackson encountered many obstacles that might be considered stereotype threatening situations.  Yet, they prevailed and they paved the way for future generations of stereotyped groups.  Understanding racial and gender bias and performance differences remains an important issue and that is the reason why it is important to shed a light on hidden numbers and put simplistic theories under the microscope. Stereotype threat is too often used as a simple explanation that avoids tackling deeper and more difficult issues that cannot be easily studied in a quick laboratory experiment with undergraduate students at top research universities.  It is time for social psychologists to live up to its promises by tackling real world issues with research designs that have real world significance that produce real evidence using open and transparent research practices.

————————————————————————————————————————————

*** If you haven’t seen the movie, I highly recommend it.

 

Random measurement error and the replication crisis: A statistical analysis

This is a draft of a commentary on Loken and Gelman’s Science article “Measurement error and the replication crisis. Comments are welcome.

Random Measurement Error Reduces Power, Replicability, and Observed Effect Sizes After Selection for Significance

Ulrich Schimmack and Rickard Carlsson

In the article “Measurement error and the replication crisis” Loken and Gelman (LG) “caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger” (1). We agree with the overall message that it is a fallacy to interpret observed effect size estimates in small samples as accurate estimates of population effect sizes.  We think it is helpful to recognize the key role of statistical power in significance testing.  If studies have less than 50% power, effect sizes must be inflated to be significant. Thus, all observed effect sizes in these studies are inflated.  Once power is greater than 50%, it is possible to obtain significance with observed effect sizes that underestimate the population effect size. However, even with 80% power, the probability of overestimation is 62.5%. [corrected]. As studies with small samples and small effect sizes often have less than 50% power (2), we can safely assume that observed effect sizes overestimate the population effect size. The best way to make claims about effect sizes in small samples is to avoid interpreting the point estimate and to interpret the 95% confidence interval. It will often show that significant large effect sizes in small samples have wide confidence intervals that also include values close to zero, which shows that any strong claims about effect sizes in small samples are a fallacy (3).

Although we agree with Loken and Gelman’s general message, we believe that their article may have created some confusion about the effect of random measurement error in small samples with small effect sizes when they wrote “In a low-noise setting, the theoretical results of Hausman and others correctly show that measurement error will attenuate coefficient estimates. But we can demonstrate with a simple exercise that the opposite occurs in the presence of high noise and selection on statistical significance” (p. 584).  We both read this sentence as suggesting that under the specified conditions random error may produce even more inflated estimates than perfectly reliable measure. We show that this interpretation of their sentence would be incorrect and that random measurement error always leads to an underestimation of observed effect sizes, even if effect sizes are selected for significance. We demonstrate this fact with a simple equation that shows that true power before selection for significance is monotonically related to observed power after selection for significance. As random measurement error always attenuates population effect sizes, the monotonic relationship implies that observed effect sizes with unreliable measures are also always attenuated.  We provide the formula and R-Code in a Supplement. Here we just give a brief description of the steps that are involved in predicting the effect of measurement error on observed effect sizes after selection for significance.

The effect of random measurement error on population effect sizes is well known. Random measurement error adds variance to the observed measures X and Y, which lowers the observable correlation between two measures. Random error also increases the sampling error. As the non-central t-value is the proportion of these two parameters, it follows that random measurement error always attenuates power. Without selection for significance, median observed effect sizes are unbiased estimates of population effect sizes and median observed power matches true power (4,5). However, with selection for significance, non-significant results with low observed power estimates are excluded and median observed power is inflated. The amount of inflation is proportional to true power. With high power, most results are significant and inflation is small. With low power, most results are non-significant and inflation is large.

inflated-mop

Schimmack developed a formula that specifies the relationship between true power and median observed power after selection for significance (6). Figure 1 shows that median observed power after selection for significant is a monotonic function of true power.  It is straightforward to transform inflated median observed power into median observed effect sizes.  We applied this approach to Locken and Gelman’s simulation with a true population correlation of r = .15. We changed the range of sample sizes from 50 to 3050 to 25 to 1000 because this range provides a better picture of the effect of small samples on the results. We also increased the range of reliabilities to show that the results hold across a wide range of reliabilities. Figure 2 shows that random error always attenuates observed effect sizes, even after selection for significance in small samples. However, the effect is non-linear and in small samples with small effects, observed effect sizes are nearly identical for different levels of unreliability. The reason is that in studies with low power, most of the observed effect is driven by the noise in the data and it is irrelevant whether the noise is due to measurement error or unexplained reliable variance.

inflated-effect-sizes

In conclusion, we believe that our commentary clarifies how random measurement error contributes to the replication crisis.  Consistent with classic test theory, random measurement error always attenuates population effect sizes. This reduces statistical power to obtain significant results. These non-significant results typically remain unreported. The selective reporting of significant results leads to the publication of inflated effect size estimates. It would be a fallacy to consider these effect size estimates reliable and unbiased estimates of population effect sizes and to expect that an exact replication study would also produce a significant result.  The reason is that replicability is determined by true power and observed power is systematically inflated by selection for significance.  Our commentary also provides researchers with a tool to correct for the inflation by selection for significance. The function in Figure 1 can be used to deflate observed effect sizes. These deflated observed effect sizes provide more realistic estimates of population effect sizes when selection bias is present. The same approach can also be used to correct effect size estimates in meta-analyses (7).

References

1. Loken, E., & Gelman, A. (2017). Measurement error and the replication crisis. Science,

355 (6325), 584-585. [doi: 10.1126/science.aal3618]

2. Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145-153, http://dx.doi.org/10.1037/h004518

3. Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003. http://dx.doi.org/10.1037/0003-066X.49.12.99

4. Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. http://dx.doi.org/10.1037/a0029487

5. Schimmack, U. (2016). A revised introduction to the R-Index. https://replicationindex.wordpress.com/2016/01/31/a-revised-introduction-to-the-r-index

6. Schimmack, U. (2017). How selection for significance influences observed power. https://replicationindex.wordpress.com/2017/02/21/how-selection-for-significance-influences-observed-power/

7. van Assen, M.A., van Aert, R.C., Wicherts, J.M. (2015). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 293-309. doi: 10.1037/met0000025.

################################################################

#### R-CODE ###

################################################################

### sample sizes

N = seq(25,500,5)

### true population correlation

true.pop.r = .15

### reliability

rel = 1-seq(0,.9,.20)

### create matrix of population correlations between measures X and Y.

obs.pop.r = matrix(rep(true.pop.r*rel),length(N),length(rel),byrow=TRUE)

### create a matching matrix of sample sizes

N = matrix(rep(N),length(N),length(rel))

### compute non-central t-values

ncp.t = obs.pop.r / ( (1-obs.pop.r^2)/(sqrt(N – 2)))

### compute true power

true.power = pt(ncp.t,N-2,qt(.975,N-2))

###  Get Inflated Observed Power After Selection for Significance

inf.obs.pow = pnorm(qnorm(true.power/2+(1-true.power),qnorm(true.power,qnorm(.975))),qnorm(.975))

### Transform Into Inflated Observed t-values

inf.obs.t = qt(inf.obs.pow,N-2,qt(.975,N-2))

### Transform inflated observed t-values into inflated observed effect sizes

inf.obs.es = (sqrt(N + 4*inf.obs.t^2 -2) – sqrt(N – 2))/(2*inf.obs.t)

### Set parameters for Figure

x.min = 0

x.max = 500

y.min = 0.10

y.max = 0.45

ylab = “Inflated Observed Effect Size”

title = “Effect of Selection for Significance on Observed Effect Size”

### Create Figure

for (i in 1:length(rel)) {

print(i)

plot(N[,1],inf.obs.es[,i],type=”l”,xlim=c(x.min,x.max),ylim=c(y.min,y.max),col=col[i],xlab=”Sample Size”,ylab=”Median Observed Effect Size After Selection for Significance”,lwd=3,main=title)

segments(x0 = 600,y0 = y.max-.05-i*.02, x1 = 650,col=col[i], lwd=5)

text(730,y.max-.05-i*.02,paste0(“Rel = “,format(rel[i],nsmall=1)))

par(new=TRUE)

}

abline(h = .15,lty=2)

##################### THE END #################################

Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings  are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations make questionable assumptions and shows with empirical data that Ioannidis’s simulations are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

PTP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the percentage of true hypothesis (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is typically defined as 1 minus the type-II error (beta).

(2) TP = PTH * Power = PTH * (1 – beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true in more than 50% of published results that falsely rejected it.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplied to the inequality equation

(6) alpha > PTH/PFH * power

We can rearrange formula (6) and substitute PFH with (1-PHT) to determine the maximum proportion of true hypotheses to produce over 50% false positive results.

(7a)  =  alpha = PTH/(1-PTH) * power

(7b) = alpha*(1-PTH) = PTH * power

(7c) = alpha – PTH*alpha = PTH * power

(7d) =  alpha = PTH*alpha + PTH*power

(7e) = alpha = PTH(alpha + power)

(7f) =  alpha/(power + alpha) = PTH

 

Table 1 shows the results.

Power                  PTH / PFH             
90%                       5  / 95
80%                       6  / 94
70%                       7  / 93
60%                       8  / 92
50%                       9  / 91
40%                      11 / 89
30%                       14 / 86
20%                      20 / 80
10%                       33 / 67                     

Even if researchers would conduct studies with only 20% power to discover true positive results, we would only obtain more than 50% false positive results if only 20% of hypothesis were true. This makes it rather implausible that most published results could be false.

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced due to various questionable research practices that help researchers to report significant results. The main effect of these practices is that the probability of a false positive result to become significant increases.

Simmons et al. (2011) showed that massive use several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an assumed alpha of 50%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power                 PTH/PFH             
90%                     40 / 60
80%                     43 / 57
70%                     46 / 54
60%                     50 / 50
50%                     55 / 45
40%                     60 / 40
30%                     67 / 33
20%                     75 / 25
10%                      86 / 14                    

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypothesis with 50% power and 50% of tested hypothesis are false.

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained, increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypothesis and false hypothesis. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis’s assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypothesis produce a significant result. However, with a bias factor of .4, another 40% of the false negative results will become significant, adding another .4*.5 = 20% true positive results to the number of true positive results. This gives a total of 70% positive results, which is a 40% increase over the number of positive results that would have been obtained without bias. However, this increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38% of false positive results. So instead of 5% false positive results, bias increases the percentage of false positive results from 5% to 43%, an increase by 760%. Thus, the effect of bias on the PPV is not equal. A 40% increase of false positives has a much stronger impact on the PPV than a 40% increase of true positives. Ioannidis provides no rational for this bias model.

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices.
For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000* .30*.10*.20 = 6 true positives results compared to 1000* .30*.90*.95 = 265 false positive results (i.e., 44:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.” (e124).

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’ claims rest on the opposite assumption that most hypothesis are a priori false. This makes little sense when the a priori hypothesis is specified as a null-effect and even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results, if the null-hypothesis is false in no more than 50% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

10 years after the publication of “Why Most Published Research Findings Are False,”  it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.

Figure 1 illustrates the distribution of z-values that is expected for Ioanndis’s model for “adequately powered exploratory epidemiological study” (Simulation 6 in Figure 4). Ioannidis assumes that for every true positive, there are 10 false positives (R = 1:10). He also assumed that studies have 80% power to detect a true positive. In addition, he assumed 30% bias.

ioannidis-fig6

A 30% bias implies that for every 100 false hypotheses, there would be 33 (100*[.30*.95+.05]) rather than 5 false positive results (.95*.30+.05)/.95). The effect on false negatives is much smaller (100*[.30*.20 + .80]). Bias was modeled by increasing the number of attempts to produce a significant result so that proportion of true and false hypothesis matched the predicted proportions. Given an assumed 1:10 ratio of true to false hypothesis, the ratio is 335 false hypotheses to 86 true hypotheses. The simulation assumed that researchers tested 100,000 false hypotheses and observed 35000 false positive results and that they tested 10,000 true hypotheses and observed 8,600 true positive results. Bias was simulated by increasing the number of tests to produce the predicted ratio of true and false positive results.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.95 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

ioannidis-fig1

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false.  The maximum value that could be obtained is obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 55%. As power is much lower than 100%, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from false negatives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicabiltiy that could be obtained with 50% false-positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%.  However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power and 100% bias and an equal percentage of true and false hypotheses. The true replicabilty for this scenario is .05*.50 + .90 * .50 = 47.5%. z-curve slightly overestimates replicabilty and produced an estimate of 51%.  Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false.  Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicabilty estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggest that there is a high proportion of studies that produced true positive results with low power.

ioannidis-fig3

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicabilty Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all non-significant results were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.

Powergraphs for JEP-LMC3.g

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

Powergraphs for PsySci3.g

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

Powergraphs for JPSP-ASC3.g

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on test results that tested an original, novel, or an important finding, whereas articles also often report significance tests for other effects. For example, an intervention study may show a strong decrease in depression, when only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.

 

A replicability analysis of”I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dijksterhuis, A. (2004). I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning. JOURNAL OF PERSONALITY AND SOCIAL PSYCHOLOGY,   Volume: 86,   Issue: 2,   Pages: 345-355. 

DOI: 10.1037/0022-3514.86.2.345

There are a lot of articles with questionable statistical results and it seems pointless to single out particular articles.  However, once in a while, an article catches my attention and I will comment on the statistical results in it.  This is one of these articles….

The format of this review highlights why articles like this passed peer-review and are cited at high frequency as if they provided empirical facts.  The reason is a phenomenon called “verbal overshadowing.”   In work on eye-witness testimony, participants first see the picture of a perpetrator. Before the actual line-up task, they are asked to give a verbal description of the tasks.  The verbal description can distort the memory of the actual face and lead to a higher rate of misidentifications.  Something similar happens when researchers read articles. Sometimes they only read abstracts, but even when they read the article, the words can overshadow the actual empirical results. As a result, memory is more strongly influenced by verbal descriptions than by the cold and hard statistical facts.

In the first part, I will present the results of the article verbally without numbers. In the second part, I will present only the numbers.

Part 1:

In the article “I Like Myself but I Don’t Know Why: Enhancing Implicit Self-Esteem by
Subliminal Evaluative Conditioning” Ap Dijksterhuis reports the results of six studies (1-4, 5a, 5b).  All studies used a partially or fully subliminal evaluative conditioning task to influence implicit measures of self-esteem. The abstract states: “Participants were repeatedly presented with trials in which the word I was paired with positive trait terms. Relative to control conditions, this procedure enhanced implicit self-esteem.”  Study 1 used preferences for initials to measure implicit self-esteem. and “results confirmed the hypothesis that evaluative conditioning enhanced implicit self-esteem.” (p. 348). Study 2 modified the control condition and showed that “participants in the conditioned self-esteem condition showed higher implicit self-esteem after the treatment than before the treatment, relative to control participants” (p. 348).  Experiment 3 changed the evaluative conditioning procedure. Now, both the CS and the US (positive trait terms) were
presented subliminally for 17 ms.  It also used the Implicit Association Test to measure implicit self-esteem.  The results showed that “difference in response latency between blocks was much more pronounced in the conditioned self-esteem condition, indicating higher self-esteem” (p. 349).  Study 4 also showed that “participants in the conditioned self-esteem condition exhibited higher implicit self-esteem than participants
in the control condition” (p. 350).  Study 5a and 5b showed that “individuals whose
self-esteem was enhanced seemed to be insensitive to personality feedback, whereas control participants whose self-esteem was not enhanced did show effects of the intelligence feedback.” (p. 352).  The General Discussion section summarizes the results. “In our experiments, implicit self-esteem was enhanced through subliminal evaluative conditioning. Pairing the self-depicting word I with positive trait terms consistently improved implicit self-esteem.” (p. 352).  A final conclusion section points out the potential of this work for enhancing self-esteem. “It is worthwhile to explicitly mention an intriguing aspect of the present work. Implicit self-esteem can be enhanced, at least temporarily, subliminally in about 25 seconds.” (p. 353).

 

Part 2:

Study Statistic p z OP
1 F(1,76)=5.15 0.026 2.22 0.60
2 F(1,33)=4.32 0.046 2.00 0.52
3 F(1,14)=8.84 0.010 2.57 0.73
4 F(1,79)=7.45 0.008 2.66 0.76
5a F(1,89)=4.91 0.029 2.18 0.59
5b F(1,51)=4.74 0.034 2.12 0.56

All six studies produced statistically significant results. To achieve this outcome two conditions have to be met: (a) the effect exists and (b) sampling error is small to avoid a failed study  (i.e., a non-significant result even though the effect is real).   The probability of obtaining a significant result is called power. The last column shows observed power. Observed power can be used to estimate the actual power of the six studies. Median observed power is 60%.  With 60% power, we would expect that only 60% of the 6 studies (3.6 studies) produce a significant result, but all six studies show a significant result.  The excess of significant result shows that the results in this article present an overly positive picture of the robustness of the effect.  If these six studies were replicated exactly, we would not expect to obtain six significant results again.  Moreover, the inflation of significant results also leads to an inflation of the power estimate. The R-Index corrects for this inflation by subtracting the inflation rate (100% observed success rate – 60% median observed power) from the power estimate.  The R-Index is .60 – .40 = .20.  Results with such a low R-Index often do not replicate in independent replication attempts.

Another method to examine the replicability of these results is to examine the variability of the z-scores (second last column).  Each z-score reflects the strength of evidence against the null-hypothesis. Even if the same study is replicated, this measure will vary as a function of random sampling.  The expected variance is approximately 1 (the standard deviation of a standard normal distribution).  Low variance suggests that future studies will produce more variable results and with p-values close to .05, this means that future studies are expected to produce non-significant results.  This bias test is called the Test of Insufficient Variance (TIVA).  The variance of the z-scores is Var(z) = 0.07.  The probability of this restricted variance to occur by chance is p = .003 (1/300).

Based on these results, the statistical evidence presented in this article is questionable and does not provide support for the conclusion that subliminal evaluative conditioning can enhance implicit self-esteem.  Another problem with this conclusion is that implicit self-esteem measures have low reliability and low convergent validity.  As a result, we would not expect strong and consistent effects of any experimental manipulation on these measures.  Finally, even if a small and reliable effect could be obtained, it remains an open question whether this effect shows an effect on implicit self-esteem or whether the manipulation produces a systematic bias in the measurement of implicit self-esteem.  “It is not yet known how long the effects of this manipulation last. In addition, it is not yet
known whether people who could really benefit from enhanced self-esteem (i.e., people with problematically low levels of self-esteem) can benefit from subliminal conditioning techniques.” (p. 353).  12 years later, we may wonder whether these results have been replicated in other laboratories and whether these effects last more than a few minutes after the conditioning experiment.

If you like Part I better, feel free to boost your self-esteem here.

selfesteemboost