Are Most Published Results in Psychology False? An Empirical Study

Why Most Published Research Findings are False by John P. A. Ioannidis

In 2005, John P. A. Ioannidis wrote an influential article with the title “Why Most Published Research Findings are False.” The article starts with the observation that “there is increasing concern that most current published research findings are false” (e124). Later on, however, the concern becomes a fact. “It can be proven that most claimed research findings are false” (e124). It is not surprising that an article that claims to have proof for such a stunning claim has received a lot of attention (2,199 citations and 399 citations in 2016 alone in Web of Science).

Most citing articles focus on the possibility that many or even more than half of all published results could be false. Few articles cite Ioannidis to make the factual statement that most published results are false, and there appears to be no critical examination of Ioannidis’s simulations that he used to support his claim.

This blog post shows that these simulations make questionable assumptions and shows with empirical data that Ioannidis’s simulations are inconsistent with actual data.

Critical Examination of Ioannidis’s Simulations

First, it is important to define what a false finding is. In many sciences, a finding is published when a statistical test produced a significant result (p < .05). For example, a drug trial may show a significant difference between a drug and a placebo control condition with a p-value of .02. This finding is then interpreted as evidence for the effectiveness of the drug.

How could this published finding be false? The logic of significance testing makes this clear. The only inference that is being made is that the population effect size (i.e., the effect size that could be obtained if the same experiment were repeated with an infinite number of participants) is different from zero and in the same direction as the one observed in the study. Thus, the claim that most significant results are false implies that in more than 50% of all published significant results the null-hypothesis was true. That is, a false positive result was reported.

Ioannidis then introduces the positive predictive value (PPV). The positive predictive value is the proportion of positive results (p < .05) that are true positives.

(1) PPV = TP/(TP + FP)

TP = True Positive Results, FP = False Positive Results

The proportion of true positive results (TP) depends on the percentage of true hypotheses (PTH) and the probability of producing a significant result when a hypothesis is true. This probability is known as statistical power. Statistical power is defined as 1 minus the type-II error rate (beta).

(2) TP = PTH * Power = PTH * (1 – beta)

The probability of a false positive result depends on the proportion of false hypotheses (PFH) and the criterion for significance (alpha).

(3) FP = PFH * alpha

This means that the actual proportion of true significant results is a function of the ratio of true and false hypotheses (PTH:PFH), power, and alpha.

(4) PPV = (PTH*power) / ((PTH*power) + (PFH * alpha))
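Equation (4) is easy to check numerically. The following is a minimal Python sketch; the function name `ppv` and its argument names are my own labels for illustration, not Ioannidis’s notation.

```python
def ppv(pth, power, alpha=0.05):
    """Positive predictive value from equation (4).

    pth   : proportion of tested hypotheses that are true (PTH)
    power : probability of a significant result when the hypothesis is true
    alpha : probability of a significant result when the null is true
    """
    pfh = 1.0 - pth          # proportion of false hypotheses (PFH)
    tp = pth * power         # equation (2)
    fp = pfh * alpha         # equation (3)
    return tp / (tp + fp)    # equation (1)


# Example: half of all hypotheses are true, 50% power, alpha = .05
print(round(ppv(0.50, 0.50), 3))   # 0.909
```

With these illustrative inputs, about 91% of significant results would be true positives.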

Ioannidis translates his claim that most published findings are false into a PPV below 50%. This would mean that the null-hypothesis is true, and was therefore falsely rejected, in more than 50% of published significant results.

(5) (PTH*power) / ((PTH*power) + (PFH * alpha))  < .50

Equation (5) can be simplified to the inequality

(6) alpha > PTH/PFH * power

We can turn inequality (6) into an equation, substitute PFH with (1-PTH), and solve for PTH to determine the maximum proportion of true hypotheses that still produces 50% or more false positive results.

(7a) alpha = PTH/(1-PTH) * power

(7b) alpha*(1-PTH) = PTH * power

(7c) alpha – PTH*alpha = PTH * power

(7d) alpha = PTH*alpha + PTH*power

(7e) alpha = PTH*(alpha + power)

(7f) alpha/(power + alpha) = PTH

 

Table 1 shows the results for the nominal alpha of 5%.

Power      PTH / PFH
90%         5 / 95
80%         6 / 94
70%         7 / 93
60%         8 / 92
50%         9 / 91
40%        11 / 89
30%        14 / 86
20%        20 / 80
10%        33 / 67

Even if researchers conducted studies with only 20% power to discover true positive results, more than 50% of significant results would be false positives only if fewer than 20% of hypotheses were true. This makes it rather implausible that most published results could be false.
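The thresholds in Table 1 follow directly from equation (7f). The sketch below recomputes them in Python, assuming the nominal alpha of .05; the function name is a hypothetical label of my own.

```python
def max_pth_for_majority_false(power, alpha=0.05):
    """Equation (7f): the PTH at which the PPV is exactly 50%."""
    return alpha / (power + alpha)


# One line per row of Table 1: power, then PTH / PFH in percent
for power in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
    pth = max_pth_for_majority_false(power)
    print(f"{power:.0%}  {pth * 100:.0f} / {(1 - pth) * 100:.0f}")
```

Any PTH below the printed value implies that more than half of the significant results are false positives.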

To justify his bold claim, Ioannidis introduces the notion of bias. Bias can be introduced by various questionable research practices that help researchers to report significant results. The main effect of these practices is that the probability that a test of a false hypothesis produces a significant result increases.

Simmons et al. (2011) showed that massive use of several questionable research practices (p-hacking) can increase the risk of a false positive result from the nominal 5% to 60%. If we assume that bias is rampant and substitute the nominal alpha of 5% with an assumed alpha of 60%, fewer false hypotheses are needed to produce more false than true positives (Table 2).

Power      PTH / PFH
90%        40 / 60
80%        43 / 57
70%        46 / 54
60%        50 / 50
50%        55 / 45
40%        60 / 40
30%        67 / 33
20%        75 / 25
10%        86 / 14

If we assume that bias inflates the risk of type-I errors from 5% to 60%, it is no longer implausible that most research findings are false. In fact, more than 50% of published results would be false if researchers tested hypotheses with 50% power and 50% of tested hypotheses were false.
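The values in Table 2 can be recomputed the same way. This is a minimal sketch that assumes an inflated alpha of .60, the value that reproduces the table; `BIASED_ALPHA` and `ppv` are my own names.

```python
def ppv(pth, power, alpha):
    """Equation (4): share of significant results that are true positives."""
    return pth * power / (pth * power + (1 - pth) * alpha)


BIASED_ALPHA = 0.60   # assumed false positive risk under massive p-hacking

# Threshold from equation (7f) for each power level in Table 2
for power in (0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1):
    pth = BIASED_ALPHA / (power + BIASED_ALPHA)
    print(f"{power:.0%}  {pth * 100:.0f} / {(1 - pth) * 100:.0f}")

# 50% power and half of all hypotheses true: most positives are false
print(round(ppv(0.5, 0.5, BIASED_ALPHA), 3))   # 0.455
```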

However, the calculations in Table 2 ignore the fact that questionable research practices that inflate false positives also decrease the rate of false negatives. For example, a researcher who continues testing until a significant result is obtained increases the chances of obtaining a significant result no matter whether the hypothesis is true or false.

Ioannidis recognizes this, but he assumes that bias has the same effect for true hypotheses and false hypotheses. This assumption is questionable because it is easier to produce a significant result if an effect exists than if no effect exists. Ioannidis’s assumption implies that bias increases the proportion of false positive results a lot more than the proportion of true positive results.

For example, if power is 50%, only 50% of true hypotheses produce a significant result. However, with a bias factor of .4, another 40% of the false negative results will become significant, adding another .4*.5 = 20% true positive results. This gives a total of 70% positive results, which is a 40% increase over the number of positive results that would have been obtained without bias. However, this increase in true positive results pales in comparison to the effect that 40% bias has on the rate of false positives. As there are 95% true negatives, 40% bias produces another .95*.40 = 38% of false positive results. So instead of 5% false positive results, bias increases the percentage of false positive results from 5% to 43%, an increase by 760%. Thus, the effect of bias on the PPV is not equal. The same 40% bias increases false positives proportionally much more than true positives, and therefore lowers the PPV. Ioannidis provides no rationale for this bias model.
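The arithmetic of this example can be written out as a short Python sketch. The helper `biased_rate` is a hypothetical name of my own for the bias rule described above: a fraction of the otherwise non-significant results is turned into significant ones.

```python
def biased_rate(rate, bias):
    """Share of significant results after a fraction `bias` of the
    otherwise non-significant results is reported as significant."""
    return rate + bias * (1 - rate)


power, alpha, bias = 0.50, 0.05, 0.40

tp_rate = biased_rate(power, bias)   # significant results, true hypotheses
fp_rate = biased_rate(alpha, bias)   # significant results, false hypotheses

print(round(tp_rate, 2), round(fp_rate, 2))     # 0.7 0.43
print(round((tp_rate - power) / power * 100))   # 40  (% increase)
print(round((fp_rate - alpha) / alpha * 100))   # 760 (% increase)
```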

A bigger concern is that Ioannidis makes sweeping claims about the proportion of false published findings based on untested assumptions about the proportion of null-effects, statistical power, and the amount of bias due to questionable research practices. For example, he suggests that 4 out of 5 discoveries in adequately powered (80% power) exploratory epidemiological studies are false positives (PPV = .20). To arrive at this estimate, he assumes that only 1 out of 11 hypotheses is true and that for every 1000 studies, bias adds only 1000*.30*.10*.20 = 6 true positive results compared to 1000*.30*.90*.95 = 257 false positive results (i.e., a 43:1 ratio). The assumed bias turns a PPV of 62% without bias into a PPV of 20% with bias. These untested assumptions are used to support the claim that “simulations show that for most study designs and settings, it is more likely for a research claim to be false than true.” (e124).
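The drop in PPV from 62% to 20% can be reproduced with a few lines of Python. This is a sketch of the bias model as described in this post, with names of my own choosing; it assumes that bias converts the same fraction of non-significant results for true and false hypotheses.

```python
def ppv_with_bias(pth, power, alpha, bias):
    """PPV when a fraction `bias` of otherwise non-significant results
    is reported as significant, for true and false hypotheses alike."""
    tp = pth * (power + bias * (1 - power))
    fp = (1 - pth) * (alpha + bias * (1 - alpha))
    return tp / (tp + fp)


PTH = 1 / 11   # one true hypothesis for every ten false ones (R = 1:10)

print(round(ppv_with_bias(PTH, 0.80, 0.05, 0.00), 2))   # 0.62
print(round(ppv_with_bias(PTH, 0.80, 0.05, 0.30), 2))   # 0.2
```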

Many of these assumptions can be challenged. For example, statisticians have pointed out that the null-hypothesis is unlikely to be true in most studies (Cohen, 1994). This does not mean that all published results are true, but Ioannidis’s claims rest on the opposite assumption that most hypotheses are a priori false. This makes little sense when the null-hypothesis is specified as an effect size of exactly zero, so that even a small effect size is sufficient for a hypothesis to be correct.

Ioannidis also ignores attempts to estimate the typical power of studies (Cohen, 1962). At least in psychology, the typical power is estimated to be around 50%. As shown in Table 2, even massive bias would still produce more true than false positive results at this level of power, as long as the null-hypothesis is true in fewer than 45% of all statistical tests.

In conclusion, Ioannidis’s claim that most published results are false depends heavily on untested assumptions and cannot be considered a factual assessment of the actual number of false results in published journals.

Testing Ioannidis’s Simulations

Ten years after the publication of “Why Most Published Research Findings Are False,” it is possible to put Ioannidis’s simulations to an empirical test. Powergraphs (Schimmack, 2015) can be used to estimate the average replicability of published test results. For this purpose, each test statistic is converted into a z-value. A powergraph is foremost a histogram of z-values. The distribution of z-values provides information about the average statistical power of published results because studies with higher power produce higher z-values.
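One way to make this conversion is to go through the two-sided p-value of each test. The sketch below shows this step with Python’s standard library; whether powergraphs use exactly this route is an assumption on my part, and `p_to_z` is a hypothetical helper name.

```python
from statistics import NormalDist


def p_to_z(p_two_sided):
    """Absolute z-value that has the given two-sided p-value."""
    return NormalDist().inv_cdf(1 - p_two_sided / 2)


# The significance criterion and the drug-trial example (p = .02)
for p in (0.05, 0.02):
    print(p, round(p_to_z(p), 2))   # 1.96 and 2.33
```

A histogram of such z-values for all reported tests is the raw material of a powergraph.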

Figure 1 illustrates the distribution of z-values that is expected for Ioannidis’s model of an “adequately powered exploratory epidemiological study” (Simulation 6 in Table 4). Ioannidis assumes that for every true hypothesis, there are 10 false hypotheses (R = 1:10). He also assumes that studies have 80% power to detect a true effect. In addition, he assumes 30% bias.

[Figure 1: ioannidis-fig6]

A 30% bias implies that for every 100 false hypotheses, there would be 33.5 (100*[.30*.95 + .05]) rather than 5 false positive results. The effect on true positives is much smaller: for every 100 true hypotheses, there would be 86 (100*[.30*.20 + .80]) rather than 80 true positive results. Given an assumed 1:10 ratio of true to false hypotheses, the ratio is 335 false positives to 86 true positives. The simulation assumed that researchers tested 100,000 false hypotheses and observed 33,500 false positive results and that they tested 10,000 true hypotheses and observed 8,600 true positive results. Bias was simulated by increasing the number of attempts to produce a significant result until the proportions of true and false positive results matched these predicted proportions.

Figure 1 only shows significant results because only significant results would be reported as positive results. Figure 1 shows that a high proportion of z-values are in the range between 1.96 (p = .05) and 3 (p = .001). Powergraphs use z-curve (Schimmack & Brunner, 2016) to estimate the probability that an exact replication study would replicate a significant result. In this simulation, this probability is a mixture of false positives and studies with 80% power. The true average probability is 20%. The z-curve estimate is 21%. Z-curve can also estimate the replicability for other sets of studies. The figure on the right shows replicability for studies that produced an observed z-score greater than 3 (p < .001). The estimate shows an average replicability of 59%. Thus, researchers can increase the chance of replicating published findings by adjusting the criterion value and ignoring significant results with p-values greater than p = .001, even if they were reported as significant with p < .05.
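The true average replicability of 20% follows from the mixture of false and true positives. Here is a minimal Python sketch under the assumptions of this simulation (R = 1:10, 80% power, 30% bias); the variable names are mine.

```python
# Expected significant results in the simulation behind Figure 1
false_hyp, true_hyp = 100_000, 10_000
alpha, power, bias = 0.05, 0.80, 0.30

false_pos = false_hyp * (alpha + bias * (1 - alpha))
true_pos = true_hyp * (power + bias * (1 - power))

# An exact replication is significant with probability alpha for a
# false positive and with probability `power` for a true positive.
replicability = (false_pos * alpha + true_pos * power) / (false_pos + true_pos)

print(round(false_pos), round(true_pos), round(replicability, 3))
# 33500 8600 0.203
```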

Figure 2 shows the distribution of z-values for Ioannidis’s example of a research program that produces more true than false positives, PPV = .85 (Simulation 1 in Table 4).

[Figure 2: ioannidis-fig1]

Visual inspection of Figure 1 and Figure 2 is sufficient to show that a robust research program produces a dramatically different distribution of z-values. The distribution of z-values in Figure 2 and a replicability estimate of 67% are impossible if most of the published significant results were false. The maximum value is obtained with a PPV of 50% and 100% power for the true positive results, which yields a replicability estimate of .05*.50 + 1*.50 = 52.5%. As power is much lower than 100%, the real maximum value is below 50%.

The powergraph on the right shows the replicability estimate for tests that produced a z-value greater than 3 (p < .001). As only a small proportion of false positives are included in this set, z-curve correctly estimates the average power of these studies as 80%. These examples demonstrate that it is possible to test Ioannidis’s claim that most published (significant) results are false empirically. The distribution of test results provides relevant information about the proportion of false positives and power. If actual data are more similar to the distribution in Figure 1, it is possible that most published results are false positives, although it is impossible to distinguish false positives from true positives with extremely low power. In contrast, if data look more like those in Figure 2, the evidence would contradict Ioannidis’s bold and unsupported claim that most published results are false.

The maximum replicability that could be obtained with 50% false positives would require that the true positive studies have 100% power. In this case, replicability would be .50*.05 + .50*1 = 52.5%. However, 100% power is unrealistic. Figure 3 shows the distribution for a scenario with 90% power and 100% bias and an equal percentage of true and false hypotheses. The true replicability for this scenario is .05*.50 + .90*.50 = 47.5%. Z-curve slightly overestimates replicability and produced an estimate of 51%. Even 90% power is unlikely in a real set of data. Thus, replicability estimates above 50% are inconsistent with Ioannidis’s hypothesis that most published positive results are false. Moreover, the distribution of z-values greater than 3 is also informative. If positive results are a mixture of many false positive results and true positive results with high power, the replicability estimate for z-values greater than 3 should be high. In contrast, if this estimate is not much higher than the estimate for all z-values, it suggests that there is a high proportion of studies that produced true positive results with low power.
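The upper bound on replicability is a simple weighted average. The sketch below states it as a Python function; the name `replicability` and its arguments are my own, and the function assumes that false positives replicate at a rate of alpha.

```python
def replicability(ppv, power, alpha=0.05):
    """Average probability that an exact replication is significant
    when a share `ppv` of the original positives are true positives."""
    return (1 - ppv) * alpha + ppv * power


print(round(replicability(0.5, 1.0), 3))   # 0.525
print(round(replicability(0.5, 0.9), 3))   # 0.475
```

Any PPV below 50% or any power below 100% pushes the value further down, which is why estimates above 50% speak against a majority of false positives.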

[Figure 3: ioannidis-fig3]

Empirical Evidence

I have produced powergraphs and replicability estimates for over 100 psychology journals (2015 Replicability Rankings). Not a single journal produced a replicability estimate below 50%. Below are a few selected examples.

The Journal of Experimental Psychology: Learning, Memory and Cognition publishes results from cognitive psychology. In 2015, a replication project (OSC, 2015) demonstrated that 50% of significant results produced a significant result in a replication study. It is unlikely that all of the original results that failed to replicate were false positives. Thus, the results show that Ioannidis’s claim that most published results are false does not apply to results published in this journal.

Powergraphs for JEP-LMC3.g

The powergraphs further support this conclusion. The graphs look a lot more like Figure 2 than Figure 1 and the replicability estimate is even higher than the one expected from Ioannidis’s simulation with a PPV of 85%.

Another journal that was subjected to replication attempts was Psychological Science. The success rate for Psychological Science was below 50%. However, it is important to keep in mind that a non-significant result in a replication study does not prove that the original result was a false positive. Thus, the PPV could still be greater than 50%.

Powergraphs for PsySci3.g

The powergraph for Psychological Science shows more z-values in the range between 2 and 3 (p > .001). Nevertheless, the replicability estimate is comparable to the one in Figure 2 which simulated a high PPV of 85%. Closer inspection of the results published in this journal would be required to determine whether a PPV below .50 is plausible.

The third journal that was subjected to a replication attempt was the Journal of Personality and Social Psychology. The journal has three sections, but I focus on the Attitude and Social Cognition section because many replication studies were from this section. The success rate of replication studies was only 25%. However, there is controversy about the reason for this high number of failed replications and once more it is not clear what percentage of failed replications were due to false positive results in the original studies.

Powergraphs for JPSP-ASC3.g

One problem with the journal rankings is that they are based on automated extraction of all test results. Ioannidis might argue that his claim focused only on test results that tested an original, novel, or an important finding, whereas articles also often report significance tests for other effects. For example, an intervention study may show a strong decrease in depression, when only the interaction with treatment is theoretically relevant.

I am currently working on powergraphs that are limited to theoretically important statistical tests. These results may show lower replicability estimates. Thus, it remains to be seen how consistent Ioannidis’s predictions are for tests of novel and original hypotheses. Powergraphs provide a valuable tool to address this important question.

Moreover, powergraphs can be used to examine whether science is improving. So far, powergraphs of psychology journals have shown no systematic improvement in response to concerns about high false positive rates in published journals. The powergraphs for 2016 will be published soon. Stay tuned.

 

Reexamining Cunningham, Preacher, and Banaji’s Multi-Method Model of Racism Measures

Article:
William A. Cunningham, Kristopher J. Preacher, and Mahzarin R. Banaji. (2001).
Implicit Attitude Measures: Consistency, Stability, and Convergent Validity, Psychological Science, 12(2), 163-170.

Abstract:
In recent years, several techniques have been developed to measure implicit social cognition. Despite their increased use, little attention has been devoted to their reliability and validity. This article undertakes a direct assessment of the interitem consistency, stability, and convergent validity of some implicit attitude measures. Attitudes toward blacks and whites were measured on four separate occasions, each 2 weeks apart, using three relatively implicit measures (response window evaluative priming, the Implicit Association Test, and the response-window Implicit Association Test) and one explicit measure (Modern Racism Scale). After correcting for interitem inconsistency with latent variable analyses, we found that (a) stability indices improved and (b) implicit measures were substantially correlated with each other, forming a single latent factor. The psychometric properties of response-latency implicit measures have greater integrity than recently suggested.

Critique of Original Article

This article has been cited 362 times (Web of Science, January 2017).  It still is one of the most rigorous evaluations of the psychometric properties of the race Implicit Association Test (IAT).  As noted in the abstract, the strength of the study is the use of several implicit measures and the repeated measurement of attitudes on four separate occasions.  This design makes it possible to separate several variance components in the race IAT.  First, it is possible to examine how much variance is explained by causal factors that are stable over time and shared by implicit and explicit attitude measures.  Second, it is possible to measure the amount of variance that is unique to the IAT.  As this component is not shared with other implicit measures, this variance can be attributed to systematic measurement error that is stable over time.  A third variance component is variance that is shared only with other implicit measures and that is stable over time. This variance component could reflect stable implicit racial attitudes.  Finally, it is possible to identify occasion specific variance in attitudes.  This component would reveal systematic changes in implicit attitudes.

The original article presents a structural equation model that makes it possible to identify some of these variance components.  However, the model is not ideal for this purpose and the authors do not test some of these variance components.  For example, the model does not include any occasion specific variation in attitudes.  This could be because attitudes do not vary over the one-month interval of the study, or it could mean that the model failed to specify this variance component.

This reanalysis also challenges the claim by the original authors that they provided evidence for a dissociation of implicit and explicit attitudes. “We found a dissociation between implicit and explicit measures of race attitude: Participants simultaneously self-reported nonprejudiced explicit attitudes toward black Americans while showing an implicit difficulty in associating black with positive attributes” (p. 169). The main problem is that the design does not permit this claim because the study included only a single explicit racism measure. Consequently, it is impossible to determine whether unique variance in the explicit measure reflects systematic measurement error in explicit attitude measures (socially desirable responding, acquiescence response styles) or whether this variance reflects consciously accessible attitudes that are distinct from implicit attitudes. In this regard, the authors’ claim that “a single-factor solution does not fit the data” (p. 170) is inconsistent with their own structural equation model that shows a single second-order factor that explains the covariance among the three implicit measures and the explicit measure.

The authors caution that a single IAT measure is not very reliable, but their statement about reliability is vague. “Our analyses of implicit attitude measures suggest that the degree of measurement error in response-latency measures can be substantial; estimates of Cronbach’s alpha indicated that, on average, more than 30% of the variance associated with the measurements was random error.” (p. 160). More than 30% random measurement error leaves a rather large range of reliability estimates ranging from 0% to 70%. The respective parameter estimates for the IAT in Figure 4 are .53^2 = .28, .65^2 = .42, .74^2 = .55, and .38^2 = .14. These reliability estimates vary considerably due to the small sample size, but the loading of the first IAT would suggest that only 19% of the variance in a single IAT is reliable. As reliability is the upper limit for validity, it would imply that no more than 20% of the variance in a single IAT captures variation in implicit racial attitudes.
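The step from loadings to reliable variance is just squaring. A short Python sketch, using the four loadings quoted above (the variable names are mine):

```python
# Standardized loadings of the four IAT measurements as quoted above
iat_loadings = [0.53, 0.65, 0.74, 0.38]

# The squared standardized loading is the share of variance in a single
# measurement that is explained by the latent factor.
reliable_variance = [round(loading ** 2, 2) for loading in iat_loadings]
print(reliable_variance)   # [0.28, 0.42, 0.55, 0.14]
```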

The authors caution readers about the use of a single IAT to measure implicit attitudes. “When using latency-based measures as indices of individual differences, it may be essential to employ analytic techniques, such as covariance structure modeling, that can separate measurement error from a measure of individual differences. Without such analyses, estimates of relationships involving implicit measures may produce misleading null results” (p. 169). However, the authors fail to mention that the low reliability of a single IAT also has important implications for the use of the IAT for the assessment of implicit prejudice. Given this low estimate of validity, users of the Harvard website that provides information about individuals’ performance on the IAT should be warned that the feedback is neither reliable nor valid by conventional standards for psychological tests.

Reanalysis of Published Correlation Matrix

The Table below reproduces the correlation matrix. The standard deviations in the last row are rescaled to avoid rounding problems. This has no effect on the results.

1
.80   1
.78 .82  1
.76 .77 .86   1
.21 .15 .15 .14   1
.13 .14 .10 .08 .31  1
.16 .26 .23 .20 .42 .50 1
.14 .17 .16 .13 .16 .33 .17 1
.20 .16 .19 .26 .33 .11 .23 .07 1
.26 .29 .18 .19 .20 .27 .36 .29 .26   1
.35 .33 .34 .25 .28 .29 .34 .33 .36 .39   1
.19 .17 .08 .07 .12 .25 .30 .14 .01 .17 .24 1
.00 .11 .07 .04 .27 .18 .19 .02 .03 .01 .02 .07 1
.16 .08 .04 .08 .26 .27 .24 .22 .14 .32 .32 .17 .13 1
.12 .01 .02 .07 .13 .19 .18 .00 .02 .00 .11 .04 .17 .30 1
.33 .18 .26 .31 .14 .24 .31 .15 .22 .20 .27 .04 .01 .48 .42 1

SD 0.84 0.82 0.88 0.86 2.2066 1.2951 1.0130 0.9076 1.2 1.0 1.1 1.0 0.7 0.8 0.8 0.9

1-4 = Modern Racism Scale (1-4); 5-8 = Implicit Association Test (1-4); 9-12 = Response Window IAT (1-4); 13-16 = Response Window Evaluative Priming (1-4)
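Structural equation modeling software usually needs a full covariance matrix rather than a lower triangle of correlations. The following Python sketch shows one way to build it from the values in the table; it is an illustration of the data preparation step, not the script used for the reanalysis, and all names are my own.

```python
LOWER_TRIANGLE = """
1
.80 1
.78 .82 1
.76 .77 .86 1
.21 .15 .15 .14 1
.13 .14 .10 .08 .31 1
.16 .26 .23 .20 .42 .50 1
.14 .17 .16 .13 .16 .33 .17 1
.20 .16 .19 .26 .33 .11 .23 .07 1
.26 .29 .18 .19 .20 .27 .36 .29 .26 1
.35 .33 .34 .25 .28 .29 .34 .33 .36 .39 1
.19 .17 .08 .07 .12 .25 .30 .14 .01 .17 .24 1
.00 .11 .07 .04 .27 .18 .19 .02 .03 .01 .02 .07 1
.16 .08 .04 .08 .26 .27 .24 .22 .14 .32 .32 .17 .13 1
.12 .01 .02 .07 .13 .19 .18 .00 .02 .00 .11 .04 .17 .30 1
.33 .18 .26 .31 .14 .24 .31 .15 .22 .20 .27 .04 .01 .48 .42 1
"""
SDS = [0.84, 0.82, 0.88, 0.86, 2.2066, 1.2951, 1.0130, 0.9076,
       1.2, 1.0, 1.1, 1.0, 0.7, 0.8, 0.8, 0.9]

# Parse the lower triangle: row i must contain i + 1 values
rows = [[float(v) for v in line.split()]
        for line in LOWER_TRIANGLE.strip().splitlines()]
n = len(rows)
assert all(len(row) == i + 1 for i, row in enumerate(rows))

# Mirror the triangle into a full symmetric correlation matrix
corr = [[rows[max(i, j)][min(i, j)] for j in range(n)] for i in range(n)]

# Covariance = correlation * SD_i * SD_j
cov = [[corr[i][j] * SDS[i] * SDS[j] for j in range(n)] for i in range(n)]

print(n, round(cov[0][0], 4), round(cov[1][0], 4))   # 16 0.7056 0.551
```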

[Figure 1: newmodel]

Fitting the data to the original model reproduced the original results. I then fitted the data to a model with a single attitude factor (see Figure 1). The model also allowed for measure-specific variances. An initial model showed no significant measure-specific variances for the two versions of the IAT. Hence, these method factors were not included in the final model. To control for variance that is clearly consciously accessible, I modeled the relationship between the explicit factor and the attitude factor as a path from the explicit factor to the attitude factor. This path should not be interpreted as a causal relationship in this case. Rather, the path can be used to estimate how much of the variance in the attitude factor is explained by consciously accessible information that influences the explicit measure. In this model, the residual variance is variation that is shared among implicit measures, but not with the explicit measure.

The model had good fit to the data.  I then imposed constraints on factor loadings.  The constrained model had better fit than the unconstrained model (delta AIC = 4.60, delta BIC = 43.53).  The main finding is that the standard IAT had a loading of .55 on the attitude factor.  The indirect path from the implicit attitude factor to a single IAT measure is only slightly smaller, .55*.92 = .51.  The 95%CI for this parameter ranged from .41 to .60.  The upper bound of the 95%CI would imply that at most 36% of the variance in a single IAT reflects implicit racial attitudes.  However, it is important to note that the model in Figure 1 assumes that the Modern Racism Scale is a perfectly valid measure of consciously accessible attitudes. Any systematic measurement error in the Modern Racism Scale would reduce the amount of variance in the attitude factor that reflects unconscious factors.  Again, the lack of multiple explicit measures makes it impossible to separate systematic measurement error from valid variance in explicit measures.  Thus, the amount of variance in a single IAT that reflects unconscious racial attitudes can range from 0 to 36%.
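The variance estimates follow from the path coefficients by multiplication and squaring. A small Python sketch with the rounded values reported above; because the inputs are rounded, the results can differ by a point from values based on unrounded estimates, and the variable names are my own.

```python
loading_on_attitude = 0.55   # standard IAT on the attitude factor
residual_path = 0.92         # second coefficient in the product quoted above

indirect = loading_on_attitude * residual_path
print(round(indirect, 2))              # 0.51

# Squaring a standardized path gives the share of explained variance (%)
for bound in (0.41, indirect, 0.60):
    print(round(bound ** 2 * 100))     # 17, 26, 36
```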

How Variable are Implicit Racial Attitudes?

The design included repeated measurement of implicit attitudes on four occasions. If recent experiences influence implicit attitudes, we would expect that implicit measures of attitudes on the same occasion are more highly correlated with each other than implicit measures taken on different occasions. Given the low validity of implicit attitude measures, I examined this question with constrained parameters. By estimating a single parameter, the model has more power to reveal a consistent relationship between implicit measures that were obtained during the same testing session. Neither the two IATs, nor the IAT and the evaluative priming task (EP) showed significant occasion-specific variance. Although this finding may be due to low power to detect occasion-specific variation, it suggests that most of the variance in an IAT is due to stable variation and random measurement error.

Conclusion

Cunningham et al. (2001) conducted a rigorous psychometric study of the Implicit Association Test. The original article reported results that could be reproduced. The authors correctly interpret their results as evidence that a single IAT has low reliability. However, they falsely imply that their results provide evidence that the IAT and other implicit measures are valid measures of an implicit form of racism that is not consciously accessible. My new analysis shows that their results are consistent with this hypothesis, if one assumes that the Modern Racism Scale is a perfectly valid measure of consciously accessible racial attitudes. Under this assumption, about 25% (95%CI 16-36) of the variance in a single IAT would reflect implicit attitudes. However, it is rather unlikely that the Modern Racism Scale is a perfect measure of explicit racial attitudes, and the amount of variance in performance on the IAT that reflects unconscious racism is likely to be smaller. Another important finding that was implicit, but not explicitly mentioned, in the original model is that there is no evidence for situation-specific variation in implicit attitudes. At least over the one-month period of the study, racial attitudes remained stable and did not vary as a function of naturally occurring events that might influence racial attitudes (e.g., positive or negative intergroup contact). This finding may explain why experimental manipulations of implicit attitudes also often produce very small effects (Joy-Gaba & Nosek, 2010).

One surprising finding was that the IAT showed no systematic measurement error in this model. This would imply that repeated measures of the IAT could be used to measure racial attitudes with high validity.  Unfortunately, most studies with the IAT rely on a single testing situation and ignore that most of the variance in a single IAT is measurement error.  To improve research on racial attitudes and prejudice, social psychologists should use multiple explicit and implicit measures and use structural equation models to examine which variance components of a measurement model of racial attitudes predict actual behavior.

Validity of the Implicit Association Test as a Measure of Implicit Attitudes

This blog post reports the results of an analysis of correlations among 4 explicit and 3 implicit attitude measures published by Ranganath, Smith, and Nosek (2008).

Original article:
Kate A. Ranganath, Colin Tucker Smith, & Brian A. Nosek (2008). Distinguishing automatic and controlled components of attitudes from direct and indirect measurement methods. Journal of Experimental Social Psychology 44 (2008) 386–396; doi:10.1016/j.jesp.2006.12.008

Abstract
Distinct automatic and controlled processes are presumed to influence social evaluation. Most empirical approaches examine automatic processes using indirect methods, and controlled processes using direct methods. We distinguished processes from measurement methods to test whether a process distinction is more useful than a measurement distinction for taxonomies of attitudes. Results from two studies suggest that automatic components of attitudes can be measured directly. Direct measures of automatic attitudes were reports of gut reactions (Study 1) and behavioral performance in a speeded self-report task (Study 2). Confirmatory factor analyses comparing two factor models revealed better fits when self-reports of gut reactions and speeded self-reports shared a factor with automatic measures versus sharing a factor with controlled self-report measures. Thus, distinguishing attitudes by the processes they are presumed to measure (automatic versus controlled) is more meaningful than distinguishing based on the directness of measurement.

Description of Original Study

Study 1 measured relative attitudes towards heterosexuals and homosexuals with seven measures: four explicit measures and three reaction time tasks. Specifically, the four explicit measures were:

Actual = Participants were asked to report their “actual feelings” towards gay and straight people when given enough time for full consideration on a scale ranging from 1=very negative to 8 = very positive.

Gut = Participants were asked to report their “gut reaction” towards gay and straight people when given enough time for full consideration on a scale ranging from 1=very negative to 8 = very positive.

Time0 and Time5: A second explicit rating task assessed an “attitude timeline”. Participants reported their attitudes toward the two groups at multiple time points: (1) instant reaction, (2) reaction a split-second later, (3) reaction after 1 s, (4) reaction after 5 s, and (5) reaction when given enough time to think fully. Only the first (Time0) and the last (Time5) rating were included in the model.

The three reaction time measures were the Implicit Association Test (IAT), the Go-NoGo Association Test (GNAT), and a Four-Category Sorting Paired Features Task (SPF). All three measures use differences in response times to measure attitudes.

Table A1 in the Appendix reported the correlations among the seven tasks.

         IAT   GNAT  SPF   Gut   Actual  Time0  Time5
IAT      1
GNAT     .36   1
SPF      .26   .18   1
Gut      .23   .33   .12   1
Actual   .16   .31   .01   .65   1
Time0    .19   .31   .16   .85   .50     1
Time5    .01   .24   .01   .54   .81     .50    1

The authors tested a variety of structural equation models. The best fitting model, preferred by the authors, was a model with three correlated latent factors. “In this three-factor model, self-reported gut feelings (Gut Feeling, Instant Feeling) comprised their own attitude factor distinct from a factor comprised of the indirect, automatic measures (IAT, GNAT, SPF) and from a factor comprised of the direct, controlled measures (Actual Feeling, Fully Considered Feeling). The data were an excellent fit (chi^2(12) = 10.8).”

The authors then state “while self-reported gut feelings were more similar to the indirect measures than to the other self-reported attitude measures, there was some unique variance in self-reported gut feelings that was distinct from both” (p. 391), and they go on to speculate that “one possibility is that these reports are a self-theory that has some but not complete correspondence with automatic evaluations” (p. 391). They also consider the possibility that “measures like the IAT, GNAT, and SPF partly assess automatic evaluations that are ‘experienced’ and amenable to introspective report, and partly evaluations that are not” (p. 391). But they favor the hypothesis that “self-report of ‘gut feelings’ is a meaningful account of some components of automatic evaluation” (p. 391). They interpret these results as strong support for their “contention that a taxonomy of attitudes by measurement features is not as effective as one that distinguishes by presumed component processes” (p. 391). The conclusion reiterates this point: “The present studies suggest that attitudes have distinct but related automatic and controlled factors contributing to social evaluation and that parsing attitudes by underlying processes is superior to parsing attitude measures by measurement features” (p. 393). Surprisingly, the authors do not mention the three-factor model in the Discussion and instead claim support for a two-factor model that distinguishes processes rather than measures (explicit vs. implicit). “In both studies, model comparison using confirmatory factor analysis showed the data were better fit to a two-factor model distinguishing automatic and controlled components of attitudes than to a model distinguishing attitudes by whether they were measured directly or indirectly” (p. 393). The authors then suggest that some explicit measures (ratings of gut reactions) can measure automatic attitudes: “These findings suggest that direct measures can be devised to capture automatic components of attitudes despite suggestions that indirect measures are essential for such assessments” (p. 393).

New Analysis 

The main problem with this article is that the authors never report parameter estimates for the model. Depending on the pattern of correlations among the three factors and the factor loadings, the interpretation of the results can change. I first tried to fit the three-factor model to the published correlation matrix (treated as a covariance matrix with all variances set to 1). Mplus 7.1 reported a negative residual variance for Actual. The model also had one fewer degree of freedom than the published model. However, fixing the residual variance of Actual did not solve the problem. I then proceeded to fit my own model. It is essentially the same as the three-factor model, except that I modeled the correlations among the three latent factors with a single higher-order factor. This factor represents variation in common causes that influence all attitude measures. The problem of negative variance in the Actual measure was solved by allowing for an extra correlation between the Actual and Gut ratings. As seen in the correlation table, these two explicit measures correlated more highly with each other (r = .65) than the corresponding Time0 and Time5 measures (rs = .54, .50). As in the original article, model fit was good (see Figure 1). Figure 1 shows, for the first time, the parameter estimates of the model.

[Figure 1: attitude-multi-method]

The loadings of the explicit measures on the primary latent factors are above .80. For single-item measures, this implies that these ratings are essentially measuring the same construct with some random error. Thus, the latent factors can be interpreted as explicit ratings of affective responses immediately or after some reflection. The loadings of these two factors on the higher-order factor show that reflective and immediate responses are strongly influenced by the common factor. This is not surprising. Reflection may alter the immediate response somewhat, but it is unlikely to reverse or dramatically change the response a few seconds later. Interestingly, the immediate response has a higher loading on the attitude factor, although in this small sample the difference in loadings is not significant (chi^2(1) = 0.22).

The third primary factor represents the shared variance among the three reaction time measures. It also loads on the general attitude factor, but the loading is weaker than the loadings for the explicit measures. The parameter estimates suggest that about 25% of the variance in this factor is explained by the common attitude (.51^2) and 75% is unique to the reaction time measures. This variance component can be interpreted as unique variance in implicit measures. The factor loadings of the three reaction time measures are also relevant. The loading of the IAT suggests that only 28% (.53^2) of the observed variance in the IAT reflects the effect of causal factors that influence reaction time measures of attitudes. As some of this variance is also shared with explicit measures, only 21% ((.86*.53)^2) of the variance in the IAT represents variance in the implicit attitude factor.

This has important implications for the use of the IAT to examine potential effects of implicit attitudes on behavior. Even if implicit attitudes had a strong effect on a behavior (r = .5), the correlation between IAT scores and the behavior would only be r = .86*.53*.5 = .23. A sample size of N = 146 participants would be needed to have 80% power to provide significant evidence for such a relationship (p < .05, two-tailed). Given a more modest effect of attitudes on behavior, r = .86*.53*.30 = .14, the sample size would need to be larger (N = 398). As many studies of implicit attitudes and behavior used smaller samples, we would expect many non-significant results, unless non-significant results remain unreported and published results report inflated effect sizes.

One solution to the problem of low power in studies of implicit attitudes would be the use of multiple implicit attitude measures. This study suggests that a battery of different reaction time tasks can be used to remove random and task-specific measurement error. Such a multi-method approach to the measurement of implicit attitudes is highly recommended for future studies because it would also help to interpret results of studies in which implicit attitudes do not influence behavior. If a set of implicit measures shows convergent validity but still fails to predict the behavior, this finding would indicate that implicit attitudes did not influence the behavior. In contrast, a null result with a single implicit measure may simply show that the measure failed to measure implicit attitudes.
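The arithmetic in this section can be checked with a few lines of Python. This is only a sketch under stated assumptions, not the analysis that produced the figure: the loadings .51 and .53 are taken from the text, the .86 is assumed to be the residual path of the reaction-time factor (the square root of 1 - .51^2), and the sample sizes are obtained with the common Fisher z approximation rather than an exact power calculation. The helper name required_n is made up for illustration.

```python
import math
from statistics import NormalDist

# Loadings quoted in the text.
LOADING_RT_ON_GENERAL = 0.51  # reaction-time factor on the general attitude factor
LOADING_IAT_ON_RT = 0.53      # IAT on the reaction-time factor

# Assumption: the .86 used in the text is the residual (unique) path of the
# reaction-time factor, i.e. sqrt(1 - .51^2).
unique_path = math.sqrt(1 - LOADING_RT_ON_GENERAL ** 2)

shared_with_general = LOADING_RT_ON_GENERAL ** 2                # share explained by the common attitude
iat_rt_variance = LOADING_IAT_ON_RT ** 2                        # IAT variance due to the reaction-time factor
iat_implicit_variance = (unique_path * LOADING_IAT_ON_RT) ** 2  # IAT variance due to the unique implicit part


def required_n(r, power=0.80, alpha=0.05):
    """Approximate N needed to detect a correlation r (two-tailed), using Fisher's z."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ((z_alpha + z_power) / math.atanh(r)) ** 2 + 3


print(f"unique path of the reaction-time factor: {unique_path:.2f}")
print(f"IAT variance tied to the implicit factor: {iat_implicit_variance:.2f}")
for r in (0.23, 0.14):
    print(f"expected r = {r:.2f}: N of roughly {required_n(r):.0f}")
```

With the rounded correlations of .23 and .14, the approximation should land close to the sample sizes of 146 and 398 given above. Feeding in the unrounded products (.86*.53*.5 and .86*.53*.3) gives somewhat larger values, which shows how sensitive the required N is to small changes in the expected correlation.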

Conclusion

This article reported some interesting data, but failed to report the actual results. This analysis of the data showed that explicit measures are highly correlated with each other and show discriminant validity from implicit, reaction time measures. The new analysis also made it possible to estimate the amount of variance in the Implicit Association Test that is not shared with explicit measures but is shared with other implicit measures. The estimate of about 20% suggests that most of the variance in the IAT is due to factors other than implicit attitudes and that the test cannot be used to diagnose individuals. Whether the 20% of variance that is uniquely shared with other implicit measures reflects unconscious attitudes or method variance that is common to reaction time tasks remains unclear. The model also implies that the predictive validity of a single IAT for prejudiced behavior is expected to be small to moderate (r < .30), which means large samples are needed to study the effect of implicit attitudes on behavior.


Replicability Review of 2016

2016 was surely an exciting year for anybody interested in the replicability crisis in psychology. Some of the biggest news stories in 2016 came from attempts by the psychology establishment to downplay the replication crisis in psychological research (Wired Magazine). At the same time, 2016 delivered several new replication failures that provide further ammunition for the critics of established research practices in psychology.

I. The Empire Strikes Back

1. The Open Science Collaboration Reproducibility Project was flawed.

Daniel Gilbert and Tim Wilson published a critique of the Open Science Collaboration in Science. According to Gilbert and Wilson, the project, which replicated 100 original research studies and reported that only 36% could be replicated, was error-riddled. Consequently, the low success rate only reveals the incompetence of the replicators and has no implications for the replicability of original studies published in prestigious psychological journals like Psychological Science. Science Daily suggested that the critique overturned the landmark study.

Nature published a more balanced commentary. In an interview, Gilbert explains that “the number of studies that actually did fail to replicate is about the number you would expect to fail to replicate by chance alone — even if all the original studies had shown true effects.” This quote is rather strange if we really consider the replication studies to be flawed and error-riddled. If the replication studies were bad, we would expect fewer studies to replicate than we would expect based on chance alone. If the success rate of 36% is consistent with the effect of chance alone, the replication studies are just as good as the original studies and the only reason for non-significant results would be chance. Thus, Gilbert’s comment implies that he believes the typical statistical power of a study in psychology is about 36%. Gilbert doesn’t seem to realize that he is inadvertently admitting that published articles report vastly inflated success rates, because 97% of the original studies reported a significant result. To report 97% significant results with an average power of 36%, researchers are either hiding studies that failed to support their favored hypotheses in proverbial file-drawers or they are using questionable research practices to inflate evidence in favor of their hypotheses. Thus, ironically, Gilbert’s comments support the critics of the establishment, who argue that the low success rate in the reproducibility project can be explained by selective reporting of evidence that supports authors’ theoretical predictions.
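To see how far apart 97% and 36% are, one can ask how likely it would be to obtain at least 97 significant results in 100 honest attempts if every study had 36% power. The following is a back-of-the-envelope sketch: it assumes 100 independent studies with identical power that all test true effects, which is a simplification and not a description of the actual literature. The function name is illustrative.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumptions: 100 independent studies, each with 36% power, no selection.
p_97_or_more = prob_at_least(97, 100, 0.36)
print(f"P(at least 97 of 100 significant | power = .36) = {p_97_or_more:.1e}")
```

Under these assumptions the probability is vanishingly small, so a 97% success rate cannot plausibly arise without selection for significance.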

2. Contextual Sensitivity Explains Replicability Problem in Social Psychology

Jay van Bavel and colleagues made a second attempt to downplay the low replicability of published results in psychology. He even got to write about it in the New York Times.

Van Bavel blames the Open Science Collaboration for overlooking the importance of context. “Our results suggest that many of the studies failed to replicate because it was difficult to recreate, in another time and place, the exact same conditions as those of the original study.” This statement caused a lot of bewilderment. First, the OSC carefully tried to replicate the original studies as closely as possible. At the same time, they were sensitive to the effect of context. For example, if a replication study of an original study in the US was carried out in Germany, stimulus words were translated from English into German because one might expect that native German speakers might not respond the same way to the original English words as native English speakers. However, the switching of languages means that the replication study is not identical to the original study. Maybe the effect can only be obtained with English speakers. And if the study was conducted at Harvard, maybe the effect can only be replicated with Harvard students. And if the study was conducted primarily with female students, it may not replicate with male students.

To provide evidence for his claim, Jay van Bavel obtained subjective ratings of contextual sensitivity. That is, raters guessed how sensitive the outcome of a study is to variations in the context. These ratings were then used to predict the success of the 100 replication studies in the OSC project.

Jay van Bavel proudly summarized the results in the NYT article. “As we predicted, there was a correlation between these context ratings and the studies’ replication success: The findings from topics that were rated higher on contextual sensitivity were less likely to be reproduced. This held true even after we statistically adjusted for methodological factors like sample size of the study and the similarity of the replication attempt. The effects of some studies could not be reproduced, it seems, because the replication studies were not actually studying the same thing.”

The article leaves out a few important details. First, the correlation between contextual sensitivity ratings and replication success was small, r = .20. Thus, even if contextual sensitivity contributed to replication failures, it would only explain replication failures for a small percentage of studies. Second, the authors used several measures of replicability and some of these measures failed to show the predicted relationship. Third, the statement makes the elementary mistake of confusing correlation and causality. The authors merely demonstrated that subjective ratings of contextual sensitivity predicted outcomes of replication studies. They did not show that contextual sensitivity caused replication failures. Most important, Jay van Bavel failed to mention that they also conducted an analysis that controlled for discipline. The Open Science Collaboration had already demonstrated that studies in cognitive psychology are more replicable (50% success rate) than studies in social psychology (an awful 25%). In an analysis that controlled for differences in disciplines, contextual sensitivity was no longer a statistically significant predictor of replication failures. This hidden fact was revealed in a commentary (or should we say correction) by Yoel Inbar. In conclusion, this attempt at propping up the image of social psychology as a respectable science with replicable results turned out to be another embarrassing example of sloppy research methodology.

3. Anti-Terrorism Manifesto by Susan Fiske

Later that year, Susan Fiske, a former president of the Association for Psychological Science (APS), caused a stir by comparing critics of established psychology to terrorists (see Business Insider article). She later withdrew the comparison to terrorists in response to the criticism of her remarks on social media (APS website).

Fiske attempted to defend established psychology by arguing that established psychology is self-correcting and does not require self-appointed social-media vigilantes. She claimed that these criticisms were destructive and damaging to psychology.

“Our field has always encouraged — required, really — peer critiques.”

“To be sure, constructive critics have a role, with their rebuttals and letters-to-the-editor subject to editorial oversight and peer review for tone, substance, and legitimacy.”

“One hopes that all critics aim to improve the field, not harm people. But the fact is that some inappropriate critiques are harming people. They are a far cry from temperate peer-reviewed critiques, which serve science without destroying lives.”

Many critics of established psychology did not share Fiske’s rosy and false description of the way psychology operates.  Peer-review has been shown to be a woefully unreliable process. Moreover, the key criterion for accepting a paper is that it presents flawless results that seem to support some extraordinary claims (a 5-minute online manipulation reduces university drop-out rates by 30%), no matter how these results were obtained and whether they can be replicated.

In her commentary, Fiske is silent about the replication crisis and does not reconcile her image of a critical peer-review system with the fact that only 25% of social psychological studies are replicable and some of the most celebrated findings in social psychology (e.g., elderly priming) are now in doubt.

The rise of blogs and Facebook groups that break with the rules of the establishment poses a threat to the APS establishment, whose main goal is lobbying for psychological research funding in Washington. By trying to paint critics of the establishment as terrorists, Fiske tried to dismiss criticism of established psychology without having to engage with the substantive arguments why psychology is in crisis.

In my opinion, her attempt to do so backfired. The response to her column showed that the reform movement is gaining momentum and that few young researchers are willing to prop up a system that is more concerned about publishing articles and securing grant money than about making real progress in understanding human behavior.

II. Major Replication Failures in 2016

4. Epic Failure to Replicate Ego-Depletion Effect in a Registered Replication Report

Ego-depletion is a big theory in social psychology and the inventor of the ego-depletion paradigm, Roy Baumeister, is arguably one of the biggest names in contemporary social psychology. In 2010, a meta-analysis seemed to confirm that ego-depletion is a highly robust and replicable phenomenon. However, this meta-analysis failed to take publication bias into account. In 2014, a new meta-analysis revealed massive evidence of publication bias. It also found that there was no statistically reliable evidence for ego-depletion after taking publication bias into account (Slate, Huffington Post).

A team of researchers, including the first author of the supportive meta-analysis from 2010, conducted replication studies, using the same experiment in 24 different labs. Each of these studies alone would have had a low probability of detecting a small ego-depletion effect, but the combined evidence from all 24 labs made it possible to detect an ego-depletion effect even if it were much smaller than published articles suggest. Yet, the project failed to find any evidence for an ego-depletion effect, suggesting that it is much harder to demonstrate ego-depletion effects than one would believe based on over 100 published articles with successful results.

Critics of Baumeister’s research practices (Schimmack) felt vindicated by this stunning failure. However, even proponents of ego-depletion theory (Inzlicht) acknowledged that ego-depletion theory lacks a strong empirical foundation and that it is not clear what 20 years of research on ego-depletion have taught us about human self-control.

Not so, Roy Baumeister.  Like a bank that is too big to fail, Baumeister defended ego-depletion as a robust empirical finding and blamed the replication team for the negative outcome.  Although he was consulted and approved the design of the study, he later argued that the experimental task was unsuitable to induce ego-depletion. It is not hard to see the circularity in Baumeister’s argument.  If a study produces a positive result, the manipulation of ego-depletion was successful. If a study produces a negative result, the experimental manipulation failed. The theory is never being tested because it is taken for granted that the theory is true. The only empirical question is whether an experimental manipulation was successful.

Baumeister also claimed that his own lab has been able to replicate the effect many times, without explaining the strong evidence for publication bias in the ego-depletion literature and the results of a meta-analysis that showed results from his own lab are no different from results from other labs.

A related article by Baumeister in a special issue on the replication crisis in psychology was another highlight in 2016.  In this article, Baumeister introduced the concept of FLAIR.

[Image: Scientist with FLAIR]

Baumeister writes “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants. Over the years the norm crept up to about n = 20. Now it seems set to leap to n = 50 or more.” (JESP, 2016, p. 154). He misses the good old days and suggests that the old system rewarded researchers with flair. “Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.” (JESP, 2016, p. 156).

This quote explains the low replication rate in social psychology and the failure to replicate ego-depletion effects. It is simply not possible to conduct studies with n = 10 and be successful in most studies because empirical studies in psychology are subject to sampling error. Each study with n = 10 on a new sample of participants will produce dramatically different results because samples of n = 10 are very different from each other. This is a fundamental fact of empirical research that appears to elude one of the most successful empirical social psychologists. So, a researcher with FLAIR may set up a clever experiment with a strong manipulation (e.g., having participants smell chocolate cookies and then eat radishes instead) and get a significant result. But this is not a replicable finding. For every study with flair that worked, there are numerous studies that did not work. However, researchers with flair ignore these failed studies, focus on the studies that worked, and then use these studies for publication. It can be shown statistically that they do, as I did with Baumeister’s glucose studies (Schimmack, 2012) and Baumeister’s ego-depletion studies in general (Schimmack, 2016). So, a researcher who gets significant results with small samples (n = 10) surely has FLAIR (False, Ludicrous, And Incredible Results).
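The role of sampling error with n = 10 per cell can be illustrated with a small simulation. This is a sketch with invented inputs: the true effect size of d = 0.5, the normally distributed scores, and the number of simulated studies are all assumptions chosen for illustration, and the function names are made up.

```python
import random
import statistics

N_PER_CELL = 10
T_CRIT_DF18 = 2.101  # two-tailed .05 critical value of t with 18 degrees of freedom (2 * 10 - 2)


def t_statistic(group_a, group_b):
    """Student's t for two independent groups of equal size."""
    n = len(group_a)
    pooled_var = (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    return (statistics.mean(group_a) - statistics.mean(group_b)) / (2 * pooled_var / n) ** 0.5


def share_significant(effect_size, n_studies=5000, seed=1):
    """Fraction of simulated two-group studies with |t| above the critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        treatment = [rng.gauss(effect_size, 1) for _ in range(N_PER_CELL)]
        control = [rng.gauss(0, 1) for _ in range(N_PER_CELL)]
        if abs(t_statistic(treatment, control)) > T_CRIT_DF18:
            hits += 1
    return hits / n_studies


power_medium = share_significant(0.5)    # hypothetical true effect of d = 0.5
false_positive = share_significant(0.0)  # no true effect at all
print(f"share significant with d = 0.5: {power_medium:.2f}")
print(f"share significant with d = 0.0: {false_positive:.2f}")
```

Even with a real, medium-sized effect, only a minority of such small studies reach significance, so a lab that reports nearly all successes with n = 10 must be leaving many attempts unreported.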

Baumeister’s article contained additional insights into the research practices that fueled a highly productive and successful career. For example, he distinguishes between researchers who report boring true positive results and interesting researchers who publish interesting false positive results. He argues that science needs both types of researchers. Unfortunately, most people assume that scientists prioritize truth, which is the main reason for subjecting theories to empirical tests. But scientists with FLAIR get positive results even when their interesting ideas are false (Bem, 2011).

Baumeister mentions psychoanalysis as an example of interesting psychology. What could be more interesting than the Freudian idea that every boy goes through a phase where he wants to kill daddy and make love to mommy? Interesting stuff, indeed, but this idea has no scientific basis. In contrast, twin studies suggest that many personality traits, values, and abilities are partially inherited. To reveal this boring fact, it was necessary to recruit large samples of thousands of twins. That is not something a psychologist with FLAIR can handle. “When I ran my own experiments as a graduate student and young professor, I struggled to stay motivated to deliver the same instructions and manipulations through four cells of n=10 each. I do not know how I would have managed to reach n=50. Patient, diligent researchers will gain, relative to others” (Baumeister, JESP, 2016, p. 156). So, we may see the demise of researchers with FLAIR, and diligent and patient researchers who listen to their data may take their place. Now there is something to look forward to in 2017.

[Image: Scientist without FLAIR]

5. No Laughing Matter: Replication Failure of Facial Feedback Paradigm

A second Registered Replication Report (RRR) delivered another blow to the establishment. This project replicated a classic study on the facial-feedback hypothesis. Like other peripheral emotion theories, facial-feedback theories assume that experiences of emotions depend (fully or partially) on bodily feedback. That is, we feel happy because we smile, rather than smile because we are happy. Numerous studies had examined the contribution of bodily feedback to emotional experience and the evidence was mixed. Moreover, studies that found effects had a major methodological problem. Simply asking participants to smile might make them think happy thoughts, which could elicit positive feelings. In the 1980s, social psychologist Fritz Strack invented a procedure that solved this problem (see Slate article). Participants are deceived to believe that they are testing a procedure for handicapped people to complete a questionnaire by holding a pen in their mouth. Participants who hold the pen with their lips are activating muscles that are activated during sadness. Participants who hold the pen with their teeth activate muscles that are activated during happiness. Thus, randomly assigning participants to one of these two conditions made it possible to manipulate facial muscles without making participants aware of the associated emotion. Strack and colleagues reported two experiments that showed effects of the experimental manipulation. Or did they? It depends on the statistical test being used.

Experiment 1 had three conditions. The control group did the same study without manipulation of the facial muscles. The dependent variable was funniness ratings of cartoons. The mean funniness of cartoons was highest in the smile condition, followed by the control condition, and lowest in the frown condition. However, a commonly used Analysis of Variance would not have produced a significant result. A two-tailed t-test also would not have produced a significant result. However, a linear contrast with a one-tailed t-test produced a just significant result, t(89) = 1.85, p = .03. So, Fritz Strack was rather lucky to get a significant result. Sampling error could have easily changed the pattern of means slightly and even the directional test of the linear contrast would not have been significant. At the same time, sampling error might have worked against the facial feedback hypothesis, in which case the real effect is stronger than this study suggests. In this case, we would expect to see stronger evidence in Study 2. However, Study 2 failed to show any effect on funniness ratings of cartoons. “As seen in Table 2, subjects’ evaluations of the cartoons were hardly affected under the different experimental conditions. The ANOVA showed no significant main effects or interactions, all ps > .20” (Strack et al., 1988). However, Study 2 also included amusement ratings, and the amusement ratings once more showed a just significant result with a one-tailed t-test, t(75) = 1.78, p = .04. The article also provides an explanation for the just-significant result in Study 1, even though Study 1 used funniness ratings of cartoons. When participants are not asked to differentiate between their subjective feelings of amusement and the objective funniness of cartoons, subjective feelings influence ratings of funniness, but given a chance to differentiate between the two, subjective feelings no longer influence funniness ratings.
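The difference between the one-tailed and two-tailed tests can be made concrete by computing both p-values for the two reported t-values. The sketch below uses only the Python standard library and integrates the t density numerically with Simpson's rule; that is an implementation choice made here for self-containment, not how the original p-values were computed, and the function names are illustrative.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))


def one_tailed_p(t, df, steps=2000):
    """P(T >= t) for t >= 0, using Simpson's rule on the density between 0 and t."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


for label, t, df in [("Study 1, funniness", 1.85, 89), ("Study 2, amusement", 1.78, 75)]:
    p_one = one_tailed_p(t, df)
    print(f"{label}: t({df}) = {t}, one-tailed p = {p_one:.3f}, two-tailed p = {2 * p_one:.3f}")
```

Both one-tailed p-values fall just below .05, while doubling them for a two-tailed test pushes both above .05, which is the sense in which the original results were only "just significant".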

For 25 years, this article was uncritically cited as evidence for the facial feedback hypothesis, but none of the 17 labs that participated in the RRR were able to produce a significant result. More important, even an analysis with the combined power of all studies failed to detect an effect.  Some critics pointed out that this result successfully replicates the finding of the original two studies that also failed to report statistically significant results by conventional standards of a two-tailed test (or z > 1.96).

Given the shaky evidence in the original article, it is not clear why Fritz Strack volunteered his study for a replication attempt. However, it is easier to understand his response to the results of the RRR. He does not take the results seriously. He would rather believe his two original, marginally significant studies than the 17 replication studies.

“Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.”  (Slate).

One of the most bizarre statements by Strack can only be interpreted as revealing a shocking lack of understanding of probability theory.

“So when Strack looks at the recent data he sees not a total failure but a set of mixed results. Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed? Maybe there’s a reason why half the labs could not elicit the effect.” (Slate).

This is like a roulette player who after a night of gambling sees 49% wins and 49% losses and ponders why 49% of the attempts produced losses. Strack does not seem to realize that results of individual studies vary simply by chance, just like roulette balls produce different results by chance. Some people find cartoons funnier than others and the mean will depend on the allocation of these individuals to the different groups. This is called sampling error, and this is why we need to do statistical tests in the first place. And apparently it is possible to become a famous social psychologist without understanding the purpose of computing and reporting p-values.

And the full force of defense mechanisms is apparent in the next statement.  “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. (Slate).

No, there were not eight non-replications. There were 17!  We would expect half of the studies to match the direction of the original effect simply due to chance alone.
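The claim that a 9-to-8 split is what chance alone would produce can be checked directly. The sketch below assumes a true effect of exactly zero, so that each lab's result is equally likely to fall in either direction, and it ignores ties; the function name is illustrative.

```python
from math import comb

N_LABS = 17


def prob_k_or_more_in_direction(k, n=N_LABS):
    """P(at least k of n labs show the original direction) if each direction is a coin flip."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n


p_nine_or_more = prob_k_or_more_in_direction(9)
print(f"P(9 or more of 17 labs in the predicted direction | no effect) = {p_nine_or_more:.2f}")
```

Because 17 is odd, a majority in one direction or the other is guaranteed, and with no true effect a majority of at least nine labs in the predicted direction occurs half of the time. A 9-to-8 split is therefore exactly the kind of pattern a null effect produces.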

But this is not all.  Strack even accused the replication team of “reverse p-hacking.” (Strack, 2016).  The term p-hacking was coined by Simmons et al. (2011) to describe a set of research practices that can be used to produce statistically significant results in the absence of a real effect (fabricating false positives).  Strack turned it around and suggested that the replication team used statistical tricks to make the facial feedback effect disappear.  “Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.” (p. 930).

However, the statistical anomaly that requires explanation could just be sampling error (Hillgard), and it is actually the wrong statistical pattern for a claim of reverse p-hacking. Reverse p-hacking implies that some studies did produce a significant result, but statistical tricks were used to report the result as non-significant. This would lead to a restriction in the variability of results across studies, which can be detected with the Test of Insufficient Variance (Schimmack, 2015), but there is no evidence for reverse p-hacking in the RRR.

Fritz Strack also tried to make his case on social media, but there was very little support for his view that 17 failed replication studies can be ignored (PsychMAP thread).

Strack’s desperate attempts to defend his famous original study in the light of a massive replication failure provide further evidence for the inability of the psychology establishment to face the reality that many celebrated discoveries in psychology rest on shaky evidence and a mountain of repressed failed studies.

Meanwhile, the Test of Insufficient Variance provides a simple explanation for the replication failure: the original results were rather unlikely to occur in the first place. Converting the observed t-values of the two original studies into z-scores shows very low variability, Var(z) = 0.003. The probability of observing a variance this small or smaller in a pair of studies is only p = .04. It is just not very likely for such an improbable event to repeat itself.
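
The p-value for the pair of original studies can be reproduced from the reported variance alone. This is a minimal sketch, assuming that the test compares (k - 1) × Var(z) with a chi-square distribution with k - 1 degrees of freedom and looks at the left tail. With k = 2 studies there is one degree of freedom, for which the chi-square distribution function reduces to an error function.

```python
from math import erf, sqrt

k = 2            # number of original studies
var_z = 0.003    # reported variance of the two z-scores

# Under sampling error alone, Var(z) is expected to be 1, so
# (k - 1) * Var(z) is compared with a chi-square distribution with k - 1 df.
chi_square = (k - 1) * var_z

# For 1 df, P(chi-square <= x) = erf(sqrt(x / 2)).
p_left_tail = erf(sqrt(chi_square / 2))
print(f"p = {p_left_tail:.3f}")
```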

6. Insufficient Power in Power-Posing Research

When you google “power posing” the featured link shows Amy Cuddy giving a TED talk about her research. Not unlike facial feedback, power posing assumes that bodily feedback can have powerful effects.

When you scroll down the page, you might find a link to an article by Gelman and Fung (Slate).

Gelman has been an outspoken critic of social psychology for some time.  This article is no exception. “Some of the most glamorous, popular claims in the field are nothing but tabloid fodder. The weakest work with the boldest claims often attracts the most publicity, helped by promotion from newspapers, television, websites, and best-selling books.”

They point out that a much larger study than the original study failed to replicate the original findings.

“An outside team led by Eva Ranehill attempted to replicate the original Carney, Cuddy, and Yap study using a sample population five times larger than the original group. In a paper published in 2015, the Ranehill team reported that they found no effect.”

They have little doubt that the replication study can be trusted and suggest that the original results were obtained with the help of questionable research practices.

“We know, though, that it is easy for researchers to find statistically significant comparisons even in a single, small, noisy study. Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.”

The replication study was published in 2015, so this replication failure does not really belong in a review of 2016. Indeed, the big news in 2016 was that Cuddy’s co-author Carney distanced herself from her contribution to the power posing article. Her public rejection of her own work (New Yorker Magazine) spread like wildfire through social media (Psych Methods FB Group Posts 1, 2, but see 3). Most responses were very positive. Although science is often considered a self-correcting system, individual scientists rarely correct mistakes or retract articles if they discover a mistake after publication. Carney’s statement was seen as breaking with the implicit norm of the establishment to celebrate every published article as an important discovery and to cover up mistakes even in the face of replication failures.

Not surprisingly, power posing proponent Amy Cuddy defended her claims. Her response makes many points, but there is one glaring omission. She does not mention the evidence that published results are selected to confirm theoretical claims, and she does not comment on the fact that there is no evidence for power posing after correcting for publication bias. The psychology establishment also appears to be more interested in propping up a theory that has created a lot of publicity for psychology than in critically examining the scientific evidence for or against power posing (APS Annual Meeting, 2017, Program, Presidential Symposium).

7. Commitment Priming: Another Failed Registered Replication Report

Many research questions in psychology are difficult to study experimentally. For example, it seems difficult and unethical to study the effect of infidelity on romantic relationships by assigning one group of participants to an infidelity condition and making them engage in non-marital sex. Social psychologists have developed a solution to this problem. Rather than creating real situations, researchers prime participants to think about infidelity. If these thoughts change their behavior, the results are interpreted as evidence for the effect of real infidelity.

Eli Finkel and colleagues used this approach to experimentally test the effect of commitment on forgiveness. To manipulate commitment, participants in the experimental group were given some statements that were supposed to elicit commitment-related thoughts. To make sure that this manipulation worked, participants then completed a commitment measure. In the original article, the experimental manipulation had a strong effect, d = .74, which was highly significant, t(87) = 3.43, p < .001.

Irene Cheung, Lorne Campbell, and Etienne P. LeBel spearheaded an initiative to replicate the experimental effect of commitment priming on forgiveness. Eli Finkel worked closely with the replication team to ensure that the replication study matched the original study as closely as possible. Yet, the replication studies failed to demonstrate the effectiveness of the commitment manipulation. Even with the much larger sample size, there was no significant effect and the effect size was close to zero.

The authors of the replication report were surprised by the failure of the manipulation. “It is unclear why the RRR studies observed no effect of priming on subjective commitment when the original study observed a large effect. Given the straightforward nature of the priming manipulation and the consistency of the RRR results across settings, it seems unlikely that the difference resulted from extreme context sensitivity or from cohort effects (i.e., changes in the population between 2002 and 2015).” (PPS, 2016, p. 761). The author of the original article, Eli Finkel, also has no explanation for the failure of the experimental manipulation. “Why did the manipulation that successfully influenced commitment in 2002 fail to do so in the RRR? I don’t know.” (PPS, 2016, p. 765).

However, Eli Finkel also reports that he made changes to the manipulation in subsequent studies. “The RRR used the first version of a manipulation that has been refined in subsequent work. Although I believe that the original manipulation is reasonable, I no longer use it in my own work. For example, I have become concerned that the “low commitment” prime includes some potentially commitment-enhancing elements (e.g., “What is one trait that your partner will develop as he/she grows older?”). As such, my collaborators and I have replaced the original 5-item primes with refined 3-item primes (Hui, Finkel, Fitzsimons, Kumashiro, & Hofmann, 2014). I have greater confidence in this updated manipulation than in the original 2002 manipulation. Indeed, when I first learned that the 2002 study would be the target of an RRR—and before I understood precisely how the RRR mechanism works—I had assumed that it would use this updated manipulation.” (PPS, 2016, p. 766). Surprisingly, the potential problem with the original manipulation was never brought up during the planning of the replication study (FB discussion group).

Hui et al. (2014) also do not mention any concerns about the original manipulation. They simply wrote “Adapting procedures from previous research (Finkel et al., 2002), participants in the high commitment prime condition answered three questions designed to activate thoughts regarding dependence and commitment.” (JPSP, 2014, p. 561). The results of the manipulation check closely replicated the results of the 2002 article. “The analysis of the manipulation check showed that participants in the high commitment prime condition (M = 4.62, SD = 0.34) reported a higher level of relationship commitment than participants in the low commitment prime condition (M = 4.26, SD = 0.62), t(74) = 3.11, p < .01.” (JPSP, 2014, p. 561). The study also produced a just-significant result for a predicted effect of the manipulation on support for partner’s goals that are incompatible with the relationship, beta = .23, t(73) = 2.01, p = .05. These just-significant results are rare and often fail to replicate in replication studies (OSC, Science, 2016).

Altogether, the results of yet another registered replication report raise major concerns about the robustness of priming as a reliable method to alter participants’ beliefs and attitudes. Selective reporting of studies that “worked” has created an illusion that priming is a very effective and reliable method to study social cognitions. However, even social cognition theories suggest that priming effects should be limited to specific situations and should not have strong effects on judgments that are highly relevant or when chronically accessible information is readily available.

8. Concluding Remarks

Looking back, 2016 has been a good year for the reform movement in psychology. High profile replication failures have shattered the credibility of established psychology. Attempts by the establishment to discredit critics have backfired. A major problem for the establishment is that they themselves do not know how big the crisis is and which findings are solid. Consequently, there has been no major initiative by the establishment to mount replication projects that provide positive evidence for some important discoveries in psychology. Looking forward to 2017, I anticipate no major changes. Several registered replication studies are in the works, and prediction markets anticipate further failures. For example, a registered replication report of “professor priming” studies is predicted to produce a null-result.

If you are still looking for a New Year’s resolution, you may consider signing on to Brent W. Roberts, Rolf A. Zwaan, and Lorne Campbell’s initiative to improve research practices. You may also want to become a member of the Psychological Methods Discussion Group, where you can find out in real time about major events in the world of psychological science.

Have a wonderful new year.

Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Jerry Brunner and I developed two methods to estimate replicability of published results based on test statistics in original studies. One method, z-curve, is used to provide replicability estimates in my powergraphs.

In September, we submitted a manuscript that describes these methods to Psychological Methods, where it was rejected.

We have now revised the manuscript. The new manuscript contains a detailed discussion of various criteria for replicability with arguments why a significant result in an exact replication study is an important, if not the only, criterion to evaluate the outcome of replication studies.

It also makes a clear distinction between selection for significance in an original study and the file drawer problem in a series of conceptual or exact replication studies. Our methods only assume selection for significance in original studies, but no file drawer or questionable research practices. This idealistic assumption may explain why our model predicts a much higher success rate in the OSC reproducibility project (66%) than was actually obtained (36%). As there is ample evidence for file-drawers with non-significant conceptual replication studies, we believe that file-drawers and QRPs contribute to the low success rate in the OSC project. However, we also mention concerns about the quality of some replication studies.

We hope that the revised version is clearer, but fundamentally nothing has changed. Reviewers at Psychological Methods didn’t like our paper, and the editor thought NHST is no longer relevant (see editorial letter and reviews), but nobody challenged our statistical method or the results of our simulation studies that validate the method. It works and it provides an estimate of replicability under very idealistic conditions, which means we can only expect a considerably lower success rate in actual replication studies as long as researchers file-drawer non-significant results.

How did Diederik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Diederik Stapel represents everything that has gone wrong in social psychology. Until 2011, he was seen as a successful scientist who made important contributions to the literature on social priming. In the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” he presented 8 studies that showed that social comparisons can occur in response to stimuli that were presented without awareness (subliminally). The results were published in the top journal of social psychology published by the American Psychological Association (APA), and APA published a press release for the general public about this work.

In 2011, an investigation into Diederik Stapel’s research practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice). In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated. All we know is that he used a number of different methods that range from making up datasets to the use of questionable research practices that increase the chance of producing a significant result. These questionable practices are widely used and are not considered scientific fraud, although the end result is the same. Published results no longer provide credible empirical evidence for the claims made in a published article.

I had two hypotheses about how the results in this article were produced. First, the data could be entirely made up. When researchers make up fake data, they are likely to overestimate the real effect sizes and produce data that show the predicted pattern much more clearly than real data would. In this case, bias tests would not show a problem with the data. The only evidence that the data are fake would be that the evidence is stronger than in other studies that relied on real data.

In contrast, a researcher who starts with real data and then uses questionable practices is likely to use as few dishonest practices as possible because this makes it easier to justify the questionable decisions. For example, removing 10% of data may seem justified, especially if some rationale for exclusion can be found. However, removing 60% of data cannot be justified. The researcher will need to use these practices to produce the desired outcome, namely a p-value below .05 (or at least very close to .05). As more use of questionable practices is not needed and harder to justify, the researcher will stop before producing stronger evidence. As a result, we would expect a large number of just-significant results.
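
Why minimal use of questionable practices piles up results just below .05 can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a reconstruction of what was done in the article: the true effect is zero, the questionable practice is optional stopping (testing after every batch of participants and stopping at the first p < .05), and the batch sizes, the z-test with known standard deviation, and the function name are all made up for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def optional_stopping_p(rng, start_n=20, step=10, max_n=60):
    """Return the final two-sided p-value of one simulated null study.

    The true effect is zero. The hypothetical researcher tests after
    every batch of participants and stops as soon as p < .05.
    """
    scores = [rng.gauss(0.0, 1.0) for _ in range(start_n)]
    while True:
        z = sum(scores) / sqrt(len(scores))  # z-test with known SD = 1
        p = 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))
        if p < 0.05 or len(scores) >= max_n:
            return p
        scores.extend(rng.gauss(0.0, 1.0) for _ in range(step))

rng = random.Random(2016)  # fixed seed so the run is repeatable
p_values = [optional_stopping_p(rng) for _ in range(4000)]
significant = [p for p in p_values if p < 0.05]

false_positive_rate = len(significant) / len(p_values)
just_significant = sum(p > 0.01 for p in significant) / len(significant)
print(f"Significant results without any true effect: {false_positive_rate:.1%}")
print(f"Share of those with .01 < p < .05: {just_significant:.1%}")
```

In this setup the rate of significant results should come out well above the nominal 5%, and most of the significant p-values should fall between .01 and .05, because data collection ends as soon as the threshold is crossed.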
There are two bias tests that detect the latter form of fabricating significant results by means of questionable statistical methods: the Replicability-Index (R-Index) and the Test of Insufficient Variance (TIVA). If Stapel used questionable statistical practices to produce just-significant results, R-Index and TIVA would show evidence of bias.
The article reported 8 studies. The table shows the key finding of each study (OP = observed power).

Study   Statistic       p       z      OP
1       F(1,28)=4.47    0.044   2.02   0.52
2A      F(1,38)=4.51    0.040   2.05   0.54
2B      F(1,32)=4.20    0.049   1.97   0.50
2C      F(1,38)=4.13    0.049   1.97   0.50
3       F(1,42)=4.46    0.041   2.05   0.53
4       F(2,49)=3.61    0.034   2.11   0.56
5       F(1,29)=7.04    0.013   2.49   0.70
6       F(1,55)=3.90    0.053   1.93   0.49

All results were interpreted as evidence for an effect and the p-value for Study 6 was reported as p = .05.
All p-values are no larger than .053 and greater than .01. This is an unlikely outcome because sampling error should produce more variability in p-values. TIVA examines whether there is insufficient variability. First, p-values are converted into z-scores. The variance of z-scores due to sampling error alone is expected to be approximately 1. However, the observed variance is only Var(z) = 0.032. A chi-square test shows that this observed variance is unlikely to occur by chance alone, p = .000035. We would expect such an extremely small variability or even less variability in only 1 out of 28,458 sets of studies by chance alone.
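
The TIVA steps described above can be sketched with the standard library. The sketch assumes that the test compares (k - 1) × Var(z) with a chi-square distribution with k - 1 degrees of freedom and reports the left tail. The helper chi_square_cdf is written here for illustration and uses a series expansion of the incomplete gamma function. Because the z-scores are taken from the table in rounded form, the result should land close to, but not exactly on, 1 in 28,458.

```python
from math import exp, lgamma, log
from statistics import variance

def chi_square_cdf(x: float, df: int) -> float:
    """P(X <= x) for a chi-square variable, computed with the series
    expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 0.0
    a, half_x = df / 2, x / 2
    term = total = 1.0 / a
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return total * exp(a * log(half_x) - half_x - lgamma(a))

# z-scores of the eight key tests, copied from the table
z_scores = [2.02, 2.05, 1.97, 1.97, 2.05, 2.11, 2.49, 1.93]

var_z = variance(z_scores)                 # sample variance of the z-scores
chi_square = (len(z_scores) - 1) * var_z   # expected variance is 1
p_tiva = chi_square_cdf(chi_square, df=len(z_scores) - 1)
print(f"Var(z) = {var_z:.3f}, p = {p_tiva:.6f}, about 1 in {1 / p_tiva:,.0f}")
```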
 
The last column transforms z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size. These estimates are influenced by sampling error. To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the median. The median is 53%. It is unlikely that a set of 8 studies with a 53% chance of obtaining a significant result produced significant results in all studies. This finding shows that the reported success rate is not credible. To make matters worse, the estimated probability of obtaining a significant result is itself inflated when a set of studies contains too many significant results. To correct for this bias, the R-Index computes the inflation rate. With a 53% probability of success and a 100% success rate, the inflation rate is 47%. To correct for inflation, the inflation rate is subtracted from the median observed power, which yields an R-Index of 53% – 47% = 6%. Based on this value, it is extremely unlikely that a researcher would obtain a significant result if they actually replicated the original studies exactly. The published results show that Stapel could not have produced these results without the help of questionable methods, which also means nobody else can reproduce these results.
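
The R-Index arithmetic can be retraced in a few lines. This is a sketch that assumes a two-sided test with alpha = .05 and uses the rounded z-scores from the table. The success rate is set to 100% because all eight results were reported as evidence for an effect, even though Study 6 falls just short of the criterion.

```python
from statistics import NormalDist, median

NORMAL = NormalDist()
z_crit = NORMAL.inv_cdf(0.975)  # about 1.96 for alpha = .05, two-sided

def observed_power(z: float) -> float:
    """Probability of a significant result if the true effect equals the observed one."""
    return (1 - NORMAL.cdf(z_crit - z)) + NORMAL.cdf(-z_crit - z)

z_scores = [2.02, 2.05, 1.97, 1.97, 2.05, 2.11, 2.49, 1.93]

median_power = median(observed_power(z) for z in z_scores)
success_rate = 1.0                       # all 8 studies were reported as successes
inflation = success_rate - median_power
r_index = median_power - inflation

# Chance that 8 independent studies with this power are all significant
chance_all_significant = median_power ** len(z_scores)

print(f"Median observed power: {median_power:.2f}")
print(f"Inflation: {inflation:.2f}, R-Index: {r_index:.2f}")
print(f"P(8 out of 8 significant): {chance_all_significant:.4f}")
```

The last line puts a number on the implausibility of the reported success rate: with power near 53%, eight significant results in a row would be expected in fewer than 1 in 100 such sets of studies.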
In conclusion, bias tests suggest that Stapel actually collected data and failed to find supporting evidence for his hypotheses.  He then used questionable practices until the results were statistically significant.  It seems unlikely that he outright faked these data and intentionally produced a p-value of .053 and reported it as p = .05.  However, statistical analysis can only provide suggestive evidence and only Stapel knows what he did to get these results.

A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

“Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology”
Journal of Experimental Social Psychology 66 (2016) 148–152
DOI: http://dx.doi.org/10.1016/j.jesp.2016.01.005

a.k.a The Swan Song of Social Psychology During the Golden Age

Disclaimer: I wrote this piece because Jamie Pennebaker recommended writing as therapy to deal with trauma. However, in his defense, he didn’t propose publishing the therapeutic writings.

————————————————————————-

You might think an article with reproducibility in the title would have something to say about the replicability crisis in social psychology. However, this article has very little to say about the causes of the replication crisis in social psychology and possible solutions to improve replicability. Instead, it appears to be a perfect example of repressive coping to avoid the traumatic realization that decades of work were fun, yet futile.

1. Introduction

The authors start with a very sensible suggestion. “We propose that the goal of achieving sound scientific insights and useful applications will be better facilitated over the long run by promoting good scientific practice rather than by stressing the need to prevent any and all mistakes.” (p. 149). The only questions are how many mistakes we consider tolerable and what the actual error rates are, which we do not know. Rosenthal pointed out it could be 100%, which even the authors might consider to be a little bit too high.

2. Improving research practice

In this chapter, the authors suggest that “if there is anything on which all researchers might agree, it is the call for improving our research practices and techniques.” (p. 149).  If this were the case, we wouldn’t see articles in 2016 that make statistical mistakes that have been known for decades like pooling data from a heterogeneous set of studies or computing difference scores and using one of the variables as a predictor of the difference score.

It is also puzzling to read “the contemporary literature indicates just how central methodological innovation has been to advancing the field” (p. 149), when the key problem of low power has been known since 1962 and there is still no sign of improvement.

The authors also are not exactly in favor of adopting better methods when these methods might reveal major problems in older studies. For example, a meta-analysis in 2010 might not have examined publication bias and produced an effect size of more than half a standard deviation, when a new method that controls for publication bias finds that it is impossible to reject the null-hypothesis. No, these new methods are not welcome. “In our view, they will stifle progress and innovation if they are seen primarily through the lens of maladaptive perfectionism; namely as ways of rectifying flaws and shortcomings in prior work.” (p. 149). So, what is the solution? Let’s pretend that subliminal priming made people walk slower in 1996, but stopped working in 2011?

This ends the chapter of improving research practice.  Yes, that is the way to deal with a crisis.  When the city is bankrupt, cut back on the Christmas decorations. Problem solved.

3. How to think about replications

Let’s start with a trivial statement that is as meaningless as saying that we would welcome more funding. “Replications are valuable.” (p. 149). Let’s also not mention that social psychologists have been the leaders in requesting replication studies. No single-study article shall be published in a social psychology journal. A minimum of three studies with conceptual replications of the key finding are needed to show that the results are robust and always produce significant results with p < .05 (or at least p < .10). Yes, no other science has cherished replications as much as social psychology.

And eminent social psychologists Crandall and Sherman explain why. “to be a cumulative and self-correcting enterprise, replications, be their results supportive, qualifying, or contradictory, must occur.” Indeed, but what explains the 95% success rate of published replications in social psychology? No need for self-correction if the predictions are always confirmed.

Surprisingly, however, since 2011 a number of replication studies have been published in obscure journals that fail to replicate results. This has never happened before and raises some concerns. What is going on here? Why can these researchers not replicate the original results? The answer is clear. They are doing it wrong. “We concur with several authors (Crandall and Sherman, Stroebe) that conceptual replications offer the greatest potential to our field…  Much of the current debate, however, is focused narrowly on direct or exact replications.” (p. 149). As philosophers know, you cannot step into the same river twice and so you cannot replicate the same study again. To get a significant result, you need to do a similar, but not an identical replication study.

Another problem with failed replication studies is that these researchers assume that they are doing an exact replication study, but do not test this assumption. “In this light, Fabrigar’s insistence that researchers take more care to demonstrate psychometric invariance is well-placed” (p. 149). Once more, the superiority of conceptual replication studies is self-evident. When you do a conceptual replication study, psychometric invariance is guaranteed and does not have to be demonstrated. Just one more reason why conceptual replication studies in social psychology journals produce a 95% success rate, whereas misguided exact replication attempts have failure rates of over 50%.

It is also important to consider the expertise of researchers.  Social psychologists often have demonstrated their expertise by publishing dozens of successful, conceptual replications.  In contrast, failed replications are often produced by novices with no track-record of ever producing a successful study.  These vast differences in previous success rate need to be taken into account in the evaluation of replication studies.  “Errors caused by low expertise or inadvertent changes are often catastrophic, in the sense of causing a study to fail completely, as Stroebe nicely illustrates.”

It would be a shame if psychology started rewarding these replication studies. Already limited research funds would be diverted to studies that are easy to do, yet too difficult for inexperienced researchers to do correctly, and away from senior researchers who do difficult novel studies that always work and produced groundbreaking new insights into social phenomena during the “golden age” (p. 150) of social psychology.

The authors also point out that failed studies are rarely failed studies. When these studies are properly combined with successful studies in a meta-analysis, the results nearly always show the predicted effect, which means it was wrong to doubt original studies simply because replication studies failed to show the effect. “Deeper consideration of the terms “failed” and “underpowered” may reveal just how limited the field is by dichotomous thinking. “Failed” implies that a result at p = .06 is somehow inferior to one at p = .05, a conclusion that scarcely merits disputation.” (p. 150).

In conclusion, we learn nothing from replication studies. They are a waste of time and resources and can only impede further development of social psychology by means of conceptual replication studies that build on the foundations laid during the “golden age” of social psychology.

4. Differential demands of different research topics

Some studies are easier to replicate than others, and replication failures might be “limited to studies that presented methodological challenges (i.e., that had protocols that were considered difficult to carry out) and that provided opportunities for experimenter bias” (p. 150). It is therefore better not to replicate difficult studies, or to let original authors with a track-record of success conduct conceptual replication studies.

Moreover, some people have argued that the high success rate of original studies is inflated by publication bias (not writing up failed studies) and the use of questionable research practices (run more participants until p < .05). To ensure that reported successes are real successes, some initiatives call for data sharing, pre-registration of data analysis plans, and a priori power analysis. Although these may appear to be reasonable suggestions, the authors disagree. “We worry that reifying any of the various proposals as a “best practice” for research integrity may marginalize researchers and research areas that study phenomena or use methods that have a harder time meeting these requirements.” (p. 150).

They appear to be concerned that researchers who do not preregister data analysis plans or do not share data may be stigmatized. “If not, such principles, no matter how well-intentioned, invite the possibility of discrimination, not only within the field but also by decision-makers who are not privy to these realities.” (p. 150).

5. Considering broader implications

These are confusing times. In the old days, the goal of research was clearly defined. Conduct at least three loosely related, successful studies and write them up with a good story. During these times, it was not acceptable to publish failed studies because the 95% success rate had to be maintained. This made it hard for researchers who did not understand the rules of publishing only significant results. “Recently, a colleague of ours relayed his frustrating experience of submitting a manuscript that included one null-result study among several studies with statistically significant findings. He was met with rejection after rejection, all the while being told that the null finding weakened the results or confused the manuscript” (p. 151).

It is not clear what researchers should be doing now. Should they now report all of their studies, the good, the bad, and the ugly, or should they continue to present only the successful studies? What if some researchers continue to publish the good old fashioned way that evolved during the golden age of social psychology and others try to publish results more in accordance with what actually happened in their lab? “There is currently, a disconnect between what is good for scientists and what is good for science” and nobody is going to change while researchers who report only significant results get rewarded with publications in top journals.

There may also be little need to make major changes. “We agree with Crandall and Sherman, and also Stroebe, that social psychology is, like all sciences, a self-correcting enterprise” (p. 151). And if social psychology is already self-correcting, it does not need new guidelines on how to do research or new replication studies. Rather than instituting new policies, it might be better to make social psychology great again. Rather than publishing means and standard deviations or test statistics that allow data detectives to check results, it might be better to report only whether a result was significant, p < .05, and because 95% of studies are significant and the others are failed studies, we might simply not report any numbers. False results will be corrected eventually because they will no longer be reported in journals, and the old results might have been true even if they fail to replicate today. The best approach is to fund researchers with a good track record of success and let them publish in the top journals.

Most likely, the replication crisis only exists in the imagination of overly self-critical psychologists. “Social psychologists are often reputed to be among the most severe critics of work within their own discipline” (p. 151).  A healthier attitude is to realize that “we already know a lot; with these practices, we can learn even more” (p. 151).

So, let’s get back to doing research and forget this whole thing that was briefly mentioned in the title called “concerns about reproducibility.” Who cares that only 25% of social psychology studies from 2008 could be replicated in 2014? In the meantime, thousands of new discoveries were made and it is time to make more new discoveries. “We should not get so caught up in perfectionistic concerns that they impede the rapid accumulation and dissemination of research findings” (p. 151).

There you have it, folks. Don’t worry about recent failed replications. This is just a normal part of science, especially a science that studies fragile, contextually sensitive phenomena. The results from 2008 do not necessarily replicate in 2014 and the results from 2014 may not replicate in 2018. What we need is fewer replications. We need permanent research because many effects may disappear the moment they are discovered. This is what makes social psychology so exciting. If you want to study stable phenomena that replicate decade after decade, you might as well become a personality psychologist.