
Response to Shiffrin’s Meta-Opinion with Meta-Science

The main problem with claims about the status of science ranging from “most published results are false” (Ioannidis, 2005) to “science is doing very well” (Shiffrin) is that they are not based on empirical facts. Everybody has an opinion, but opinions are cheap.

Shiffrin is a cognitive psychologist, and it would be strange for him to comment on other fields, such as social psychology, that to the best of my knowledge he does not follow.

The main empirical evidence about cognitive psychology is the replication rate in the Open Science Collaboration project. In this project, about 30 cognitive studies were replicated and 50% produced a significant result in the replication attempt. I think it is fair to call this a glass half full or half empty. It does not support claims that most published results in cognitive psychology are false positives, nor does it justify the claim that cognitive psychology is doing very well. We can simply ignore these unfounded proclamations.

Another piece of evidence comes from coding published results in the Journal of Experimental Psychology: Learning, Memory, and Cognition. The project is ongoing, but so far a representative sample of tests from 2008, 2010, 2014, and 2017, as well as the most cited articles, has been coded. The results have been analyzed with z-curve.19.1 (see a detailed explanation of z-curve output here).

Consistent with the results from actual replication attempts, z-curve predicts that 57% (95% CI 40-63%) of published results would produce a significant result if the studies were replicated exactly with the same sample sizes.

Z-curve also estimates the maximum false discovery rate (Soric, 1989). The estimate is that up to 9% of all significant results could be false positive results, if the null-hypothesis is defined as no effect or an effect in the opposite direction. Using a more liberal criterion that also includes studies with very low power (< 17%) in the specification of the null-hypothesis, the false positive risk increases to 30%.

These results suggest that cognitive psychology is not in a crisis and publishes mostly replicable results. Thus, reality is closer to Shiffrin's opinion than to Ioannidis's, but I would not characterize the state as very good. The z-curve graph shows clear evidence of publication bias, and the coding of articles reveals few of the replication studies with negative results that are needed to weed out false positives.

Given the common format of multiple-study articles and low statistical power, we would expect non-significant results regularly (Schimmack, 2012). Honest reporting of these results is crucial for the credibility of an empirical science (Sterling, 1959; Soric, 1989). It is time for cognitive psychologists to increase power and to report their findings honestly.


Z-Curve ShinyApp

You can now do your own z-curve analysis with this shinyApp

https://zcurve.shinyapps.io/zcurve19/

Interpretation of a Z-Curve Output

The basic principles of z-curve were outlined in Brunner and Schimmack (2018). This post explains the latest version of z-curve plots and the statistics obtained from it. The data used for this article stem from a project that codes a representative sample of focal hypothesis tests in the Journal of Experimental Psychology: Learning, Memory, and Cognition.

The Range information shows the full range of observed z-scores. Here z-scores range from 0.41 to 10. A value of 10 is the maximum because all larger z-scores were recoded to 10. Only z-scores less than 6 are shown because z-curve treats all studies with z-scores greater than 6 as having 100% power, and no estimation of power is needed.

There are 302 focal tests in the dataset and 273 are significant. The vertical, solid red line at z = 1.96 divides non-significant results on the left and significant results on the right side with alpha = .05 (two-tailed). The dotted red line at 1.65 is the boundary for marginally significant results, alpha = .10 (two-tailed). The green line at 2.8 implies 80% power with alpha = .05. If studies have an average power of 80%, the mode of the distribution should be here.

The main part of the figure is a histogram of the test statistics (F-values, t-values) converted into p-values and then converted into z-scores; z = qnorm(1-p/2). The solid blue line shows the density distribution for the significant z-scores with a default bandwidth of .05. The grey line shows the fit of the predicted density distribution based on the z-curve model. The grey line is extended into the range of non-significant results, which provides an estimate of the file-drawer of non-significant results that were not reported.
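For readers who want to check this conversion, here is a minimal R sketch; the t-value, F-value, and degrees of freedom are hypothetical and only illustrate the transformation used for the plot.

# Convert two-tailed p-values from t-tests or F-tests into absolute z-scores.
# The test statistics below are made up for illustration.
t.value <- 2.50; df.t <- 40
p.t <- 2 * pt(abs(t.value), df = df.t, lower.tail = FALSE)   # two-tailed p-value
F.value <- 6.25; df1 <- 1; df2 <- 40
p.F <- pf(F.value, df1, df2, lower.tail = FALSE)             # p-value of the F-test
z.t <- qnorm(1 - p.t / 2)                                    # z-score used by z-curve
z.F <- qnorm(1 - p.F / 2)
round(c(z.t = z.t, z.F = z.F), 2)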

The observed discovery rate (ODR) is the proportion of significant results among the k = 302 reported tests. A 95% confidence interval is given to provide information about the accuracy of this estimate. In this example the observed discovery rate is 90%, which is typical for psychology (Sterling, 1959; Sterling et al., 1995).

For all other results, a 95% confidence interval is obtained using bootstrapping with a default of 500 iterations.

The estimated discovery rate (EDR) is the proportion of significant results that would be expected once the estimated file-drawer of unreported non-significant results is taken into account. The estimated discovery rate is only 38%.
A comparison of these two rates provides information about the amount of publication bias (Schimmack, 2012). As the observed discovery rate is much higher than the estimated discovery rate, we can conclude that JEP-LMC selectively publishes significant results. This is consistent with the visual inspection of the file-drawer in the plot.

The file-drawer ratio is a simple conversion of the estimated discovery rate into a ratio of the size of the file-drawer to the proportion of significant results. It estimates how many non-significant results were obtained for every significant result; file.drawer.ratio = (1-EDR)/EDR. In this example, the ratio is 1.63:1, meaning there are 1.63 non-significant results for every significant result.
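For completeness, the conversion from the estimated discovery rate to the file-drawer ratio takes one line of R; the EDR value is the one reported above.

# File-drawer ratio: estimated non-significant results per significant result.
EDR <- 0.38                        # estimated discovery rate reported above
file.drawer.ratio <- (1 - EDR) / EDR
round(file.drawer.ratio, 2)        # approximately 1.63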

The latest addition to z-curve is Soric's false discovery risk (FDR). Soric showed that it is possible to compute the maximum false discovery rate based on the assumption that all true discoveries were obtained with 100% power. If average power is less, the actual false discovery rate is lower than the stated false discovery risk; false discovery rate <= false discovery risk. Using Soric's formula, FDR = (1/EDR – 1)*(.05/.95), yields a false discovery risk of 9%. This means that no more than 9% of the significant focal tests (discoveries) in JEP-LMC are false positives.
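A minimal R sketch of Soric's formula, using the estimated discovery rate from above, reproduces this value.

# Soric's maximum false discovery risk, assuming 100% power for true discoveries.
soric.fdr <- function(edr, alpha = .05) (1 / edr - 1) * (alpha / (1 - alpha))
round(soric.fdr(0.38), 2)          # approximately .09, i.e., at most 9% false positives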

Soric's FDR defines a false discovery as a significant result with a population effect size of zero (i.e., the nil-hypothesis; Cohen, 1994). As a result, even studies with extremely small effect sizes and low power that are difficult to replicate are treated as true positives. Z-curve addresses this problem by computing an alternative false discovery risk. Z-curve is fitted with fixed proportions of false positives, and fit is compared to the baseline model with no restrictions on the percentage of false positives. Once model fit deviates notably from the baseline model, the model specifies too many false positives. The model with the highest proportion of false positives that still has acceptable fit is used to estimate the maximum false discovery risk. The obtained value depends on the specification of other possible power values. In the default model, the lowest power for true positive results is 17%, which corresponds to a non-central z-score of 1. Lowering this value would decrease the FDR and in the limit reach Soric's FDR. Increasing this value would increase the FDR. The Z0-FDR estimate is considerably higher than Soric's FDR, indicating that several studies with positive results are studies with very small effects. The 95% CI shows that up to 55% of published results could be false positives when very small effects are considered false positives. The drawback of this approach is that there is no clear definition of the effect sizes that are considered false positives.

The last, yet most important, estimate is the replication rate. The replication rate is the mean power of the published results that reached significance. As power predicts the long-run proportion of significant results, mean power is an estimate of the replication rate if the set of studies were replicated exactly with the same sample sizes. Increasing sample sizes would increase the replication rate, while lowering sample sizes would decrease it. With the same sample sizes as in the original studies, articles in JEP-LMC are expected to have a replication rate of 57%. The 95% CI shows that this estimate is consistent with the observed replication rate of 50% for cognitive studies in the Open Science Collaboration project (OSC, 2015). This result validates the replication rate estimate of z-curve against outcomes of actual replication studies.

References

Brunner, J. & Schimmack, U. (2018). Estimating population mean power under conditions of heterogeneity and selection for significance. Meta-Psychology.

Sorić, B. (1989). Statistical "discoveries" and effect-size estimation. Journal of the American Statistical Association, 84(406), 608–610. https://doi.org/10.1080/01621459.1989.10478811

Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349(6251).

Ioannidis (2005) was wrong: Most published research findings are not false

Fifteen years ago, Ioannidis (2005) sounded the alarm bells about the quality of published research findings. To be clear, I fully agree that research practices are lax and published results often have low credibility; but are they false?

To claim that most published results are false requires a definition of a true or false result. Ioannidis’s definition of a false result is clear and consistent with the logic of null-hypothesis testing. Accordingly, a research finding is false if it leads to the conclusion that there is an effect in the population without an actual effect in the population. This is typically called a false positive (FP) or a type-I error.

Table 1. Truth of Hypothesis by Outcome of Significance Test

        NS      SIG     Sum
TRUE    P(FN)   P(TP)   P(True)
FALSE   P(TN)   P(FP)   P(False)
Sum     P(NS)   P(SIG)  k

Table 1 shows the proportion of false positive results. P(FP) is the probability that a hypothesis is false and a significant result was obtained. P(FP) is controlled by the significance criterion. With the standard criterion of alpha = .05 and P(False) = 100, P(FP) = 5 and P(TN) = 95. That is, the significance test leads to the wrong conclusion that an effect exists in 5% of all tests of a false hypothesis.

In the top row, the proportion of true positive results depends on the statistical power of the tests. With 100% power and if all hypotheses are true, P(TP) would be 100%.

The quantity of interest is the proportion of false positives, P(FP), among all significant results, P(SIG) = P(FP) + P(TP). I call this quantity the false discovery rate: P(FP) / P(SIG).

This quantity depends on the proportion of true hypotheses, P(True), and false hypotheses, P(False). If all hypotheses that are being tested are false, the false discovery rate is 100%; 5 / (5 + 0) = 1. If all hypotheses that are being tested are true, the false discovery rate is 0, independent of the power of the test; 0 / (P(TP) + 0) = 0.

We can now state Ioannidis's claim that most published research findings are false as the prediction that the false discovery rate is greater than 50%. This prediction is implied when Ioannidis writes "most research findings are false for most research designs and for most fields" because "in the described framework, a PPV exceeding 50% is quite difficult to get" (p. 699), where PPV stands for Positive Predictive Value, which is defined as the proportion of true positives among significant results; PPV = P(TP) / P(SIG). Thus, Ioannidis clearly claims that most fields have a false discovery rate greater than 50% or a true discovery rate of less than 50%.
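The relationship between these quantities can be made concrete with a short R sketch that fills in Table 1; the power and the proportion of true hypotheses are hypothetical inputs chosen for illustration.

# False discovery rate (FDR) and positive predictive value (PPV) from Table 1.
alpha   <- .05
power   <- .80     # hypothetical average power of tests of true hypotheses
p.true  <- .50     # hypothetical proportion of true hypotheses being tested
p.false <- 1 - p.true
p.tp  <- power * p.true        # true positives
p.fp  <- alpha * p.false       # false positives
p.sig <- p.tp + p.fp           # all significant results
fdr <- p.fp / p.sig            # false discovery rate
ppv <- p.tp / p.sig            # positive predictive value = 1 - FDR
round(c(FDR = fdr, PPV = ppv), 3)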

False Discovery Rate versus False Discovery Risk

The false discovery rate has also been called the false positive report probability (Wacholder et al., 2004). Table 1 makes it clear that the false discovery rate depends on the proportion of true and false hypotheses that are being tested. It is well known that it is impossible to provide conclusive evidence for the null-hypothesis that there is absolutely no effect (Cohen, 1994; Tukey, 1991). Thus, it is impossible to count the occurrences of true or false effects, and it is impossible to determine the false discovery rate. Calculations of the false discovery rate therefore require assumptions about the proportion of true and false hypotheses (Wacholder et al., 2004; Ioannidis, 2005).

In contrast, the false discovery risk (FDR) is the maximum proportion of significant results that can be false positives for an observed proportion of significant versus non-significant results. Thus, the false discovery risk can be calculated without assumptions. This statistical fact is demonstrated with a few examples and then derived from a formula that relates the FDR to the discovery rate, P(SIG)/k.

Table 2 shows a scenario where all tested hypotheses are false. In this case, all of the significant results are false positives. Table 2 also shows that it is easy to identify this scenario because the proportion of significant results, P(SIG)/k, matches the significance criterion.

Table 2. Truth of Hypothesis by Outcome of Significance Test

        NS    SIG    Sum
TRUE    0     0      0
FALSE   95    5      100
Sum     95    5      100

Thus, if the relative frequency of significant results matches the significance criterion (alpha), the false discovery risk is 100%. It is a risk rather than a rate because, in principle, all of the significant results could be true positives obtained in studies with extremely low power (Table 3). Even with 100 true hypotheses, the success rate would be indistinguishable from alpha.

Table 3. Truth of Hypothesis by Outcome of Significance Test

        NS       SIG     Sum
TRUE    94.99    5.01    100
FALSE   0        0       0
Sum     94.99    5.01    100

More interesting are scenarios when the percentage of significant results exceeds alpha. For this event to occur, at least some of the significant results must have tested a true hypothesis. The greater the number of significant results, the more true hypotheses must have been tested.

To determine the false discovery risk, we assume that all non-significant results stem from tests of false hypotheses and that tests of true hypotheses have 100% power. This scenario maximizes the false discovery rate because it maximizes the number of false positives for a fixed proportion of significant results.

Table 4. Truth of Hypothesis by Outcome of Significance Test

        NS                            SIG                                 Sum
TRUE    P(FN) = 0                     P(TP) = P(True)                     P(True)
FALSE   P(TN) = (1-alpha)*P(False)    P(FP) = alpha*P(False)              P(False)
Sum     P(NS) = (1-alpha)*P(False)    P(SIG) = P(True) + alpha*P(False)   1

For example, if k = 100 and 40% of all hypotheses are true, P(True) = 40 and P(False) = 100-40 = 60. Given the assumption of 100% power, there are 40 true positives and zero false negatives. With 60 false hypotheses there are 60*.05 = 3 false positives and 57 true negatives. With 40 true positives and 3 false positives, the false discovery risk is 3/(40+3) = .07.
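The same numbers can be checked with a few lines of R.

# Worked example: k = 100 tests, 40 true hypotheses, 100% power, alpha = .05.
k <- 100; n.true <- 40; n.false <- k - n.true
tp <- n.true * 1.00            # 100% power: every true hypothesis yields a significant result
fp <- n.false * .05            # 5% of tests of false hypotheses are significant
round(fp / (tp + fp), 2)       # false discovery risk: 3 / 43 = .07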

Importantly, the relationship is deterministic and it is possible to calculate the false discovery risk (FDR) from the observed discovery rate, P(Sig)/k, as shown in the following derivation based on the cells in Table 4.

P(Sig) = P(True) + alpha * P(False)

as P(False) = 1 – P(True), we can rewrite

P(Sig) = P(True) + alpha*(1-P(True))

and solve for P(True)

P(Sig) = P(True) + alpha – alpha*P(True)

P(Sig) = P(True)*(1 – alpha) + alpha

P(Sig) – alpha = P(True)*(1-alpha)

(P(Sig) – alpha)/(1-alpha) = P(True)

We can now substitute P(True) into the formula for the FDR, FDR = P(FP)/P(Sig) = alpha*(1 – P(True))/P(Sig). Simplified, the formula reduces to

FDR = (1/P(Sig) – 1)*(.05/.95), with alpha = .05

Figure 1 plots the FDR as a function of the discovery rate, DR = P(Sig)/k.

Figure 1 shows that the false discovery risk is above 50% when the discovery rate is below 9.5%.  Thus, Ioannidis’s claim that most published results are false implies that there are no more than 9.5 significant results for every 100 attempts. 
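Figure 1 can be reproduced with a few lines of R, and the 9.5% threshold follows directly from the formula.

# False discovery risk (FDR) as a function of the discovery rate (DR), alpha = .05.
fdr.risk <- function(dr, alpha = .05) (1 / dr - 1) * (alpha / (1 - alpha))
dr <- seq(.05, 1, by = .01)
plot(dr, fdr.risk(dr), type = "l",
     xlab = "Discovery Rate", ylab = "False Discovery Risk")
abline(h = .50, lty = 2)                       # 50% false discovery risk
alpha <- .05
round(1 / (.5 * (1 - alpha) / alpha + 1), 3)   # DR at which the FDR reaches 50%: ~.095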

The novel contribution of shifting from rates to risk is clear when Ioannidis writes that "it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs" (p. 701). This makes calculations of rates dependent on assumptions that are difficult to verify. However, the false discovery risk depends only on the assumption that power cannot exceed 1, which is true by definition. It is therefore possible to assess false discovery risks without speculating about proportions of true hypotheses. This makes it possible to test Ioannidis's claim empirically by computing the false discovery risk based on observable discovery rates.

Ioannidis Scenario 1

Ioannidis presented a few hypothetical scenarios with a false discovery rate greater than 50%. The first scenario was called "meta-analysis of small inconclusive studies" and assumed 80% power, alpha = .05, 25% true hypotheses vs. 75% false hypotheses, and a bias component of .4. The bias component essentially changes alpha and beta from their nominal levels by means of questionable research practices (John et al., 2012). This is evident when we fill out the 2 x 2 table for this scenario.

Table 5. Scenario “Meta-analysis of small inconclusive studies”

        NS    SIG    Sum
TRUE    3     22     25
FALSE   43    32     75
Sum     46    54     100

The false discovery rate for this scenario is 32/54 = 59%, which is above 50%. Due to bias, there are 32 false positives for 75 false hypotheses, which implies an effective type-I error rate of 32/75 = 43%, whereas unbiased research would have produced only 75*.05 = 3.75 false positives. To make this a plausible scenario, it is important to understand how researchers can inflate the number of false positive results from 3.75 to 32.

One way to increase the percentage of false positives is to test the same hypothesis repeatedly and to publish only significant results. This bias is called publication bias.

“What is less well appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true” (p. 697)

However, this form of bias implies that researchers are conducting more studies than Table 5 suggests. As, on average, 20 tests of a false hypothesis are needed to produce 1 false positive result, the actual number of studies would be much larger than 100. Moreover, repeated testing of true hypotheses does not alter power. The success rate is inflated by running more studies. Table 6 shows the frequencies assuming an actual alpha of 5% and power of 80% that produce a false discovery rate of 59% (a PPV of 41%).

Table 6. Counting all Tests

        NS     SIG    Sum
TRUE    0.8    3.3    4.1
FALSE   91.1   4.8    95.9
Sum     91.9   8.1    100

The false discovery rate is the same as in Table 5: 3.3/8.1 = 59%. Once all of the attempted studies are included, it is obvious that many more false hypotheses were tested than the 75% rate in the scenario implies, because false hypotheses were tested repeatedly to produce a false positive result. As a result, the discovery rate is not 54% as implied in Table 5, but only 8.1%. To verify that the numbers in Table 6 are correct, we can see that the real alpha is 4.8/95.9 = 5%, and power is 3.3/4.1 = 80%. In this scenario, the false discovery risk, 59.7%, is only slightly higher than the false discovery rate, 59.3%.
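The entries in Table 6 can be recovered without trial and error by solving for the proportion of tests of true hypotheses that yields the scenario's false discovery rate under honest testing; the sketch below assumes the nominal alpha of .05 and 80% power, as in the scenario.

# Proportion of tests of true hypotheses implied by a target false discovery rate,
# assuming honest testing with nominal alpha and the stated power.
implied.true <- function(fdr, alpha = .05, power = .80) {
  # solves fdr = alpha*(1-t) / (alpha*(1-t) + power*t) for t
  alpha * (1 - fdr) / (fdr * power + alpha * (1 - fdr))
}
t <- implied.true(fdr = .593)               # false discovery rate of the scenario (PPV = .41)
k <- 100
tp <- .80 * t * k                           # true positives
fp <- .05 * (1 - t) * k                     # false positives
round(c(true.tests = t * k, TP = tp, FP = fp, DR = tp + fp), 1)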

The other scenarios constructed by Ioannidis have higher false discovery rates, that is, a lower positive predictive value (PPV = 1 – false discovery rate). As a result, the observed discovery rates are also lower. In fact, they are close to 5%, which is close to the lower limit set by alpha. Evidently, discovery rates of 5 to 6 percent would raise red flags if they were observed.

Name                                                            PPV       DR
Underpowered, but well-performed phase I/II RCT                 23%       6%
Underpowered, poorly performed phase I/II RCT                   17%       6%
Adequately powered exploratory epidemiological study            20%       6%
Underpowered exploratory epidemiological study                  12%       5%
Discovery-oriented exploratory research with massive testing    0.001%    5%
As in previous example, but with more limited bias
(more standardized)                                             0.0015%   5%

In conclusion, if bias is modeled as repeated testing of false hypotheses, the discovery rates in Ioannidis's scenarios are very low. It seems unrealistic that researchers have sufficient resources to conduct 100 studies to produce only 6 significant results.

Other Questionable Research Practices

Ioannidis also hints at fraud as a reason for a high false discovery rate.

Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias.

However, there is little evidence that data manipulation is rampant (John et al., 2012). Ioannidis also suggests that researchers might exploit multiple dependent variables to report a significant result for at least one of them.

True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death) rather than when multifarious outcomes are devised. (p. 698)

However, there is no need to think about entries in the 2 x 2 table as independent studies. Rather, the unit of analysis is the statistical test. Conducting multiple statistical tests in a single study is a more efficient way to produce true and false significant results. Alpha = .05 implies that there will be one false positive result for every 20 statistical tests of a false hypothesis. This is true whether these tests are conducted in 20 separate studies or within a single study. The same applies to multiple independent variables. Thus, Ioannidis's scenarios still imply that researchers observe only 5 or 6 significant results for every 100 statistical tests.

It follows that Ioannidis's claim that "Most Research Findings Are False for Most Research Designs and for Most Fields" implies that the long-run discovery rate in most fields is at most 10 significant results for every 100 statistical tests. As discovery rates do not depend on unknown priors, this prediction can be tested empirically.

What is the Actual Discovery Rate?

To test Ioannidis's claim, it is necessary to know the discovery rates of scientific fields. Unfortunately, this information is harder to obtain than one might think because of the pervasive influence of publication bias (Sterling, 1959). For example, psychology journals publish 95% significant results. If we took this discovery rate at face value, it would imply that the false discovery risk in psychology is 0.3%. The problem is that the reported discovery rate in psychology is not the actual discovery rate because non-significant results are often not reported.

Recently, Brunner and Schimmack (2018) developed a statistical method, z-curve, that makes it possible to estimate the number of unpublished non-significant results based on the statistical evidence of published significant results. The figure below illustrates the method with a representative sample of focal hypothesis tests in social psychology journals (Motyl et al., 2017).

The red vertical line at z = 1.96 represents the criterion for statistical significance with alpha = .05. The histogram of z-scores on the right side of z = 1.96 shows the distribution of observed z-scores with significant results. Z-curve fits a mixture model to the observed distribution. The fitted model is used to project the distribution into the range of non-significant results. The area under the grey curve shows the estimated size of the file-drawer of unpublished non-significant results. The file-drawer ratio shows how many non-significant studies are predicted for every published significant result. The ratio of 3.29:1 suggests that the discovery rate is 1/(1+3.29) = 23%, and with a discovery rate of 23%, the false discovery risk is 17.6%. This is considerably below Ioannidis's prediction that false discovery rates are above 50% in most fields.

Other areas in psychology have higher discovery rates because they conduct studies with more power (Open Science Collaboration, 2015). The next graph shows results for a representative sample of focal tests published in the Journal of Experimental Psychology: Learning, Memory, and Cognition. As predicted, cognitive psychology has a smaller file-drawer of 1.63 non-significant results for every published significant result. This translates into an estimated discovery rate of 1/(1+1.63) = 38% and a false discovery risk of 8.6%.
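Both estimates follow from the file-drawer ratios with the same Soric formula; the short R sketch below uses the rounded ratios reported here, so the output can differ slightly from the reported 17.6% and 8.6%.

# Maximum false discovery risk implied by a file-drawer ratio
# (non-significant results per published significant result).
fdr.from.ratio <- function(ratio, alpha = .05) {
  dr <- 1 / (1 + ratio)                     # implied discovery rate
  (1 / dr - 1) * (alpha / (1 - alpha))      # maximum false discovery risk
}
round(fdr.from.ratio(3.29), 3)              # social psychology, ~.17
round(fdr.from.ratio(1.63), 3)              # cognitive psychology (JEP:LMC), ~.086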

Conclusion

This article makes several contributions to meta-science. First, it introduces the concept of false discovery risk as the maximum false discovery rate that is consistent with an observed discovery rate (i.e., the percentage of significant results). Second, it shows that the false discovery risk is determined by the discovery rate and can be estimated without making assumptions about the prior odds of hypotheses being true or false. Third, it shows that Ioannidis's claim that most published results are false translates into the prediction that discovery rates are below 10%. On the flip side, this means that fields with discovery rates over 10% do not publish more than 50% false positive results. It is also shown that most of Ioannidis's scenarios for different research fields translate into discovery rates of 5 or 6 percent. This seems an implausible scenario for most fields of research. Finally, I compute false discovery risks for social psychology and cognitive psychology using z-curve (Brunner & Schimmack, 2018), which makes it possible to estimate the percentage of unpublished non-significant results based on published significant results. The estimated false discovery risks in these two research fields are 17.6% and 8.6%, respectively. These estimates contradict Ioannidis's claim that most research fields publish more than 50% false discoveries. Thus, Ioannidis's claim that most published results are false is itself false. More important, the article shows how it is possible to estimate the false discovery risk of various fields, which makes it possible to evaluate fields in terms of their ability to produce true discoveries.

Social Psychology Textbook audiT: Prejudice Without Awareness

The concept of implicit bias has become an accepted idea among the general public (e.g., Starbucks' decision to close stores for implicit-bias training in 2018). The idea that individuals could be prejudiced without awareness originated in social psychology, when social psychologists started using cognitive paradigms to study social cognition.

Patricia Devine’s (1989) article continues to be cited as empirical evidence for the existence of unconscious prejudice.

Devine’s article also influenced the authors of the Implicit Association Test that is now widely used to measure “implicit racial bias”.

Experiment 3 was motivated by several previous demonstrations of automatic expressions of race-related stereotypes and attitudes that are consciously disavowed by the subjects who display them (Crosby, Bromley, & Saxe, 1980; Devine, 1989; Fazio et al., 1995; Gaertner & McLaughlin, 1983; Greenwald & Banaji, 1995; Wittenbrink, Judd, & Park, 1997)
(Greenwald, McGhee, & Schwartz, 1998, p. 1473).

Not surprisingly, it is also featured in social psychology textbooks.

Gilovich, Keltner, Chen, and Nisbett (2019) write that "automatic and controlled processes can result in quite different attitudes in the same person toward members of outgroups (Devine, 1989a, 1989b; Devine, Monteith, Zuwerink, & Elliot, 1991; Devine, Forscher, Austin, & Cox, 2012; Devine, Plant, Amodio, Harmon-Jones, & Vance, 2002)" (p. 15).

Myers and Twenge (2018) write “Patricia Devine and her colleagues (1989, 2012; Forscher et al., 2015) report that people low and high in prejudice sometimes have similar automatic (unintentional) prejudicial responses. The result: Unwanted (dissonant) thoughts and feelings often persist. Breaking the prejudice habit is not easy” (p. 258).

The idea that individuals can have different conscious and unconscious aspects of personality goes back to old psychoanalytic theories. However, social psychologists claim that they have scientific evidence to support this claim.

"A great many studies have shown that stimuli presented outside of awareness can prime a schema sufficiently to influence subsequent information processing (Bargh, 1996; Debner & Jacoby, 1994; Devine, 1989b; Draine & Greenwald, 1998; Ferguson, 2008; Ferguson, Bargh, & Nayak, 2005; Greenwald, Klinger, & Liu, 1989; Klinger, Burton, & Pitts, 2000; Lepore & Brown, 1997; Welsh & Ordonez, 2014)" (Gilovich et al., p. 122).

Gilovich et al. (2019) give a detailed description of Devine's study on pages 391 and 392.

Patricia Devine (1989b) examined the joint operation of these automatic and controlled processes by investigating the schism that exists for many people between their knowledge of racial stereotypes and their own beliefs and attitudes toward those same groups. More specifically, Devine sought to demonstrate that what separates prejudiced and nonprejudiced people is not their knowledge of derogatory stereotypes, but whether they resist those stereotypes. To carry out her investigation, Devine relied on the distinction between controlled processes, which we direct more consciously, and automatic processes, which we do not consciously control. The activation of stereotypes is typically an automatic process; thus, stereotypes can be triggered even if we don’t want them to be. Even a nonprejudiced person will, under the right circumstances, access an association between, say, Muslims and fanaticism, blacks and criminality, and WASPs and emotional repression, because those associations are present in our culture. Whereas a bigot will endorse or employ such stereotypes, a non-prejudiced person will employ more controlled cognitive processes to discard or suppress them – or at least try to.

To test these ideas, Devine selected groups of high- and low-prejudiced participants on the basis of their scores on the Modern Racism Scale (Devine, 1989b). To show that these two groups don’t differ in their automatic processing of stereotypical information – that is, that the same stereotypes are triggered in both high-prejudiced and low-prejudiced people – she presented each participant with a set of words, one at a time, so briefly that the words could not be consciously identified. Some of them saw neutral words (number, plant, remember) and others saw words stereotypically associated with blacks (welfare, jazz, busing). Devine hypothesized that although the stereotypical words were presented too briefly to be consciously recognized, they would nonetheless prime the participants’ stereotypes of blacks. To test this hypothesis, she presented the participants with a written description of an individual who acted in an ambiguously hostile manner (a feature of the African-American stereotype). In one incident, for example, the person refused to pay his rent until his apartment was repaired. Was he being needlessly belligerent or appropriately assertive?

The textbook describes the results as follows.

The results indicated that he was seen as more hostile – and more negative overall – by participants who had earlier been primed by words designed to activate stereotypes of blacks (words such as jazz, it’s important to note, that are not otherwise connected to the concept of hostility). Most important, this result was found equally for prejudiced and non-prejudiced participants.

Fact Checking

The sample consisted of 78 White subjects in the judgment condition (p. 11)

The description of the priming words in the textbook leaves out that derogatory, racist terms were included in the list of primes.

Replication 1 primes included the following: nigger, poor, afro, jazz, slavery, musical, Harlem, busing, minority, oppressed, athletic, and prejudice. Replication 2 primes included the following: Negroes, lazy, Blacks, blues, rhythm, Africa, stereotype, ghetto, welfare, basketball, unemployed, and plantation (p. 10).

The textbook also does not describe the conditions accurately. Rather than comparing a list of 100% race-related words to a list with 0%, the study compared lists with 80% versus 20% stereotypic and racist stimuli. Thus, even participants in the control condition were primed, just less often.

The mean ratings were submitted to a mixed-model ANOVA, with prejudice level (high vs. low), priming (20% vs. 80%), and replication (1 vs. 2) as between-subjects variables and scale (hostility related vs. hostility unrelated) as a within-subjects variable. The analysis revealed that the Priming X Scale interaction was significant, F(1, 70) = 5.04, p < .03 (p. 11).

The description of the results makes it impossible to compute a standardized effect size (standard deviations are not reported). The p-value is just significant and published results with p-values close to .05 often do not replicate (Open Science Collaboration, 2015).

Moreover, the results do not show that high-prejudice and low-prejudice participants independently show the effect. In fact, it is unlikely that follow-up tests would be significant because the overall effect is just significant, and power to get significant results decreases when each group is tested individually.

The analysis on hostility-related scales revealed only a significant priming main effect, F(1, 70) = 7.59, p < .008. The Prejudice Level x Priming interaction was nonsignificant, F(1, 70) = 1.19, p = .28.

Devine also makes the mistake of interpreting a non-significant result as evidence for the absence of an effect. That is, the interaction between prejudice level and priming was not significant, p = .28. This finding is used to support the claim that both groups show the same priming effect. However, an alternative explanation is that there is a difference between the groups, but the statistical test failed to show it (a false negative result or a type-II error). Again, to demonstrate that low-prejudice subjects were influenced by the priming manipulation, it would have been better to test the priming effect in the low-prejudice group alone. This was not done. To make matters worse, means are not reported separately for each group, so it is impossible to test this hypothesis post hoc. As a result, the article provides no empirical evidence for the claim that low-prejudice individuals' responses were influenced by subliminal activation of stereotypes.

The lack of empirical evidence in this seminal study would not be a problem if replication studies had provided better evidence for Devine's claims that are featured in the textbook. However, follow-up studies have produced different results. These follow-up studies are not mentioned on pages 391-393, although Lepore and Brown (1997) were cited earlier on page 122. The likely reason for the omission on pages 391-393 is that Lepore and Brown's (1997) findings contradict Devine's claim that unconscious bias is the same for high- and low-prejudice individuals.

Lepore and Brown

The article by Lepore and Brown (1997) is cited much less frequently than Devine’s (1989) article.

Studies 2 and 3 of their article are conceptual replications of Devine's study. The results seem to show that subliminal stereotype activation is possible, but they also contradict Devine's claim that the effect is the same for individuals who score high or low on a prejudice measure.

Study 2 differed from Devine’s study in the type of priming stimuli that were used.

In the prime condition, 13 words evocative of the category Black people were used. They were category labels themselves and neutral associates of the category, based on free responses in pretesting. The words used were as follows: Blacks, Afro-Caribbean, West Indians, colored, afro, dreadlocks, Rastafarian, reggae, ethnic, Brixton, Notting Hill, rap, and culture.

The sample size was small (51 participants), and participants were not selected with a prescreening procedure; groups were formed by a median split. Thus, the groups differed much less in prejudice levels than those in Devine's study.

The statistical analysis showed a 3-way interaction, F(1,47) = 6.07, p < .02, that was again just significant.

High-prejudice participants in the prime condition rated the target person more extremely on the negative construct (Ms = 6.76 vs. 5.88), t(46) = 3.43, p < .005, and less extremely on the positive construct (Ms = 6.31 vs. 6.88), t(46) = 2.22, p < .025. Low-prejudice participants increased their ratings on the positive scales (Ms = 6.98 vs. 6.54), t(46) = 1.69, p < .05, but showed no difference on the negative ones (Ms = 5.65 vs. 5.73).

These results are inconsistent with Devine, who claimed equal effects of primes for low and high prejudice participants (without showing evidence for it).

Study 3: A Conceptual Replication of Devine (1989)

The experiment was designed with 13 priming words. Three were category labels (i.e., Blacks, West Indians, and Afro-Caribbean), six were negative (i.e., nigger, rude, dirty, crime, unemployed, and drugs), and the remaining four were evocative of the category (i.e., dreadlocks, reggae, Brixton, and ethnic).

The sample size for this study was small (N = 45) and a median split was used to define groups of high and low prejudice.

The means show the pattern predicted by Devine that both groups increased negative ratings after priming with racist primes.

High-prejudice participants in fact significantly increased their ratings on the negative scales comparing the prime and no-prime conditions, t(40) = 2.62, p < .01.

However, …

The same comparison was not significant in the low-prejudice group, t(40) = 1.30, p < .10 [one-tailed].

Thus, even this conceptual replication study failed to provide evidence that low-prejudice individuals are affected by subliminal priming with racist primes.

Moreover, all of these published results are just significant. This is an unlikely outcome because statistical results are highly variable and should produce some non-significant and some highly significant results. When all p-values are clustered into the region of just significant results, it suggests that the published studies were selected from a larger set of studies that failed to produce significant results. Thus, it is unclear how robust these findings really are.

Although Devine’s study had a huge influence on social psychology and the notion of implicit racial bias, there are no credible, unbiased replication studies of this study. Moreover, subliminal priming in general may not be a robust and replicable phenomenon. However, social psychology textbooks hide these problems from students, and present unconscious bias as a scientifically proven reality. This blog post shows that the scientific evidence is much less consistent and robust than textbooks imply.

How many False Discoveries are Published in Psychology?

For decades psychologists have ignored statistics because the only knowledge required was that p-values less than .05 can be published and p-values greater than .05 cannot be published. Hence, psychologists used statistics programs to hunt for significant results without understanding the meaning of statistical significance.

Since 2011, psychologists have increasingly recognized that publishing only significant results is a problem (cf. Sterling, 1959). However, psychologists are confused about what to do instead. Many do not even know how to interpret p-values or what p < .05 means, as reflected by repetitive posts on social media suggesting that p-values or significance testing provide no information.

First, it doesn’t require a degree in math to understand what p < .05 means. The criterion value of alpha = .05 sets the upper limit for the rate of false positive results. For directional hypotheses, this means that no more than 5% of all hypothesis tests can produce a significant result, p < .05, when the population effect is zero or in the opposite direction of the effect suggested by the sample means (or the sign of the correlation in the sample).

That is, if a significant correlation in a sample is positive, the probability that the correlation in the population is zero or negative is at most 5%. Some readers will jump up and say that this statement ignores the prior probability of hypotheses being true or false. Please calm down and take a seat. The statement is not that the probability is exactly 5%. The exact probability is unknown. What is known is that the maximum probability is 5%; it could be less, but it cannot be more.

        NS    SIG    Sum
TRUE    0     0      0
FALSE   95    5      100
Sum     95    5      100

This is quite obvious when we look at probabilities in terms of long-run frequencies. The maximum probability of false positive results is reached when all hypotheses that are being tested are FALSE. In this case, there are zero true positives (TRUE & SIG) and five false positives (FALSE & SIG). Thus, the relative frequency of false positives in the set of all tests (k = 100) is 5/100 = 0.05.

Importantly, this is not an empirical observation. The maximum probability of false positive results is set by alpha and holds in the limit under the assumption that the statistical tests were conducted properly.
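A quick simulation in R illustrates this long-run frequency; the sample size and the number of simulated tests are arbitrary choices for illustration.

# Simulate many two-sample t-tests when the null hypothesis is true (no population effect).
set.seed(1)
n.tests <- 10000; n <- 20
p.values <- replicate(n.tests, t.test(rnorm(n), rnorm(n))$p.value)
mean(p.values < .05)    # long-run proportion of false positives, close to alpha = .05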

It is also important that the use of a long-run frequency to estimate probability assumes that we have no additional information about the study. For example, if we know that 19 studies before this study tested the same hypothesis and produced non-significant results, the probability of a false positive would be a lot higher. As I am not concerned about probabilities of single studies, but rather the risk of false discoveries in sets of studies, the controversy between Bayesians and Frequentists is irrelevant. Even with prior knowledge about hypotheses being true or false, we cannot expect more than 5% false positive results with alpha = .05.

A valid criticism of claiming p < .05 as an important finding is that we are not interested in the percentage of false positives for ALL statistical tests. We are rather more interested in the percentage of false discoveries. That is, how many of the significant results could be false positives?

It would be easy to answer this question if all hypothesis tests were published (Sterling, 1959). In this case, we would have information about the total number of significant results as a proportion of all statistical tests. In the table above, we see that we have only 5 significant results in 100 attempts. This is not very reassuring because we would expect 5 significant results by chance alone. Thus, the risk that the significant results are false discoveries is 5/5 = 100%.

It is interesting to examine scenarios with more discoveries. The next example shows 10 discoveries. As some of these discoveries are true discoveries, the percentage of false positives has to be less than 5. I used trial and error to find the maximum number of false positives.

        NS    SIG    Sum
TRUE    0     5.3    5.3
FALSE   90    4.7    94.7
Sum     90    10     100

The maximum percentage of false hypotheses is 94.7%, which produces 4.7% false positives. The remaining 5.3% of tests of true hypotheses contribute 5.3% significant results with 100% power. It is easy to see that this is a maximum because power cannot exceed 100%. Stated differently, the type-II error probability (TRUE & NS) cannot be less than zero. The false discovery risk is 4.7 / (4.7 + 5.3) = 4.7 / 10 = 47%. Again, this is not an estimate of the actual percentage of false discoveries, which is unknown. It is an estimate of the maximum percentage of false discoveries given the observation that 10 out of 100 hypothesis tests were significant.
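Trial and error is not actually necessary; the maximum number of false positives follows from the same logic in a few lines of R (assuming 100% power for true hypotheses and alpha = .05, as in the tables above).

# Maximum false positives and false discovery risk for a given number of discoveries out of k tests.
max.false.positives <- function(discoveries, k = 100, alpha = .05) {
  n.true <- (discoveries - alpha * k) / (1 - alpha)    # true hypotheses needed
  fp     <- alpha * (k - n.true)                       # false positives from false hypotheses
  c(true.hypotheses = n.true, false.positives = fp, fdr.risk = fp / discoveries)
}
round(max.false.positives(10), 2)   # ~5.26 true hypotheses, ~4.74 false positives, risk ~.47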

The false discovery risk decreases quickly when more significant results are observed.  

        NS    SIG    Sum
TRUE    0     15.8   15.8
FALSE   80    4.2    84.2
Sum     80    20     100

With 20 significant results, the false discovery risk is 4.2/20 = 21%.

The following table shows the relationship between percent of significant results (discoveries) and the false discovery risk. 

Discoveries (%)    False Discovery Risk (%)
5                  100
10                 47
15                 30
20                 21
25                 16
30                 12
35                 10
40                 8
45                 6
50                 5

The table suggests that researchers should aim for discovery rates (percentage of significant results) of 50% or more to keep the false discovery risk below 5%.

Estimating the False Discovery Risk in Psychology

The previous section showed that it is easy to estimate the maximum false discovery risk. The only problem in applying this approach is that the discovery rate in psychology is largely unknown, because psychologists publish only significant results and provide no information about the number of attempts that were made to get these significant results (Sterling, 1959).

Brunner and Schimmack (2018) developed z-curve, a statistical approach for estimating the percentage of missing non-significant results based on the test statistics of published significant results. Following Rosenthal (1979), these missing studies are called the file-drawer. Applying z-curve to focal tests of eminent social psychologists yields an estimate of 5 studies in file-drawers for every published significant result. This means the discovery rate is 17% (1 / (5 + 1)). Looking up the false discovery risk in the table suggests that up to 30% of published results could be false positives (the estimate of 55% in the figure below is based on a different definition of a false positive).

Implications

This blog post explains what p-values mean and how they can be interpreted as the maximum long-run probability to obtain a false positive result. However, it is important not to confuse the percentage of false positives with the false discovery risk. One percentage is FP / k. The other is FP / (FP + TP).

The blog post also shows how we can estimate the maximum false discovery risk based on alpha and the discovery rate (i.e., the percentage of significant results). This is much more meaningful, but this information is typically not available. As Sterling (1959) pointed out, if only significant results are published, they become meaningless because the false discovery risk is unknown. Thus, psychologists must start reporting the number of attempts they made to make their empirical results meaningful.

As long as psychology journals publish only discoveries, statistical estimation is the only way to estimate the false discovery risk in psychology. I presented one example of how the discovery rate can be estimated and what implications the estimate has for the false discovery risk in social psychology, where the false discovery risk is estimated to be 30%.

It is important to realize that false discoveries are defined here as mistakes about the sign or direction of an effect. Results with trivial effect sizes in the right direction are considered true positives. Thus, even a false discovery risk of 30% does not mean that 70% of all published results have practical significance, nor does it mean that 70% of published results can be replicated. Brunner and Schimmack are working on an alternative method that treats extremely low powered studies with true effects as false discoveries. This method produced an estimate of 55% false discoveries for eminent social psychologists.

The estimate of 30% false discoveries for social psychology suggests that Ioannidis (2005) was wrong when he claimed that most published results are wrong. His claims were based on hypothetical scenarios that are unrealistic. The present estimates are based on actual data and suggest that false discovery risks are less than 50%. Of course, even a false discovery risk of 30% is unacceptably high, but Ioannidis made a strong claim about false results without empirical support. I showed how this claim can be tested, and I presented data that suggest it is wrong. Moreover, social psychology is the worst discipline in psychology in this respect. Thus, estimates for other areas of psychology are likely to be lower. This would mean that most published results in psychology are not false in the sense of being false positive results.

More important is the demonstration that we do not need to make assumptions about the prior probability of hypotheses being right or wrong to make claims about false discovery risks. All we need is alpha and the discovery rate.

I am sure nothing I said is original in statistics, but it is original in the context of the endless debates about p-values and their interpretation in the social sciences. Psychology does not need new statistics; it needs credible information about discovery rates, and for that we need to end a culture of reporting only significant results.

Social Psychology Textbook audiT: Culture of Honor

The danger of bias is probably greatest when textbook writers write about their own research. Gilovich, Keltner, Chen, and Nisbett (2019; Social Psychology, 5th edition) could have chosen many research topics to illustrate how social psychologists conduct research, but they chose Nisbett’s work on the culture of honor.

We tie the methods of social psychology together by showing how many of them can be applied to a single problem: the nature of the “culture of honor.” (p. x).

The authors further suggest that this chapter is “oriented toward providing the critical thinking skills that are the hallmark of social psychology” (p. x).

We show how the tools of social psychology can be used to critique research in the behavioral and medical sciences that students encounter online and in magazines and newspapers.

Importantly, they do not promise to provide tools to think critically about social psychology or about studies presented in their textbook. Presumably, they are flawless.

CHAPTER 2

Chapter 2 starts with field experiments by Cohen and Nisbett (1997). The authors sent fictitious job applications to business owners in the North and the South of the United States. The applicant admitted that he had been convicted of a felony. The textbook describes the results as follows.

Cohen and Nisbett found some distinct patterns in the replies. Retailers from the South complied with the applicant's requests more than retailers from the North. And the notes from Southern business owners were much warmer and more sympathetic than those from the North.

The table from the original article shows that some of these differences were not statistically significant.

This would provide an opportunity to teach students about research practices in social psychology. How should a p-value of .06 be interpreted? Would the study have been published if all of the p-values were greater than .10?

The next example illustrates the use of surveys.

Nisbett and Cohen (1996) used surveys to try to find out why U.S. Southerners were more likely to commit homicide.

Southerners were more likely to favor violence in response to insults and to think that a man would be justified to fight an acquaintance who “looks over his girlfriend and talks to her in a suggestive way.”

No further information is provided how many homicides are due to insults or flirting or whether incidence rates for these types of homicides differ between the North and the South. The reason is probably that social psychologists rarely conduct correlational studies and do not value these types of studies.

The best way to be sure about causality is to conduct an experiment (p. 47)

The power of experiments is illustrated with Cohen, Nisbett, Bowdle, and Schwarz's (1996) studies of Northerners' and Southerners' reactions to an insult under standardized laboratory conditions (fortunately, no homicides were committed).

Participants were randomly assigned to an “Insult” and a control condition. In the insult condition, a confederate of the experimenter had to make room for the participant and acted annoyed; that is, he slammed a file-drawer shut and called the participant an A**hole.

The textbook describes the results as follows.

First, observers noted the participants’ immediate reactions after the insult. Insulted Southerners usually showed a flash of anger; insulted Northerners were more likely to shrug their shoulders or to appear amused.

Second, participants were asked to read a story in which a man made a pass at another man’s fiancee and then to provide an ending to the story. Southerners who had been insulted were much more likely to provide a violent ending to the story than Southerners who hadn’t been insulted, whereas the endings provided by Northerners were unaffected by the insult.

Third, the participants’ level of testosterone, the hormone that mediates aggression in males, was tested both before and after the insult. The level of testosterone increased for Southerners who had been insulted, but it did not increase for Southerners who hadn’t been insulted or for Northerners, whether insulted or not.

Fourth, participants were asked to walk back down the narrow hallway, and this time another assistant to the experimenter walked toward the participant. The assistant was very tall and muscular, and his instructions were to walk down the middle of the hall, forcing the participant to dodge out of his way. The dependent variable was how far away the participant was when he finally swerved out of the assistant's way. The investigators thought that the insulted Southerners would be put into such an aggressive mood that they would play "chicken" with the assistant, waiting until the last moment to swerve aside. And indeed they did. Northerners, whether insulted or not, swerved aside at a distance of about 5 feet (1.4 meters) from the assistant. Southerners, who are known for their politeness, stood aside at around 9 feet (2.75 meters) if not insulted, but pushed ahead until 3 feet away (less than 1 meter) if they had been insulted.

One could not imagine a more perfect result: four dependent variables all confirmed the authors' expectations. Unfortunately, what students are reading is exactly that: an author's idealized recollection of studies with many more dependent variables and more mixed results.

Fact Check 1. Insulted Southerners usually showed a flash of anger. 

Study 1
Southern participants tended to be more angry than northern participants (northern M = 2.34, southern M= 3.05), F(41) = 1.61, .10 < p < .15.
[A result with p > .10 is very rarely presented as marginally significant].
Experiment 2 (as well as Experiment 3, which is reported subsequently) yielded weak and inconsistent results regarding the emotional reaction to the bump

In short, none of the three studies showed a significant result supporting the textbook claim.

Fact Check 2.  Insulted Southerners’ Violent Ending

To examine the interaction between region and insult, we performed an analysis of variance (ANOVA) on a three-level variable (no violence, violence suggested, actual violence). Higher numbers indicated greater violence, and means were: southern insult = 2.30, southern control = 1.40, northern insult = 1.73, and northern control = 2.05, interaction F(1,78) = 7.65, p < .005.

This result was not replicated in Study 2 with a different scenario.

There was no effect for region, insult, or the interaction on whether participants expected the ambiguous scenarios to end with either physical or verbal aggression (all Fs < 1).

Fact Check 3. Hormone Levels

Hormones were only assessed in Study 2. Cortisol and testosterone showed just significant effects in an analysis with planned contrasts.

It is only after provocation that we expected southerners to show cortisol increases over the level of northerners and over the level of control groups. The appropriate contrast to test this prediction is +3, –1, –1, –1. This contrast—indicating that the effect of the insult was seen only for southerners, not for northerners—described the data well and was significant, t(165) = 2.14, p < .03.

As may be seen in Figure 2, testosterone levels rose 12% for insulted southerners and 4% for control southerners. They rose 6% for insulted northerners and 4% for control northerners. Again, we used the +3, –1, –1, –1 contrast indicating that change was expected only for insulted southerners. The contrast was significant at p < .03, t(165) = 2.19.

Fact Check 4. Chicken Game

The +3, –1, –1, –1 interaction contrast was significant, p < .001, t(142) = 3.45.

The chicken game is the only finding with a clear statistical result. However, even results like this need to be replicated to be convincing, especially if many dependent variables were used.

The authors also comment on dependent variables that did not produce significant results; these are not mentioned in the textbook.

It also is important to note that there were several measures—the neutral projective hostility tasks of Experiment 1, the ambiguous insult scenarios of Experiment 2, the shock-acceptance measure of Experiment 2, and the masculine protest items of Experiment 3—on which northerners and southerners were not differentially affected by the insult. These null results suggest that the insult did not create a generalized hostility or perceived threat to self that colored everything southern participants did or thought. Measures that were irrelevant or ambiguous with respect to issues of affront and status, that were uninvolving because they were paper-and-pencil, and that were ecologically unnatural did not show an effect of the insult. Instead, the effect of the affront was limited to situations that concerned issues of honor, were emotionally involving, and had actual consequences for the participant's masculine status and reputation.

Here the authors make the mistake of interpreting a pattern of significant and non-significant findings without testing whether these effects differ significantly from each other. Given how weak some of the significant results are, it is unlikely that they differ significantly from the non-significant ones; a rough version of such a test is sketched below.
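
As a rough illustration that a "significant" and a "non-significant" effect need not differ from each other, the following sketch compares two independent standardized effects, one just significant and one clearly not. The effect sizes and standard errors are invented for illustration and are not taken from the original article.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical standardized effects: one "significant", one "non-significant".
effect_a, se_a = 0.50, 0.24   # z = 2.08, p ~ .04  -> significant
effect_b, se_b = 0.15, 0.25   # z = 0.60, p ~ .55  -> non-significant

# Test of the *difference* between the two effects.
diff = effect_a - effect_b
se_diff = sqrt(se_a**2 + se_b**2)
z_diff = diff / se_diff
p_diff = 2 * norm.sf(abs(z_diff))

print(f"difference = {diff:.2f}, z = {z_diff:.2f}, p = {p_diff:.2f}")
# The difference itself is not significant (p ~ .31), even though
# one effect is significant and the other is not.
```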

Chapter 2 ends with a section on replication (see Schimmack, 2018). In this section, the authors emphasize the importance of replication studies.

One of the ways in which science is different from other modes of inquiry is the importance placed on replication. (p. 54).

However, the textbook does not mention any replication studies of Cohen et al.'s findings. An article from 2014 ("The Lost Cause? Examining the Southern Culture of Honor Through Defensive Gun Use") lists three articles by Cohen and colleagues as evidence from ethnographic experiments.

Closer inspection of the 1998 and 1999 articles shows that they were not experiments, but survey studies. Thus, to the best of my knowledge, the featured experiments were never replicated to see whether the results can be reproduced.

Conclusion 

The textbook gives the impression that clear results from an experiment provide important information about the causes of differences in homicide rates between the North and the South of the USA. Closer inspection shows that the results are far from clear. In addition, they provide only circumstantial evidence regarding the causes of homicide. The ability of laboratory experiments to illuminate causes of real-world phenomena like homicides is exaggerated. Even if Southerners respond with more aggression to insults, it does not mean that they are more willing to kill in these situations.

Ironically, the choice of this study to illustrate methods in social psychology is instructive. Social psychologists like to tell interesting stories and use data when they fit the story to give them the allure of being scientific. When the data do not fit the story, they are usually not reported. Students who received any introduction to the scientific method may realize that this is not how science works.

Replicability Audit of Norbert Schwarz

“Trust is good, but control is better”  

INTRODUCTION

Information about the replicability of published results is important because empirical results can only be used as evidence if the results can be replicated.  However, the replicability of published results in social psychology is doubtful.

Brunner and Schimmack (2018) developed a statistical method called z-curve to estimate how replicable a set of significant results would be if the studies were replicated exactly. In a replicability audit, I apply z-curve to the most cited articles of individual psychologists to estimate the replicability of their studies.

Norbert Schwarz

Norbert Schwarz is an eminent social psychologist (H-Index in WebofScience = 49).

He is best known for his influential article on "Mood as information" (Schwarz & Clore, 1983), which suggested that people use their momentary mood to judge their life-satisfaction. This claim has been challenged by life-satisfaction researchers (e.g., Eid & Diener, 2004), but for a long time there were no major replication attempts of the study. Recently, Yap et al. (2018) published 9 studies that failed to replicate this famous finding.

In collaboration with Strack, Schwarz also published two articles that demonstrated strong item-order effects on life-satisfaction judgments. These studies were featured in Nobel Laureate Daniel Kahneman's book "Thinking, Fast and Slow" (cf. Schimmack, 2018). However, these results also have failed to replicate in numerous studies (Schimmack & Oishi, 2005). Most recently, a large multi-lab replication project also failed to replicate the effect (ManyLabs2).

Schwarz is also known for developing a paradigm to show that people rely on the ease of recalling memories to make social judgments.  Once more, a large replication study failed to replicate the result (cf. Schimmack, 2019). 

Given this string of replication failures, it is of interest to see the average replicability of Schwarz’s published results.  

Data

I used WebofScience to identify the most cited articles by Norbert Schwarz  (datafile).  I then selected empirical articles until the number of coded articles matched the number of citations, resulting in 48 empirical articles (H-Index = 48). 

Norbert Schwarz co-authored several articles with Lawrence Sanna, who resigned from his academic position amid concerns about data manipulation (Young, 2012). However, the articles co-authored with Norbert Schwarz have not been retracted and contribute to Schwarz's H-Index. Therefore, I included these articles in the analysis.

The 48 articles reported 109 studies. The total number of participants was 28,606, with a median of 85 participants per study. For each study, I identified the most focal hypothesis test (MFHT); 4 studies did not report a focal test, and 1 study reported a failure to replicate a finding without a statistical result. The result of each test was converted into an exact p-value, and the p-value was then converted into a z-score. The z-scores were submitted to a z-curve analysis to estimate the mean power of the 95 results that were significant at p < .05 (two-tailed). The remaining 9 results were interpreted as evidence using lower standards of significance. Thus, the success rate for the 104 reported hypothesis tests was 100%. This high success rate is common in psychology (Sterling, 1959).
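
To illustrate the conversion step, the following sketch turns a reported test statistic into an exact two-tailed p-value and then into the absolute z-score that serves as input for the z-curve analysis. The test statistic is a made-up example, not one of the coded results.

```python
from scipy import stats

# Hypothetical focal test result: t(85) = 2.40 (made-up example).
t_value, df = 2.40, 85

# Exact two-tailed p-value for the reported test statistic.
p = 2 * stats.t.sf(abs(t_value), df)

# Convert the p-value into an absolute z-score (the z-curve input).
z = stats.norm.isf(p / 2)

print(f"p = {p:.3f}, z = {z:.2f}")  # approximately p = .019, z = 2.35
```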

The z-curve estimate of replicability is 39% with a 95%CI ranging from 21% to 56%.  Thus, z-curve predicts that only 39% of these studies would produce a significant result if they were replicated exactly. This estimate is consistent with the average for social psychology. However, actual replication attempts have an even lower success rate of 25% (Open Science Collaboration, 2015).
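
The following simulation sketch illustrates what this estimate means: when results are selected for significance, the expected success rate of exact replications equals the mean power of the selected studies, which is the quantity z-curve estimates from the published significant results. The distribution of true effects is invented for the illustration; it is not the Schwarz data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical true effects (non-centrality) in z-score units for many studies.
true_z = rng.uniform(0.0, 3.0, size=100_000)

# Original studies: observed z = true z + standard normal sampling noise.
z_obs = true_z + rng.standard_normal(true_z.size)
selected = z_obs > 1.96            # only significant results are "published"

# Mean power of the selected studies (probability of z > 1.96 on a redo).
mean_power = norm.sf(1.96 - true_z[selected]).mean()

# Exact replications of the selected studies: same true z, new noise.
z_rep = true_z[selected] + rng.standard_normal(selected.sum())
replication_rate = (z_rep > 1.96).mean()

print(f"mean power of selected studies: {mean_power:.2f}")
print(f"observed replication success:   {replication_rate:.2f}")
# The two numbers agree; this is what the replicability estimate refers to.
```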

The histogram of z-values shows the distribution of observed z-scores (blue line) and the predicted density distribution (grey line). The predicted density distribution is also projected into the range of non-significant results. The area under the grey curve in this non-significant range is an estimate of the file drawer of studies that is needed to produce the observed distribution of significant results. Approximately 3 unpublished studies with non-significant results are expected for each published significant result.

This estimate has important implications for the risk of a false-positive result. A significance level of 5% ensures that, at most, 5% of all conducted studies produce a false-positive result (i.e., a significant result when the effect size is exactly zero or in the opposite direction). This information is useless when only significant results are published (Sterling, 1959). With an estimate of the file drawer, we see that about 400 studies were needed to produce the 100 significant results. Thus, the worst-case number of false positives is 400 * 5% = 20 studies; that is, up to 20 of the 100 significant results could be false positives. The arithmetic is sketched below.
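
A minimal sketch of this worst-case arithmetic, assuming the file-drawer ratio of roughly three unpublished studies per published significant result mentioned above:

```python
# Worst-case false-positive arithmetic based on the file-drawer estimate.
alpha = 0.05
significant_published = 100     # published significant results (illustrative scale)
file_drawer_ratio = 3           # unpublished non-significant studies per published result

total_studies = significant_published * (1 + file_drawer_ratio)   # 400
max_false_positives = total_studies * alpha                       # 20
max_false_discovery_rate = max_false_positives / significant_published

print(f"total studies run:        {total_studies:.0f}")
print(f"max false positives:      {max_false_positives:.0f}")
print(f"max false discovery rate: {max_false_discovery_rate:.0%}")  # 20%
```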

Z-curve also provides another measure of the maximum number of false positives. Z-curve is fitted to the data with fixed percentages of false positives; as long as a model with a given percentage still fits the data, that percentage of false positives cannot be ruled out. This approach suggests that no more than 40% of the significant results are strictly false positives. Given the small number of studies, the estimate is not very precise (95%CI = 10-70%).

Although these estimates suggest that most reported results are not strictly false positives, some of the true positives may have trivial effect sizes and be difficult to replicate. Z-curve provides information about which results are likely to replicate based on the strength of the evidence against the null hypothesis. The local estimates of power below the x-axis show that z-scores between 2 and 2.5 have a mean power of only 29%. These results are least likely to replicate. Only z-scores greater than 3.5 start having a replicability of more than 50%. The datafile shows which studies fall into this category.
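
As a rough intuition for these local estimates, the following sketch computes the probability of obtaining another significant result if the observed z-score were the true strength of the evidence. This naive calculation ignores the selection for significance that z-curve corrects for, which is why it returns higher values (about 50% at z = 2) than the bias-corrected local estimates reported above.

```python
from scipy.stats import norm

# Naive replication power if the observed z-score equalled the true effect
# (no correction for selection for significance, unlike z-curve).
for z_obs in (2.0, 2.5, 3.0, 3.5, 4.0):
    power = norm.sf(1.96 - z_obs)
    print(f"observed z = {z_obs:.1f} -> naive replication power = {power:.2f}")
# e.g., z = 2.0 -> 0.52, z = 2.5 -> 0.71, z = 3.5 -> 0.94
```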

CONCLUSION

The analysis of Norbert Schwarz's published results provides clear evidence that questionable research practices were used. The size of the file drawer suggests that up to 20% of the significant results could be false positives.

It is important to emphasize that Norbert Schwarz and colleagues followed accepted practices in social psychology and did nothing unethical by the lax standards of research ethics in psychology. That is, he did not commit research fraud.

The low average replicability is also consistent with estimates for social psychology, especially when the focus is on between-subject experiments.

DISCLAIMER 

It is nearly certain that I made some mistakes in the coding of Norbert Schwarz’s articles. However, it is important to distinguish consequential and inconsequential mistakes. I am confident that I did not make consequential errors that would alter the main conclusions of this audit. However, control is better than trust and everybody can audit this audit.  The data are openly available and the z-curve code is also openly available.  Thus, this replicability audit is fully transparent and open to revision.

If you found this audit interesting, you might also be interested in other replicability audits (Replicability Audits).