Dr. Ulrich Schimmack’s Blog about Replicability

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication (Cohen, 1994).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY:  In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of that study with the same sample size and significance criterion.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2017 Blog Posts:

(October 24, 2017)
Preliminary 2017 Replicability Rankings of 104 Psychology Journals

(September 19, 2017)
Reexamining the experiment to replace p-values with the probability of replicating an effect

(September 4, 2017)
The Power of the Pen Paradigm: A Replicability Analysis

(August 2, 2017)
What would Cohen say: A comment on p < .005 as the new criterion for significance

(April 7, 2017)
Hidden Figures: Replication failures in the stereotype threat literature

(February 2, 2017)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?
RR No3. (September 4, 2017) The power of the pen paradigm: A replicability analysis

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

TOP TEN LIST

1.  Preliminary 2017  Replicability Rankings of 104 Psychology Journals
Rankings of 104 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2017, and a comparison of psychological disciplines (cognitive, clinical, social, developmental, biological).

2.  Z-Curve: Estimating replicability for sets of studies with heterogeneous power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.

3. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

4.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.  Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking, Fast and Slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.

8.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

A Clarification of P-Curve Results: The Presence of Evidence Does Not Imply the Absence of Questionable Research Practices

This post is not a criticism of p-curve.  The p-curve authors have been very clear in their writing that p-curve is not designed to detect publication bias.  However, numerous articles make the surprising claim that they used p-curve to test publication bias.  The purpose of this post is to simply correct a misunderstanding of p-curve.

Questionable Research Practices and Excessive Significance

Sterling (1959) pointed out that psychology journals have a surprisingly high success rate. Over 90% of articles reported statistically significant results in support of authors’ predictions.  This success rate would be surprising, even if most predictions in psychology are true.  The reason is that the results of a study are not only influenced by cause-effect relationships.  Another factor that influences the outcome of a study is sampling error.  Even if researchers are nearly always right in their predictions, some studies will fail to provide sufficient evidence for the predicted effect because sampling error makes it impossible to detect the effect.  The probability that a study detects a true effect is called power.  Just like bigger telescopes are needed to detect more distant stars with a weaker signal, bigger sample sizes are needed to detect small effects (Cohen, 1962; 1988).  Sterling et al. (1995) pointed out that the typical power of studies in psychology does not justify the high success rate in psychology journals.  In other words, the success rate was too good to be true.  This means that published articles are selected for significance.
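
To see why such success rates are implausible, consider a hedged numerical illustration (the numbers are assumed for illustration, not taken from Sterling): if the median power of published studies were 50%, observing 90 or more significant results in 100 published studies would be essentially impossible without selection.

# Probability of 90 or more significant results in 100 studies,
# if each study has 50% power (assumed value for illustration):
1 - pbinom(89, size = 100, prob = .50)   # roughly 1.5e-17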

The bias in favor of significant results is typically called publication bias (Rosenthal, 1979).  However, the term publication bias does not explain the discrepancy between estimates of statistical power and success rates in psychology journals.  John et al. (2012) listed a number of questionable research practices that can inflate the percentage of significant results in published articles.

One mechanism is simply not to report non-significant results.  Rosenthal (1979) suggested that non-significant results end up in the proverbial file-drawer.  That is, a whole data set remains unpublished.  The other possibility is that researchers use multiple exploratory analyses to find a significant result and do not disclose their fishing expedition.  These practices are now widely known as p-hacking.

Unlike John et al. (2012), the p-curve authors make a big distinction between not disclosing an entire dataset (publication bias) and not disclosing all statistical analyses of a dataset (p-hacking).

QRP = Publication Bias + P-Hacking

We Don’t Need Tests of Publication Bias

The p-curve authors assume that publication bias is unavoidable.

“Journals tend to publish only statistically significant evidence, creating a scientific record that markedly overstates the size of effects. We provide a new tool that corrects for this bias without requiring access to nonsignificant results.”  (Simonsohn, Nelson, Simmons, 2014).

“By the Way, of Course There is Publication Bias. Virtually all published studies are significant (see, e.g., Fanelli, 2012; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995), and most studies are underpowered (see, e.g., Cohen, 1962). It follows that a considerable number of unpublished failed studies must exist. With this knowledge already in hand, testing for publication bias on paper after paper makes little sense” (Simonsohn, 2012, p. 597).

“Yes, p-curve ignores p>.05 because it acknowledges that we observe an unknowably small and non-random subset of p-values >.05.”  (personal email, January 18, 2015).

I hope these quotes make it crystal clear that p-curve is not designed to examine publication bias because the authors assume that selection for significance is unavoidable.  Any statistical test that reveals no evidence of publication bias is a false negative result because the sample size was not large enough to detect it.

Another concern by Uri Simonsohn is that bias tests may reveal statistically significant bias that has no practical consequences.

Consider a literature with 100 studies, all with p < .05, but where the implied statistical power is “just” 97%. Three expected failed studies are missing. The test from the critiques would conclude there is statistically significant publication bias; its magnitude, however, is trivial (Simonsohn, 2012, p. 598).

k.sig = 100; k.studies = 100; power = .97; pbinom(k.studies - k.sig, k.studies, 1 - power)   # = 0.048

This is a valid criticism that applies to all p-values.  A p-value only provides information about the contribution of random sampling error.  A p-value of .048 suggests that it is unlikely to observe only significant results, even if 100 studies have 97% power to show a significant result.   However, with 97% observed power, the 100 studies provide credible evidence for an effect and even the inflation of the average effect size is minimal.

A different conclusion would follow from a p-value less than .05 in a set of 7 studies that all show significant results.

k.sig = 7; k.studies = 7; power = .62; pbinom(k.studies - k.sig, k.studies, 1 - power)   # = 0.035

Rather than showing small bias with a large set of studies, this finding shows large bias with a small set of studies.  P-values do not distinguish between these two scenarios. Both outcomes are equally unlikely.  Thus, information about the probability of an event should always be interpreted in the context of the effect.  The effect size is simply the difference between the expected and observed rate of significant results.  In Simonsohn’s example, the effect size is small (1 – .97 = .03).  In the second example, the discrepancy is large (1 – .62 = .38).

The previous scenarios assume that only significant results are reported. However, in sciences that use preregistration to reduce deceptive publishing practices (e.g., medicine), non-significant results are more common.  When non-significant results are reported, bias tests can be used to assess the extent of bias.

For example, a literature may report 10 studies with only 4 significant results and the median observed power is 30%.  In this case, the bias is small (.40 – .30 = .10) and a conventional meta-analysis would produce only slightly inflated estimates of the average effect size.  In contrast, p-curve would discard over 50% of the studies because it assumes that the non-significant results are not trustworthy.  This is an unnecessary loss of information that could be avoided by testing for publication bias.
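
A minimal sketch of this comparison in R, using the hypothetical numbers from the example above (the binomial check follows the same logic as an excess-significance test and is not part of p-curve):

k.studies = 10; k.sig = 4; median.obs.power = .30
success.rate = k.sig / k.studies
bias = success.rate - median.obs.power                 # .40 - .30 = .10
# Probability of 4 or more significant results if the true power were 30%:
1 - pbinom(k.sig - 1, k.studies, median.obs.power)     # about .35, no significant excess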

In short, p-curve assumes that publication bias is unavoidable. Hence, tests of publication bias are unnecessary and non-significant results should always be discarded.

Why Do P-Curve Users Think P-Curve is a Publication Bias Test?

Example 1

I conducted a literature search for studies that used p-curve and was surprised by numerous claims that p-curve is a test of publication bias.

Simonsohn, Nelson, and Simmons (2014a, 2014b, 2016) and Simonsohn, Simmons, and Nelson (2015) introduced pcurve as a method for identifying publication bias (Steiger & Kühberger, 2018, p. 48).   

However, the authors do not explain how p-curve detects publication bias. Later on, they correctly point out that p-curve is a method that can correct for publication bias.

P-curve is a good method to correct for publication bias, but it has drawbacks. (Steiger & Kühberger, 2018, p. 48).   

Thus, the authors seem to confuse detection of publication bias with correction for publication bias.  P-curve corrects for publication bias, but it does not detect publication bias; it assumes that publication bias is present and a correction is necessary.

Example 2

An article in the medical journal JAMA Psychiatry also claimed to have used p-curve, among other methods, to assess publication bias.

Publication bias was assessed across all regions simultaneously by visual inspection of funnel plots of SEs against regional residuals and by using the excess significance test,  the P-curve method, and a multivariate analogue of the Egger regression test (Bruger & Howes, 2018, p. 1106).  

After reporting the results of several bias tests, the authors report the p-curve results.

P-curve analysis indicated evidential value for all measures (Bruger & Howes, 2018, p. 1106).

The authors seem to confuse presence of evidential value with absence of publication bias. As discussed above,  publication bias can be present even if studies have evidential value.

Example 3

To assess publication bias, we considered multiple indices. Specifically, we evaluated Duval and Tweedie’s Trim and Fill Test, Egger’s Regression Test, Begg and Mazumdar Rank Correlation Test, Classic Fail-Safe N, Orwin’s Fail-Safe N, funnel plot symmetry, P-Curve Tests for Right-Skewness, and Likelihood Ratio Test of Vevea and Hedges Weight-Function Model.

As in the previous example, the authors confuse evidence for evidential value (a significant right-skewed p-curve) with evidence for the absence of publication bias.

Example 4

The next example even claims that p-curve can be used to quantify the presence of bias.

Publication bias was investigated using funnel plots and the Egger regression asymmetry test. Both the trim and fill technique (Duval & Tweedie, 2000) and p-curve (Simonsohn, Nelson, & Simmons, 2014a, 2014b) technique were used to quantify the presence of bias (Korrel et al., 2017, p. 642).

The actual results section only reports that the p-curve is right skewed.

The p-curve for the remaining nine studies (p < .025) was significantly right skewed (binomial test: p = .002; continuous test full curve: Z = -9.94, p < .0001, and half curve Z = -9.01, p < .0001) (Korrel et al., 2017, p. 642)

These results do not assess or quantify publication bias.  One might consider the reported z-scores a quantitative measure of evidential value, as larger z-scores are less probable under the nil-hypothesis that all significant results are false positives. Nevertheless, strong evidential value (e.g., 100 studies with 97% power) does not imply that publication bias is absent, nor does it mean that publication bias is small.

A set of 1000 studies with 10% power is expected to produce 900 non-significant results and 100 significant results.  Removing the non-significant results produces large publication bias, but a p-curve analysis shows strong evidence against the nil-hypothesis that all studies are false positives.

set.seed(3)
Z = rnorm(1000, qnorm(.10, 1.96))               # 1,000 studies with 10% power (mean z = 0.68)
Z.sig = Z[Z > 1.96]                             # only the significant results are reported
Stouffer.Z = sum(Z.sig - 1.96) / sqrt(length(Z.sig))
Stouffer.Z                                      # = 4.89

The reason is that p-curve is a meta-analysis and the results depend on the strength of evidence in individual studies and the number of studies.  Strong evidence can be the result of many studies with weak evidence or a few studies with strong evidence.  Thus, p-curve is a meta-analytic method that combines information from several small studies to draw inferences about a population parameter.  The main difference to older meta-analytic methods is that older methods assumed that publication bias is absent, whereas p-curve assumes that publication bias is present. Neither method assesses whether publication bias is present, nor do they quantify the amount of publication bias.

Example 5

Sala and Gobet (2017) explicitly make the mistake of equating evidence for evidential value with evidence against publication bias.

Finally, a p-curve analysis was run with all the p values < .05 related to positive effect sizes (Simonsohn, Nelson, & Simmons, 2014). The results showed evidential values (i.e., no evidence of publication bias), Z(9) = -3.39, p = .003.  (p. 676).

As discussed in detail before, this is not a valid inference.

Example 6

Ironically, the interpretation of p-curve results as evidence that there is no publication bias contradicts the fundamental assumption of p-curve that we can safely assume that publication bias is always present.

The danger is that misuse of p-curve as a test of publication bias may give the false impression that psychological scientists are reporting their results honestly, while actual bias tests show that this is not the case.

It is therefore problematic if authors in high impact journals (not necessarily high quality journals) claim that they found evidence for the absence of publication bias based on a p-curve analysis.

To check whether this research field suffers from publication bias, we conducted p-curve analyses (Simonsohn, Nelson, & Simmons, 2014a, 2014b) on the most extended data set of the current meta-analysis (i.e., psychosocial correlates of the dark triad traits), using an on-line application (www.p-curve.com). As can be seen in Figure 2, for each of the dark triad traits, we found an extremely right-skewed p-curve, with statistical tests indicating that the studies included in our meta-analysis, indeed, contained evidential value (all ps < .001) and did not point in the direction of inadequate evidential value (all ps non-significant). Thus, it is unlikely that the dark triad literature is affected by publication bias (Muris, Merckelbach, Otgaar, & Meijer, 2017).

Once more, presence of evidential value does not imply absence of publication bias!

Evidence of P-Hacking  

Publication bias is not the only reason for the high success rates in psychology.  P-hacking will also produce more significant results than the actual power of studies warrants. In fact, the whole purpose of p-hacking is to turn non-significant results into significant ones.  Most bias tests do not distinguish between publication bias and p-hacking as causes of bias.  However, the p-curve authors make this distinction and claim that p-curve can be used to detect p-hacking.

Apparently, we are not supposed to assume that p-hacking is just as prevalent as publication bias; otherwise, testing for p-hacking would be just as irrelevant as testing for publication bias.

The problem is that it is a lot harder to distinguish p-hacking and publication bias than the p-curve authors imply, and their p-curve test of p-hacking works only under very limited conditions.  Most of the time, the p-curve test of p-hacking will fail to provide evidence for p-hacking, and this result can be misinterpreted as evidence that results were obtained without p-hacking, which is a logical fallacy.

This mistake was made by Winternitz, Abbate, Huchard, Havlicek, & Gramszegi (2017).

Fourth and finally, as bias for publications with significant results can rely more on the P-value than on the effect size, we used the Pcurve method to test whether the distribution of significant P-values, the ‘P-curve’, indicates that our studies have evidential value and are free from ‘p-hacking’ (Simonsohn et al. 2014a, b).

The problem is that the p-curve test of p-hacking only works when evidential value is very low and for some specific forms of p-hacking. For example, researchers can p-hack by testing many dependent variables. Selecting significant dependent variables is no different from running many studies with a single dependent variable and selecting entire studies with significant results; it is just more efficient.  The p-curve would not show the left-skewed p-curve that is considered diagnostic of p-hacking.
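
A small simulation sketch of this point (assumed setup: five independent dependent variables per study, no true effect, and only the smallest p-value is reported when it is significant):

set.seed(123)
n.studies = 10000
n.dv = 5
# Under H0, p-values are uniform; p-hacking keeps the best of 5 per study.
p.best = replicate(n.studies, min(runif(n.dv)))
p.sig = p.best[p.best < .05]                    # only "successful" studies are reported
# Share of reported significant p-values in the five p-curve bins:
round(table(cut(p.sig, breaks = seq(0, .05, .01))) / length(p.sig), 2)
# The bins are roughly equal (about .22, .21, .20, .19, .18), i.e., the curve is flat or
# slightly right-skewed, not the left-skewed shape treated as diagnostic of p-hacking.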

Even a flat p-curve would merely show lack of evidential value, but it would be wrong to assume that p-hacking was not used.  To demonstrate this I submitted the results from Bem’s (2011) infamous “feeling the future” article to a p-curve analysis (http://www.p-curve.com/).

[Figure: pcurve.bem.png]

The p-curve analysis shows a flat p-curve.  This shows lack of evidential value under the assumption that questionable research practices were used to produce 9 out of 10 significant (p < .05, one-tailed) results.  However, there is no evidence that the results are p-hacked if we were to rely on a left-skewed p-curve as evidence for p-hacking.

One possibility would be that Bem did not p-hack his studies. However, this would imply that he ran 20 studies for each significant result. With sample sizes of 100 participants per study, this would imply that he tested 20,000 participants.  This seems unrealistic, and Bem states that he reported all studies that were conducted.  Moreover, analyses of the raw data showed peculiar patterns that suggest some form of p-hacking was used.  Thus, this example shows that p-curve is not very effective in revealing p-hacking.

It is also interesting that the latest version of p-curve, p-curve4.06, no longer tests for left-skewedness of distributions and doesn’t mention p-hacking.  This change in p-curve suggests that the authors realized the ineffectiveness of p-curve in detecting p-hacking (I didn’t ask the authors for comments, but they are welcome to comment here or elsewhere on this change in their app).

It is problematic if meta-analysts assume that p-curve can reveal p-hacking and infer from a flat or right-skewed p-curve that the data are not p-hacked.  This inference is not warranted because absence of evidence is not the same as evidence of absence.

Conclusion

P-curve is a family of statistical tests for meta-analyses of sets of studies.  One version is an effect size meta-analysis; others test the nil-hypothesis that the population effect size is zero.  The novel feature of p-curve is that it assumes that questionable research practices undermine the validity of traditional meta-analyses that assume no selection for significance. To correct for the assumed bias, observed test statistics are corrected for selection bias (i.e., p-values between .05 and 0 are multiplied by 20 to produce p-values between 0 and 1 that can be analyzed like unbiased p-values).  Just like regular meta-analysis, the main result of a p-curve analysis is a combined test-statistic or effect size estimate that can be used to test the nil-hypothesis.  If the nil-hypothesis can be rejected, p-curve analysis suggests that some effect was observed.  Effect size p-curve also provides an effect size estimate for the set of studies that produced significant results.
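
The correction described here can be sketched in a few lines of R (with assumed example p-values):

p = c(.001, .011, .033, .049)   # hypothetical significant two-sided p-values
pp = p * 20                     # equivalently p / .05; uniform between 0 and 1 under H0 given selection
pp                              # 0.02 0.22 0.66 0.98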

Just like regular meta-analyses, p-curve is not a bias test. It does not test whether publication bias exists and it fails as a test of p-hacking under most circumstances. Unfortunately, users of p-curve seem to be confused about the purpose of p-curve or make the logical mistake to infer from the presence of evidence that questionable research practices (publication bias; p-hacking) are absent. This is a fallacy.  To examine the presence of publication bias, researchers should use existing and validated bias tests.


Can the Bayesian Mixture Model Estimate the Percentage of False Positive Results in Psychology Journals?

A method revolution is underway in psychological science.  In 2011, an article published in JPSP-ASC made it clear that experimental social psychologists were publishing misleading p-values because researchers violated basic principles of significance testing  (Schimmack, 2012; Wagenmakers et al., 2011).  Deceptive reporting practices led to the publication of mostly significant results, while many non-significant results were not reported.  This selective publishing of results dramatically increases the risk of a false positive result from the nominal level of 5% that is typically claimed in publications that report significance tests  (Sterling, 1959).

Although experimental social psychologists think that these practices are defensible, no statistician would agree with them.  In fact, Sterling (1959) already pointed out that the success rate in psychology journals is too high and that claims about statistical significance are meaningless.  Similar concerns were raised again within psychology (Rosenthal, 1979), but deceptive practices remain acceptable to this day (Kitayama, 2018). As a result, most published results in social psychology do not replicate and cannot be trusted (Open Science Collaboration, 2015).

For non-methodologists it can be confusing to make sense of the flood of method papers that have been published in the past years.  It is therefore helpful to provide a quick overview of methodological contributions concerned with detection and correction of biases.

First, some methods focus on effect sizes (pcurve2.0; puniform), whereas others focus on strength of evidence (Test of Excessive Significance; Incredibility Index; R-Index; Pcurve2.1; Pcurve4.06; Zcurve).

Another important distinction is between methods that assume a fixed parameter and methods that allow for heterogeneity.   If all studies have a common effect size or the same strength of evidence, it is relatively easy to demonstrate bias and to correct for bias (Pcurve2.1; Puniform; TES).  However, heterogeneity in effect sizes or sampling error produces challenges.  Relatively few methods have been developed for this challenging, yet realistic scenario.  For example, Ioannidis and Trikalinos (2005) developed a method to reveal publication bias that assumes a fixed effect size across studies, while allowing for variation in sampling error, but this method can be biased if there is heterogeneity in effect sizes.  In contrast, I developed the Incredibility Index (also called Magic Index) to allow for heterogeneity in effect sizes and sampling error (Schimmack, 2012).

Following my work on bias detection in heterogeneous sets of studies, I started working with Jerry Brunner on methods that can estimate average power of a heterogeneous set of studies that are selected for significance.  I first published this method on my blog in June 2015, when I called it post-hoc power curves.   These days, the term Zcurve is used more often to refer to this method.  I illustrated the usefulness of Zcurve in various posts in the Psychological Methods Discussion Group.

In September 2015, I posted replicability rankings of social psychology departments using this method. The post generated a lot of discussion and a question about the method.  Although the details were still unpublished, I described the main approach of the method.  To deal with heterogeneity, the method uses a mixture model.

[Figure: EJ.Mixture.png]

In 2016, Jerry Brunner and I submitted a manuscript for publication that compared four methods for estimating the average power of heterogeneous studies selected for significance (Puniform1.1; Pcurve2.1; Zcurve; and a Maximum Likelihood Method).  In this comparison, the mixture model, Zcurve, outperformed the other methods, including a maximum-likelihood method developed by Jerry Brunner. The manuscript was rejected by Psychological Methods.

In 2017, Gronau, Duizer, Bakker, and Eric-Jan Wagenmakers published an article titled “A Bayesian Mixture Modeling of Significant p Values: A Meta-Analytic Method to Estimate the Degree of Contamination From H0”  in the Journal of Experimental Psychology: General.  The article did not mention z-curve, presumably because it was not published in a peer-reviewed journal.

Although a reference to our mixture model would have been nice, the Bayesian Mixture Model differs in several ways from Zcurve.  This blog post examines the similarities and differences between the two mixture models, it shows that BMM fails to provide useful estimates with simulations and social priming studies, and it explains why BMM fails. It also shows that Zcurve can provide useful information about replicability of social priming studies, while the BMM estimates are uninformative.

Aims

The Bayesian Mixture Model (BMM) and Zcurve have different aims.  BMM aims to estimate the percentage of false positives (significant results with an effect size of zero). This percentage is also called the False Discovery Rate (FDR).

FDR = False Positives / (False Positives + True Positives)

Zcurve aims to estimate the average power of studies selected for significance. Importantly, Brunner and Schimmack use the term power to refer to the unconditional probability of obtaining a significant result and not the common meaning of power as being conditional on the null-hypothesis being false. As a result, Zcurve does not distinguish between false positives with a 5% probability of producing a significant result (when alpha = .05) and true positives with an average probability between 5% and 100% of producing a significant result.

Average unconditional power is simply the percentage of false positives times alpha plus the average conditional power of true positive results (Sterling et al., 1995).

Unconditional Power = False Positives * Alpha + True Positives * Mean(1 – Beta)

Zcurve therefore avoids the thorny issue of defining false positives and trying to distinguish between false positives and true positives with very small effect sizes and low power.
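
A minimal numerical sketch of this definition (the mixture proportions and power value are assumed for illustration):

prop.fp = .50          # proportion of false positives
alpha = .05
mean.power.tp = .80    # average conditional power of the true positives (assumed)
prop.fp * alpha + (1 - prop.fp) * mean.power.tp   # unconditional power = .425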

Approach 

BMM and zcurve use p-values as input.  That is, they ignore the actual sampling distribution that was used to test statistical significance.  The only information that is used is the strength of evidence against the null-hypothesis; that is, how small the p-value actually is.

The problem with p-values is that they have a specified sampling distribution only when the null-hypothesis is true. When the null-hypothesis is true, p-values have a uniform sampling distribution.  However, this is not useful for a mixture model, because a mixture model assumes that the null-hypothesis is sometimes false and the sampling distribution for true positives is not defined.

Zcurve solves this problem by using the inverse normal distribution to convert all p-values into absolute z-scores (abs(z) = -qnorm(p/2)).  Absolute z-scores are used because F-tests and two-sided t-tests do not have a sign, and a test statistic of 0 corresponds to a p-value of 1.  Thus, the results do not say anything about the direction of an effect, while the size of the p-value provides information about the strength of evidence.

BMM also transforms p-values. The only difference is that BMM uses the full normal distribution with positive and negative z-scores  (z = qnorm(p)). That is, a p-value of .5 corresponds to a z-score of zero; p-values greater than .5 would be positive, and p-values less than .5 are assigned negative z-scores.  However, because only significant p-values are selected, all z-scores are negative in the range from -1.65 (p = .05, one-tailed) to negative infinity (p = 0).
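
A small sketch of the two transformations just described, applied to a few assumed two-sided p-values:

p = c(.049, .01, .001)
-qnorm(p / 2)   # Zcurve: absolute z-scores, about 1.97, 2.58, 3.29 (direction is ignored)
qnorm(p)        # BMM: full normal, about -1.65, -2.33, -3.09 (significant p-values are all negative)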

The non-centrality parameter (i.e., the true parameter that generates the sampling distribution) is simply the mean of the normal distribution. For the null-hypothesis and false positives, the mean is zero.

Zcurve and BMM differ in the modeling of studies with true positive results that are heterogeneous.  Zcurve uses several normal distributions with a standard deviation of 1 that reflects sampling error for z-tests.  Heterogeneity in power is modeled by varying means of normal distributions, where power increases with increasing means.

BMM uses a single normal distribution with varying standard deviation.  A wider distribution is needed to predict large observed z-scores.

The main difference between Zcurve and BMM is that Zcurve either does not have fixed means (Brunner & Schimmack, 2016) or has fixed means, but does not interpret the weight assigned to a mean of zero as an estimate of false positives (Schimmack & Brunner, 2018).  The reason is that the weights attached to individual components are not very reliable estimates of the weights in the data-generating model.  Importantly, this is not relevant for the goal of Zcurve to estimate average power because the weighted average of the components of the model is a good estimate of the average true power in the data-generating model, even if the weights do not match the weights of the data-generating model.

For example, Zcurve does not care whether 50% average power is produced by a mixture of 50% false positives and 50% true positives with 95% power or 50% of studies with 20% power and 50% studies with 80% power. If all of these studies were exactly replicated, they are expected to produce 50% significant results.

BMM uses the weights assigned to the standard normal with a mean of zero as an estimate of the percentage of false positive results.  It does not estimate the average power of true positives or average unconditional power.

Given my simulation studies with zcurve, I was surprised that BMM could solve a problem that I had encountered: the weights of individual components cannot be reliably estimated because the same distribution of p-values can be produced by many mixture models with different weights.  The next section examines how BMM tries to estimate the percentage of false positives from the distribution of p-values.

A Bayesian Approach

Another difference between BMM and Zcurve is that BMM uses prior distributions, whereas Zcurve does not.  Whereas Zcurve makes no assumptions about the percentage of false positives, BMM uses a uniform distribution with values from 0 to 1 (100%) as a prior.  That is, it is equally likely that the percentage of false positives is 0%, 100%, or any value in between.  A uniform prior is typically justified as being agnostic; that is, no subjective assumptions bias the final estimate.

For the mean of the true positives, the authors use a truncated normal prior, which they also describe as a folded standard normal.  They justify this prior as reasonable based on extensive simulation studies.

Most important, however, is the parameter for the standard deviation.  The prior for this parameter was a uniform distribution with values between 0 and 1.   The authors argue that larger values would produce too many p-values close to 1.

“implausible prediction that p values near 1 are more common under H1 than under H0” (p. 1226).

But why would this be implausible?  If there are very few false positives and many true positives with low power, most p-values close to 1 would be the result of true positives (H1) rather than false positives (H0).
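
A rough back-of-the-envelope check of this point with assumed numbers: suppose 95% of studies test a true effect with roughly 10% power (a mean absolute z-score of about 0.68) and only 5% test a true null.

mu = 0.68    # mean z for roughly 10% power with alpha = .05 (two-tailed)
# P(two-sided p > .9) equals P(|z| < qnorm(.55)) for each component:
p.H1 = pnorm(qnorm(.55), mean = mu) - pnorm(qnorm(.45), mean = mu)   # about .08
p.H0 = .10   # p-values are uniform under H0
.95 * p.H1 / (.95 * p.H1 + .05 * p.H0)   # about .94: most p-values near 1 come from H1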

Thus, one way BMM is able to estimate the false discovery rate is by setting the standard deviation in a way that there is a limit to the number of low z-scores that are predicted by true positives (H1).

Although understanding priors and how they influence results is crucial for meaningful use of Bayesian statistics, the choice of priors is not crucial for Bayesian estimation models with many observations because the influence of the priors diminishes as the number of observations increases.  Thus, the ability of BMM to estimate the percentage of false positives in large samples cannot be explained by the use of priors. It is therefore still not clear how BMM can distinguish between false positives and true positives with low power.

Simulation Studies

The authors report several simulation studies that suggest BMM estimates are close and robust across many scenarios.

“The online supplemental material presents a set of simulation studies that highlight that the model is able to accurately estimate the quantities of interest under a relatively broad range of circumstances” (p. 1226).

The first set of simulations uses a sample size of N = 500 (n = 250 per condition).  Heterogeneity in effect sizes is simulated with a truncated normal distribution with a standard deviation of .10 (truncated at 2*SD) and effect sizes of d = .45, .30, and .15.  The lowest values are .35, .20, and .05.  With N = 500, these values correspond to  97%, 61%, and 8% power respectively.

d = c(.35, .20, .05); 1 - pt(qt(.975, 500 - 2), 500 - 2, d * sqrt(500) / 2)   # power = .97, .61, .08

The number of studies was k = 5,000 with half of the studies being false positives (H0) and half being true positives (H1).

Figure 1 shows the Zcurve plot for the simulation with high power (d = .45, power >  97%; median true power = 99.9%).

[Figure 1: Sim1.png]

The graph shows a bimodal distribution with clear evidence of truncation: the steep drop at z = 1.96 (p = .05, two-tailed) is inconsistent with an unselected sampling distribution of z-scores.  The sharp drop from z = 1.96 to 3 shows that many studies with non-significant results are missing.  The estimate of unconditional power (called replicability = expected success rate in exact replication studies) is 53%.  This estimate is consistent with the simulation of 50% of studies with a probability of success of 5% and 50% of studies with a success probability of 99.9% (.5 * .05 + .5 * .999 ≈ .52).

The values below the x-axis show average power for  specific z-scores. A z-score of 2 corresponds roughly to p = .05 and 50% power without selection for significance. Due to selection for significance, the average power is only 9%. Thus the observed power of 50% provides a much inflated estimate of replicability.  A z-score of 3.5 is needed to achieve significance with p < .05, although the nominal p-value for z = 3.5 is p = .0002.  Thus, selection for significance renders nominal p-values meaningless.

The sharp change in power from Z = 3 to Z = 3.5 is due to the extreme bimodal distribution.  While most Z-scores below 3 are from the sampling distribution of H0 (false positives), most Z-scores of 3.5 or higher come from H1 (true positives with high power).

Figure 2 shows the results for the simulation with d = .30.  The results are very similar because d = .30 still gives 92% power.  As a result, replicability is nearly as high as in the previous example.

[Figure 2: Sim2.png]

The most interesting scenario is the simulation with low powered true positives. Figure 3 shows the Zcurve for this scenario with an unconditional average power of only 23%.

[Figure 3: Sim3.png]

It is no longer possible to recognize two sampling distributions, and average power increases gradually from 18% for z = 2 to 35% for z = 3.5.  Even with this challenging scenario, BMM performed well and correctly estimated the percentage of false positives.   This is surprising because it is easy to generate a similar Zcurve without false positives.

Figure 4 shows a simulation with a mixture distribution, but the false positives (d = 0) have been replaced by true positives (d = .06), while the mean for the heterogeneous studies was reduced from d = .15 to d = .11.  These values were chosen to produce the same average unconditional power (replicability) of 23%.

[Figure 4: Sim4.png]

I transformed the z-scores into (two-sided) p-values and submitted them to the online BMM app at https://qfgronau.shinyapps.io/bmmsp/ .  I used only k = 1,500 p-values because the server timed me out several times with k = 5,000 p-values.  The estimated percentage of false positives was 24%, with a wide 95% credibility interval ranging from 0% to 48%.   These results suggest that BMM has problems distinguishing between false positives and true positives with low power.   BMM appears to be able to estimate the percentage of false positives correctly when most low z-scores are sampled from H0 (false positives). However, when these z-scores are due to studies with low power, BMM cannot distinguish between false positives and true positives with low power. As a result, the credibility interval is wide and the point estimates are misleading.

[Figure: BMM.output.png]

With k = 1,500 the influence of the priors is negligible.  However, with smaller sample sizes, the priors do have an influence on results and may lead to overestimation and misleading credibility intervals.  A simulation with k = 200 produced a point estimate of 34% false positives with a very wide CI ranging from 0% to 63%. The authors suggest a sensitivity analysis by changing model parameters. The most crucial parameter is the standard deviation.  Increasing the standard deviation to 2 increases the upper limit of the 95%CI to 75%.  Thus, without good justification for a specific standard deviation, the data provide very little information about the percentage of false positives underlying this Zcurve.

[Figure: BMM.k200.png]

For simulations with k = 100, the prior started to bias the results and the CI no longer included the true value of 0% false positives.

[Figure: BMM.k100]

In conclusion, these simulation results show that BMM promises more than it can deliver.  It is very difficult to distinguish p-values sampled from H0 (mean z = 0) and those sampled from H1 with weak evidence (e.g., mean z = 0.1).

In the Challenges and Limitations section, the authors pretty much agree with this assessment of BMM (Gronau et al., 2017, p. 1230).

The procedure does come with three important caveats.

First, estimating the parameters of the mixture model is an inherently difficult statistical problem. ..  and consequently a relatively large number of p values are required for the mixture model to provide informative results. 

A second caveat is that, even when a reasonable number of p values are available, a change in the parameter priors might bring about a noticeably different result.

The final caveat is that our approach uses a simple parametric form to account for the distribution of p values that stem from H1. Such simplicity comes with the risk of model-misspecification.

Practical Implications

Despite the limitations of BMM, the authors applied BMM to several real data.  The most interesting application selected focal hypothesis tests from social priming studies.  Social priming studies have come under attack as a research area with sloppy research methods as well as fraud (Stapel).  Bias tests show clear evidence that published results were obtained with questionable scientific practices (Schimmack, 2017a, 2017b).

The authors analyzed 159 social priming p-values.  The 95%CI for the percentage of false positives ranged from 48% to 88%.  When the standard deviation was increased to 2, the 95%CI increased slightly to 56% to 91%.  However, when the standard deviation was halved, the 95%CI ranged from only 10% to 75%.  These results confirm the authors’ warning that estimates in small sets of studies (k < 200) are highly sensitive to the specification of priors.

What inferences can be drawn from these results about the social priming literature?  A false positive percentage of 10% doesn’t sound so bad.  A false positive percentage of 88% sounds terrible. A priori, the percentage is somewhere between 0 and 100%. After looking at the data, uncertainty about the percentage of false positives in the social priming literature remains large.  Proponents will focus on the 10% estimate and critics will use the 88% estimate.  The data simply do not resolve inconsistent prior assumptions about the credibility of discoveries in social priming research.

In short, BMM promises that it can estimate the percentage of false positives in a set of studies, but in practice these estimates are too imprecise and too dependent on prior assumptions to be very useful.

A Zcurve of Social Priming Studies (k = 159)

It is instructive to compare the BMM results to a Zcurve analysis of the same data.

[Figure: SocialPriming.png]

The zcurve graph shows a steep drop and very few z-scores greater than 4, which tend to have a high success rate in actual replication attempts (OSC, 2015).  The average estimated replicability is only 27%.  This is consistent with the more limited analysis of social priming studies in Kahneman’s Thinking Fast and Slow book (Schimmack, 2017a).

More important than the point estimate is that the 95%CI ranges from 15% to a maximum of 39%.  Thus, even a sample size of 159 studies is sufficient to provide conclusive evidence that these published studies have a low probability of replicating even if it were possible to reproduce the exact conditions again.

These results show that it is not very useful to distinguish between false positives with a replicability of 5% and true positives with a replicability of 6%, 10%, or 15%.  Good research provides evidence that can be replicated at least with a reasonable degree of statistical power.  Tversky and Kahneman (1971) suggested a minimum of 50%, and most social priming studies fail to meet this minimal standard; hardly any studies seem to have been planned with the typical standard of 80% power.

The power estimates below the x-axis show that a nominal z-score of 4 or higher is required to achieve 50% average power and an actual false positive risk of 5%. Thus, after correcting for deceptive publication practices, most of the seemingly statistically significant results are actually not significant with the common criterion of a 5% risk of a false positive.

The difference between BMM and Zcurve is captured in the distinction between evidence of absence and absence of evidence.  BMM aims to provide evidence of absence (false positives). In contrast, Zcurve has the more modest goal of demonstrating absence (or presence) of evidence.  It is unknown whether any social priming studies could produce robust and replicable effects and under what conditions these effects occur or do not occur.  However, it is not possible to conclude from the poorly designed studies and the selectively reported results that social priming effects are zero.

Conclusion

Zcurve and BMM are both mixture models, but they take different statistical approaches and have different aims.  They also differ in their ability to provide useful estimates.  Zcurve is designed to estimate the average unconditional power to obtain significant results without distinguishing between true positives and false positives.  False positives reduce average power, just like low powered studies, and in reality it can be difficult or impossible to distinguish between a false positive with an effect size of zero and a true positive with an effect size that is negligibly different from zero.

The main problem of BMM is that it treats the nil-hypothesis as an important hypothesis that can be accepted or rejected.  However, this is a logical fallacy.  It is possible to reject implausible effect sizes (e.g., the nil-hypothesis is probably false if the 95%CI ranges from .8 to 1.2), but it is not possible to accept the nil-hypothesis because there are always values close to 0 that are also consistent with the data.

The problem of BMM is that it contrasts the point-nil-hypothesis with all other values, even if these values are very close to zero.  The same problem plagues the use of Bayes-Factors that compare the point-nil-hypothesis with all other values (Rouder et al., 2009).  A Bayes-Factor in favor of the point nil-hypothesis is often interpreted as if all the other effect sizes are inconsistent with the data.  However, this is a logical fallacy because data that are inconsistent with a specific H1 can be consistent with an alternative H1.  Thus, a BF in favor of H0 can only be interpreted as evidence against a specific H1, but never as evidence that the nil-hypothesis is true.

To conclude, I have argued that it is more important to estimate the replicability of published results than to estimate the percentage of false positives.  A literature with 100% true positives and average power of 10% is no more desirable than a literature with 50% false positives and 50% true positives with 20% power.  Ideally, researchers should conduct studies with 80% power, report their statistics honestly, and publish failed replications so that the false discovery rate is kept under control.  The Zcurve for social priming studies shows that priming researchers did not follow these basic and old principles of good science.  As a result, decades of research are worthless, and Kahneman was right to compare social priming research to a train wreck because the conductors ignored all warning signs.


An Even Better P-curve

It is my pleasure to post the first guest post on the R-Index blog.  The blog post is written by my colleague and partner in “crime”-detection, Jerry Brunner.  I hope we will see many more guest posts by Jerry in the future.

GUEST POST:

Jerry Brunner
Department of Statistical Sciences
University of Toronto


First, my thanks to the mysterious Dr. R for the opportunity to do this guest post. At issue are the estimates of population mean power produced by the online p-curve app. The current version is 4.06, available at http://www.p-curve.com/app4/pcurve4.php. As the p-curve team (Simmons, Nelson, and Simonsohn) observe in their blog post entitled “P-curve handles heterogeneity just fine” at http://datacolada.org/67, the app does well on average as long as there is not too much heterogeneity in power. They show in one of their examples that it can over-estimate mean power when there is substantial heterogeneity.

Heterogeneity in power is produced by heterogeneity in effect size and heterogeneity in sample size. In the simulations reported at http://datacolada.org/67, sample size varies over a fairly narrow range — as one might expect from a meta-analysis of small-sample studies. What if we wanted to estimate mean power for sets of studies with large heterogeneity in sample sizes or an entire discipline, or sub-areas, or journals, or psychology departments? Sample size would be much more variable.

This post gives an example in which the p-curve app consistently over-estimates population mean power under realistic heterogeneity in sample size. To demonstrate that heterogeneity in sample size alone is a problem for the online pcurve app, population effect size was held constant.

In 2016, Brunner and Schimmack developed an alternative p-curve method (p-curve 2.1), which performs much better than the online app p-curve 4.06. P-curve 2.1 is fully documented and evaluated in Brunner and Schimmack (2018). This is the most recent version of the notorious and often-rejected paper mentioned in https://replicationindex.wordpress.com/201/03/25/open-discussion-forum. It has been re-written once again, and submitted to Meta-psychology. It will shortly be posted during the open review process, but in the meantime I have put a copy on my website at http://www.utstat.toronto.edu/~brunner/papers/Zcurve6.7.pdf.

P-curve 2.1 is based on Simonsohn, Nelson and Simmons’ (2014) p-curve estimate of effect size. It is designed specifically for the situation where there is heterogeneity in sample size, but just a single fixed effect size. P-curve 2.1 is a simple, almost trivial application of p-curve 2.0. It first uses the p-curve 2.0 method to estimate a common effect size. It then combines that estimated effect size and the observed sample sizes to calculate an estimated power for each significance test in the sample. The sample mean of the estimated power values is the p-curve 2.1 estimate.
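
A minimal sketch of this two-step logic, assuming the common effect size has already been estimated with the p-curve effect-size method and that each result is a two-sample t-test with equal group sizes (the function name and arguments are illustrative, not the ones in the published code):

pcurve21.mean.power = function(d.hat, N, alpha = .05) {
  ncp = d.hat * sqrt(N) / 2                  # noncentrality for total sample size N
  crit = qt(1 - alpha / 2, df = N - 2)       # two-tailed critical t-value
  mean(1 - pt(crit, df = N - 2, ncp = ncp))  # mean of the estimated power values
}
# Hypothetical usage with an estimated effect size of d = .4:
# pcurve21.mean.power(d.hat = .4, N = c(40, 80, 120))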

One of the virtues of p-curve is that it allows for publication bias, using only significant test statistics as input. The population mean power being estimated is the mean power of the sub-population of tests that happened to be significant. To compare the performance of p-curve 4.06 to p-curve 2.1, I simulated samples of significant test statistics with a single effect size, and realistic heterogeneity in sample size.

Here’s how I arrived at the “realistic” sample sizes. In another project, Uli Schimmack had harvested a large number of t and F statistics from the journal Psychological Science, from the years 2001-2015. I used N = df + 2 to calculate implied total sample sizes. I then eliminated all sample sizes less than 20 and greater than 500, and randomly sampled 5,000 of the remaining numbers. These 5,000 numbers will be called the “Psychological Science urn.” They are available at http://www.utstat.toronto.edu/~brunner/data/power/PsychScience.urn3.txt, and can be read directly into R with the scan function.
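
The construction of the urn can be sketched as follows (df.harvested is an assumed placeholder for the harvested degrees of freedom; only the scan call reflects the actual published file):

# N = df.harvested + 2                     # implied total sample sizes
# N = N[N >= 20 & N <= 500]                # drop sample sizes below 20 and above 500
# ps.urn = sample(N, 5000)                 # the Psychological Science urn
# The finished urn can be read directly from the file linked above:
ps.urn = scan("http://www.utstat.toronto.edu/~brunner/data/power/PsychScience.urn3.txt")
hist(ps.urn)   # right-skewed, as in Figure 1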

The numbers in the Psychological Science urn are not exactly sample sizes and they are not a true random sample. In particular, truncating the distribution at 500 makes them less heterogeneous than real sample sizes, since web surveys with enormous sample sizes are eliminated. Still, I believe the numbers in the Psychological Science urn may be fairly reflective of the sample sizes in psychology journals. Certainly, they are better than anything I would be able to make up. Figure 1 shows a histogram, which is right skewed as one might expect.

[Figure 1]

By sampling with replacement from the Psychological Science urn, one could obtain a random sample of sample sizes, similar to sampling without replacement from a very large population of studies. However, that’s not what I did. Selection for significance tends to select larger sample sizes, because tests based on smaller sample sizes have lower power and so are less likely to be significant. The numbers in the Psychological Science urn come from studies that passed the filter of publication bias. It is the distribution of sample size after selection for significance that should match Figure 1.

To take care of this issue, I constructed a distribution of sample size before selection and chose an effect size that yielded (a) population mean power after selection equal to 0.50, and (b) a population distribution of sample size after selection that exactly matched the relative frequencies in the Psychological Science urn. The fixed effect size, in the metric of Cohen (1988, p. 216), was w = 0.108812. This is roughly Cohen’s “small” value of w = 0.10. If you have done any simulations involving literal selection for significance, you will realize that getting the numbers to come out just right by trial and error would be nearly impossible. I got the job done by using a theoretical result from Brunner and Schimmack (2018). Details are given at the end of this post, after the results.

I based the simulations on k=1,000 significant chi-squared tests with 5 degrees of freedom. This large value of k (the number of studies, or significance tests on which the estimates are based) means that estimates should be very accurate. To calculate the estimates for p-curve 4.06, it was easy enough to get R to write input suitable for pasting into the online app. For p-curve 2.1, I used the function heteroNpcurveCHI, part of a collection developed for the Brunner and Schimmack paper. The code for all the functions is available at http://www.utstat.toronto.edu/~brunner/Rfunctions/estimatR.txt. Within R, the functions can be defined with source("http://www.utstat.toronto.edu/~brunner/Rfunctions/estimatR.txt"). Then to see a list of functions, type functions() at the R prompt.

Recall that population mean power after selection is 0.50. The first time I ran the simulation, the p-curve 4.06 estimate was 0.64, with a 95% confidence interval from 0.61 to 0.66. The p-curve 2.1 estimate was 0.501. Was this a fluke? The results of five more independent runs are given in the table below. Again, the true value of mean power after selection for significance is 0.50.

P-curve 2.1 Estimate    P-curve 4.06 Estimate    P-curve 4.06 95% Confidence Interval
0.510                   0.64                     0.61 to 0.67
0.497                   0.62                     0.59 to 0.65
0.502                   0.62                     0.59 to 0.65
0.509                   0.64                     0.61 to 0.67
0.487                   0.61                     0.57 to 0.64

It is clear that the p-curve 4.06 estimates are consistently too high, while p-curve 2.1 is on the money. One could argue that an error of around twelve percentage points is not too bad (really?), but certainly an error of one percentage point is better. Also, eliminating sample sizes greater than 500 substantially reduced the heterogeneity in sample size. If I had left the huge sample sizes in, the p-curve 4.06 estimates would have been ridiculously high.

Why did p-curve 4.06 fail? The answer is that even with complete homogeneity in effect size, the Psychological Science urn was heterogeneous enough to produce substantial heterogeneity in power. Figure 2 is a histogram of the true (not estimated) power values.

[Figure 2: Histogram of true power after selection for significance]

Figure 2 shows that even under homogeneity in effect size, a sample size distribution matching the Psychological Science urn can produce substantial heterogeneity in power, with a mode near one even though the mean is 0.50. In this situation, p-curve 4.06 fails. P-curve 2.1 is clearly preferable, because it specifically allows for heterogeneity in sample size.

Of course p-curve 2.1 does assume homogeneity in effect size. What happens when effect size is heterogeneous too? The paper by Brunner and Schimmack (2018) contains a set of large-scale simulation studies comparing estimates of population mean power from p-curve, p-uniform, maximum likelihood and z-curve, a new method dreamed up by Schimmack. The p-uniform method is based on van Assen, van Aert, and Wicherts (2014), extended to power estimation as in p-curve 2.1. The p-curve method we consider in the paper is p-curve 2.1. It does okay as long as heterogeneity in effect size is modest. Other methods may be better, though. To summarize, maximum likelihood is most accurate when its assumptions about the distribution of effect size are satisfied or approximately satisfied. When effect size is heterogeneous and the assumptions of maximum likelihood are not satisfied, z-curve does best.

I would not presume to tell the p-curve team what to do, but I think they should replace p-curve 4.06 with something like p-curve 2.1. They are free to use my heteroNpcurveCHI and heteroNpcurveF functions if they wish. A reference to Brunner and Schimmack (2018) would be appreciated.

Details about the simulations

Before selection for significance, there is a bivariate distribution of sample size and effect size. This distribution is affected by the selection process, because tests with higher effect size or sample size (or especially, both) are more likely to be significant. The question is, exactly how does selection affect the joint distribution? The answer is in Brunner and Schimmack (2018). This paper is not just a set of simulation studies. It also has a set of “Principles” relating the population distribution of power before selection to its distribution after selection. The principles are actually theorems, but I did not want it to sound too mathematical. Anyway, Principle 6 says that to get the probability of a (sample size, effect size) pair after selection, take the probability before selection, multiply by the power calculated from that pair, and divide by the population mean power before selection.

In the setting we are considering here, there is just a single effect size, so it’s even simpler. The probability of a (sample size, effect size) pair is just the probability of the sample size. Also, we know the probability distribution of sample size after selection. It’s the relative frequencies of the Psychological Science urn. Solving for the probability of sample size before selection yields this rule: the probability of sample size before selection equals the probability of sample size after selection, divided by the power for that sample size, and multiplied by population mean power before selection.
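In symbols (this just restates Principle 6 and its rearrangement; P denotes the probability of sample size n in the respective distribution, and the effect size is held fixed):

P.after(n) = P.before(n) * power(n) / E[power before selection]

P.before(n) = P.after(n) / power(n) * E[power before selection]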

This formula will work for any fixed effect size. That is, for any fixed effect size, there is a probability distribution of sample size before selection that makes the distribution of sample size after selection exactly match the Psychological Science frequencies in Figure 1. Effect size can be anything. So, choose the effect size that makes expected (that is, population mean) power after selection equal to some nice value like 0.50.

Here’s the R code. First, we read the Psychological Science urn and make a table of probabilities.

rm(list=ls())
options(scipen=999) # To avoid scientific notation
source("http://www.utstat.toronto.edu/~brunner/Rfunctions/estimatR.txt"); functions()
PsychScience = scan("http://www.utstat.toronto.edu/~brunner/data/power/PsychScience.urn3.txt")
hist(PsychScience, xlab='Sample size',breaks=100, main = 'Figure 1: The Psychological Science Urn')

# A handier urn, for some purposes
nvals = sort(unique(PsychScience)) # There are 397 unique values among the 5,000 in the urn
nprobs = table(PsychScience)/sum(table(PsychScience))
# sum(nvals*nprobs) = 81.8606 = mean(PsychScience)

For any given effect size, the frequencies from the Psychological Science urn can be used to calculate expected power after selection. Minimizing the (squared) difference between this value and the desired mean power yields the required effect size.

# Minimize this function to find effect size giving desired power
# after selection for significance.
fun = function(es,wantpow,dfreedom)
    {
    alpha = 0.05; cv=qchisq(1-alpha,dfreedom)
    epow = sum( (1-pchisq(cv,df=dfreedom,ncp=nvals*es))*nprobs )
    # cat("es = ",es," Expected power = ",epow,"\n")
    (epow-wantpow)^2
    } # End of all the fun

# Find needed effect size for chi-square with df=5 and desired
# population mean power AFTER selection.
popmeanpower = 0.5 # Change this value if you wish
EffectSize = nlminb(start=0.01, objective=fun,lower=0,df=5,wantpow=popmeanpower)$par
EffectSize # 0.108812

Calculate the probability distribution of sample size before selection.

# The distribution of sample size before selection is proportional to the
# distribution after selection divided by power, term by term.
crit = qchisq(0.95,5)
powvals = 1-pchisq(crit,5,ncp=nvals*EffectSize)
Pn = nprobs/powvals
EG = 1/sum(Pn)
cat("Expected power before selection = ",EG,"\n")
Pn = Pn*EG # Probability distribution of n before selection

Generate test statistics before selection.

nsim = 50000 # Initial number of simulated statistics. This is over-kill. Change the value if you wish.
set.seed(4444)

# For repeated simulations, execute the rest of the code repeatedly.
nbefore = sample(nvals,size=nsim,replace=TRUE,prob=Pn)
ncpbefore = nbefore*EffectSize
powbefore = 1-pchisq(crit,5,ncp=ncpbefore)
Ybefore = rchisq(nsim,5,ncp=ncpbefore)

Select for significance.

sigY = Ybefore[Ybefore>crit]
sigN = nbefore[Ybefore>crit]
sigPOW = 1-pchisq(crit,5,ncp=sigN*EffectSize)
hist(sigPOW, xlab='Power',breaks=100,freq=F ,main = 'Figure 2: Power After Selection for Significance')

Estimate mean power both ways.

# Two estimates of expected power before selection
c( length(sigY)/nsim , mean(powbefore) )
c(popmeanpower, mean(sigPOW)) # Golden: target vs. simulated mean power after selection
length(sigY) # Number of significant results available

k = 1000 # Select 1,000 significant results.
Y = sigY[1:k]; n = sigN[1:k]; TruePower = sigPOW[1:k]

# Estimate with p-curve 2.1
heteroNpcurveCHI(Y=Y,dfree=5,nn=n) # 0.5058606 the first time.

# Write out chi-squared statistics for pasting into the online app
for(j in 1:k) cat("chi2(5) =",Y[j],"\n")

References

Brunner, J. and Schimmack, U. (2018). Estimating population mean power under conditions of heterogeneity and selection for significance. Under review. Available at http://www.utstat.toronto.edu/~brunner/papers/Zcurve6.7.pdf.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Edition), Hillsdale, New Jersey: Erlbaum.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve and effect size: correcting for publication bias using only significant results. Perspectives on Psychological Science, 9, 666-681.

van Assen, M. A. L. M., van Aert, R. C. M., & Wicherts, J. M. (2014). Meta-analysis using effect size distributions of only statistically significant studies. Psychological Methods, 20, 293-309.

 

Lies, Damn Lies, and Abnormal Psychological Science (APS)

Everybody knows the saying “Lies, damn lies, and statistics.” But it is not the statistics; it is the ab/users of statistics who distort the truth. The Association for Psychological Science (APS) is trying to hide the truth that experimental psychologists are not using scientific methods in the way they are supposed to be used. These abnormal practices are known as questionable research practices (QRPs). Surveys show that researchers are aware that these practices have negative consequences, but they also show that these practices are being used because they can advance researchers' careers (John et al., 2012). Before 2011, it was also no secret that these practices were used, and psychologists might even brag about using QRPs to get results (“it took me 20 attempts to find this significant result”).

However, some scandals in social psychology (Stapel, Bem) changed the perception of these practices.  Hiding studies, removing outliers selectively, or not disclosing dependent variables that failed to show the predicted result was no longer something anybody would admit doing in public (except a few people who paid dearly for it; e.g. Wansink).

Unfortunately for abnormal psychological scientists, some researchers, including myself, have developed statistical methods that can reveal the use of questionable research practices, and applications of these methods show the use of QRPs in numerous articles (Greg Francis; Schimmack, 2012). Francis (2014) showed that 80% or more of articles in the flagship journal of APS used QRPs to report successful studies. He was actually invited by the editor of Psychological Science to audit the journal, but when he submitted the results of his audit for publication, the manuscript was rejected. Apparently, it was not significant enough to tell readers of Psychological Science that most of the published articles in Psychological Science are based on abnormal psychological science. Fortunately, the results were published in another peer-reviewed journal.

Another major embarrassment for APS was the result of a major replication project of studies published in Psychological Science, the main APS journal, as well as two APA (American Psychological Association) journals (Open Science Collaboration, 2015). The results showed that only 36% of significant results in original articles could be replicated. The “success rate” for social psychology was even lower, at 25%. The main responses to this stunning failure rate have been attempts to discredit the replication studies or to normalize replication failures as a normal outcome of science.

In several blog posts and manuscripts I have pointed out that the failure rate of social psychology is not the result of normal science. Instead, replication failures are the result of abnormal scientific practices where researchers use QRPs to produce significant results. My colleague Jerry Brunner developed a statistical method, z-curve, that reveals this fact. We have tried to publish our statistical method in an APA journal (Psychological Methods) and the APS journal, Perspectives on Psychological Science, where it was desk-rejected by Sternberg, who needed journal space to publish his own editorials [he resigned after a revolt from APS members, including former editor Bobbie Spellman].

Each time our manuscript was rejected without any criticism of our statistical method. The reason was that it was not interesting to estimate the replicability of psychological science. This argument makes little sense because the OSC reproducibility article from 2015 has already been cited over 500 times in peer-reviewed journals (Web of Science).

The argument that our work is not interesting is further undermined by a recent article published in the new APS journal Advances in Methods and Practices in Psychological Science with the title “The Prior Odds of Testing a True Effect in Cognitive and Social Psychology.” The article was accepted by the main editor Daniel J. Simons, who also rejected our article as irrelevant (see rejection letter). Ironically, the article presents very similar analyses of the OSC data and required a method that could estimate average power, but the authors used an ad-hoc approach to do so. The article even cites our pre-print, but the authors did not contact us or run the R-code that we shared to estimate average power. This behavior would be like eyeballing a scatter plot rather than using a formula to quantify the correlation between two variables. It is contradictory to claim that our method is not useful and then accept a paper that could have benefited from using our method.

Why would an editor reject a paper that provides an estimation method for a parameter that an accepted paper needs to estimate?

One possible explanation is that the accepted article normalizes replication failures, while we showed that these replication failures are at least partially explained by QRPs.  First evidence for the normalization of abnormal science is that the article does not cite Francis (2014) or Schimmack (2012) or John et al.’s (2012) survey about questionable research practices.  The article also does not mention Sterling’s work on abnormally high success rates in psychology journals (Sterling, 1959; Sterling et al., 1995). It does not mention Simmons, Nelson, and Simonsohn’s (2011) False-Positive Psychology article that discussed the harmful consequences of abnormal psychological science.  The article simply never mentions the term questionable research practices. Nor does it mention the “replication crisis” although it mentions that the OSC project replicated only 25% of findings in social psychology.  Apparently, this is neither abnormal nor symptomatic of a crisis, but just how good social psychological science works.

So, how does this article explain the low replicability of social psychology as normal science? The authors point out that replicability is a function of the percentage of true null-hypotheses that are being tested. As researchers conduct empirical studies to find out which predictions are true and which are not, it is normal science to sometimes predict effects that do not exist (true null-hypotheses), and inferential statistics will sometimes lead to the wrong conclusion (type-I errors / false positives). It is therefore unavoidable that empirical scientists will sometimes make mistakes.

The question is how often they make these mistakes and how they correct them.  How many false-positives end up in the literature depends on several other factors, including (a) the percentage of null-hypothesis that are being tested and (b) questionable research practices.

The key argument in the article is that social psychologists are risk-takers and test many false hypotheses. As a result, they end up finding many false positive results. Replication studies are needed to show which findings are true and which findings are false. So, doing risky exploratory studies followed by replication studies is good science. In contrast, cognitive psychologists are not risk-takers and test hypotheses that have a high probability of being true. Thus, they have fewer false positives, but that doesn't mean they are better scientists or social psychologists are worse scientists. In the happy place of APS journals, all psychological scientists are good scientists.

Conceivably, social psychologists place higher value on surprising findings—that is, findings that reflect a departure from what is already known—than cognitive  psychologists do.

There is only one problem with this happy story of psychological scientists working hard to find the truth using the best possible scientific methods.  It is not true.

How Many Point-Nil-Hypotheses Are True?

How often is the null-hypothesis true? To answer this question it is important to define the null-hypothesis. A null-hypothesis can be any point or a range of effect sizes. However, psychologists often wrongly use the term null-hypothesis to refer to the point-nil-hypothesis (cf. Cohen, 1994) that there is absolutely no effect (e.g., the effect of studying for a test AFTER the test has already happened; Bem, 2011). We can then distinguish two sets of studies: studies with an effect of any magnitude and studies without an effect.

The authors argue correctly that testing many null-effects will result in more false positives and lower replicability.  This is easy to see, if all significant results are false positives (Bem, 2011).  The probability that any single replication study produces a significant result is simply alpha (5%) and for a set of studies only 5% of studies are expected to produce a significant result. This is the worst case scenario (Rosenthal, 1979; Sterling et al., 1995).

Importantly, this does not only apply to replication studies. It also applies to original studies. If all studies have a true effect size of zero, only 5% of studies should produce a significant result. However, it is well known that the success rate in psychology journals is above 90% (Sterling, 1959; Sterling et al., 1995). Thus, it is not clear how social psychologists can test many risky hypotheses that are often false and still report over 90% successes in their journals or even within a single article (Schimmack, 2012). The only way to achieve this high success rate while most hypotheses are false is to report only successful studies (like a gambling addict who only counts wins and ignores losses; Rosenthal, 1979) or to make up hypotheses after randomly finding a significant result (Kerr, 1998).

To my knowledge, Sterling et al. (1995) were the first to relate the expected failure rate (without QRPs) to alpha, power, and the percentage of studies with and without an effect.

[Quotation from Sterling et al. (1995)]

Sterling et al.'s work implies that we should never have expected 100% of the original studies in the Open Science Collaboration project to report significant results; the 25% success rate in the replication studies is shockingly low, but at least it is more believable than the 100% success rate of the original studies. The article mentions neither Sterling's statistical contribution nor its implication for the expected success rate in original studies.

The main aim of the authors is to separate the effects of power and the proportion of studies without effects on the success rate; that is the percentage of studies with significant results.

For example, a 25% success rate for social psychology could be produced by 25% of studies with 85% power and 75% of studies without an effect (each with a 5% chance of producing a significant result), or by 100% of studies with an effect and an average power of 25%, or by any other mixture with the percentage of studies with an effect somewhere between 25% and 100%.
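A quick check in R confirms that both of these mixtures produce the same 25% success rate:

# Two mixtures that both produce a 25% rate of significant results (numbers from the example above).
.25*.85 + .75*.05   # 25% of studies with 85% power, 75% without an effect: 0.25
1.00*.25            # 100% of studies with an effect and 25% average power: 0.25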

As pointed out by Brunner and Schimmack (2017), it is impossible to obtain a precise estimate of this percentage because different mixtures of studies can produce the same success rate. I was therefore surprised when the abstract claimed that “we found that R was lower for the social-psychology studies than for the cognitive-psychology studies.” How were the authors able to quantify and compare the proportions of studies with an effect in social psychology versus cognitive psychology? The answer is provided in the following quote.

Using binary logic for the time being, we assume that the observed proportion of studies yielding effects in the same direction as originally observed, ω, is equal to the proportion of true effects, PPV, plus half of the remaining 1 – PPV noneffects, which would be expected to yield effects in the same direction as originally observed 50% of the time by chance.

To clarify, a null-result is equally likely to produce a positive or a negative effect size by chance. A sign reversal in a replication study is used to infer that the original result was a false positive. However, these sign reversals are only half of the false positives because random chance is equally likely to produce the same sign (heads-tails is just as probable as heads-heads). Using this logic, the percentage of sign reversals times two is an estimate of the percentage of false positives in the original studies.

Based on the finding that 25.5% of social replication studies showed a sign reversal, the authors conclude that 51% of the original significant results were false positives.  This would imply that every other significant result that is published in social psychology journals is a false positive.

One problem with this approach is that sign reversals can also occur for true positive studies with low power (Gelman & Carlin, 2014).  Thus, the percentage of sign reversals is at best a rough estimate of false positive results.

However, low power can be the result of small effect sizes and many of these effect sizes might be so close to zero that they can be considered false positives if the null-hypothesis is defined as a range of effect sizes close to zero.

So, I will just use the authors' estimate of about 50% false positive results as a reasonable estimate of the percentage of false positive results that are reported in social psychology journals.

Are Social Psychologists Testing Riskier Hypotheses? 

The authors claim that social psychologists have more false positive results than cognitive psychologists because they test more false hypotheses. That is, they are risk takers:

Maybe watching a Beyoncé video reduces implicit bias? Let’s try it (with n = 20 per cell in a between-subject design).  It doesn’t and the study produced a non-significant result.  Let’s try something else.  After trying many other manipulations, finally a significant result is observed and published.  Unfortunately, this manipulation also had no effect and the published result is a false positive.  Another researcher replicates the study and obtains a significant result with a sign reversal. The original result gets corrected and the search for a true effect continues.

R = alpha*PPV / ((1 - beta)*(1 - PPV))

To make claims about the ratio of studies with effects and studies without effects (or negligible effects) that are being tested, the authors use the formula shown above. Here the ratio (R) of studies with an effect over studies without an effect is a function of alpha (the criterion for significance), beta (the type-II error probability), and PPV, the positive predictive value, which is simply the percentage of true positives among the significant results in the published literature.

As noted before, the PPV for social psychology was estimated to be 49%. This leaves two unknowns that are needed to make claims about R: alpha and beta. The authors' approach to estimating alpha and beta is questionable and undermines their main conclusion.

Estimating Alpha

The authors use the nominal alpha level as the probability that a study without a real effect produces a false positive result.

Social and cognitive psychology generally follow the same convention for their alpha level (i.e., p < .05), so the difference in that variable likely does not explain the difference in PPV. 

However, this is a highly questionable assumption when researchers use questionable research practices. As Simmons et al. (2011) demonstrated, p-hacking can be used to bypass the significance filter, and the risk of reporting a false positive result with a nominal alpha of 5% can be over 50%. That is, the actual risk of reporting a false positive result is not 5% as stated, but much higher. This has clear implications for the presence of false positive results in the literature. While it would take, on average, 20 tests of false hypotheses to observe a single false positive result with a real significance filter of 5%, p-hacking can make every second test of a false hypothesis appear significant. Thus, massive p-hacking could explain a high percentage of false positive results in social psychology just as well as honest testing of risky hypotheses.
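As a toy illustration (my own example, not the simulations reported by Simmons et al.): if any one of five independent dependent variables is allowed to count as a success, the de facto alpha is already about 23% rather than 5%.

# Toy illustration: de facto alpha when any of five independent DVs counts as a "success".
1 - 0.95^5   # 0.2262, the analytic value
# The same value by brute-force simulation of studies without any true effect:
set.seed(1)
onestudy = function() any(replicate(5, t.test(rnorm(20), rnorm(20))$p.value) < .05)
mean(replicate(5000, onestudy()))   # roughly 0.22 to 0.23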

The authors simply ignore this possibility when they use the nominal alpha level as the factual probability of a false positive result and neither the reviewers nor the editor seemed to realize that p-hacking could explain replication failures.

Is there any evidence that p-hacking rather than risk-taking explains the results? Indeed, there is lots of evidence.  As I pointed out in 2012,  it is easy to see that social psychologists are using QRPs because they typically report multiple conceptual replication studies in a single article. Many of the studies in the replication project were selected from multiple study articles.  A multiple study article essentially lowers alpha from .05 in a single study to .05 raised to the power of the number of studies. Even with just two studies, the risk of repeating a false positive result is just .05^2 = .0025.  And none of these multiple study articles report replication failures, even if the tested hypothesis is ridiculous (Bem, 2011).  There are only two explanation for the high success rate in social psychology.  Either they are testing true hypothesis and the estimate of 50% false positive results is wrong or they are using p-hacking and the risk of a false positive results in a single study is greater than the nominal alpha.  Either explanation invalidates the authors conclusions about R. Either their PPV estimates are wrong or their assumptions about the real alpha criterion are wrong.

Estimating Beta

Beta or the type-II error is the risk of obtaining a non-significant result when an effect exists.  Power is the complementary probability of getting a significant result when an effect is present (a true positive result).  The authors note that social psychologists might end up with more false positive results because they conduct studies with lower power.

To illustrate, imagine that social psychologists run 100 studies with an average power of 50% and 250 studies without an effect, and that due to QRPs 20% of the studies without an effect produce a significant result at a nominal alpha of p < .05. In this case, there are 50 true positive results (100 * .50 = 50) and 50 false positive results (250 * .20 = 50). In contrast, cognitive psychologists conduct studies with 80% power, while everything else is the same. In this case, there would be 80 true positive results (100 * .8 = 80) and also 50 false positive results. The percentage of false positives would be 50% for social psychology, but only 50/(50+80) = 38% for cognitive psychology. In this example, R and alpha are held constant, but the PPVs differ simply as a function of power. If we assume that cognitive psychologists use less severe p-hacking, there could be even fewer false positives (250 * .10 = 25), and the percentage of false positives for cognitive psychology would be only 24% [the actual estimate in the article is 19%].
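For readers who want to check the bookkeeping, here is the arithmetic of this hypothetical example in R (the 20% and 10% significance rates for studies without an effect are assumed values, not estimates):

# Arithmetic of the example above (all inputs are the assumed values from the text).
TP.soc = 100*.50;  FP.soc = 250*.20      # 50 true positives, 50 false positives (social)
TP.cog = 100*.80;  FP.cog = 250*.20      # 80 true positives, 50 false positives (cognitive)
FP.soc/(FP.soc + TP.soc)                 # 0.50  share of false positives, social
FP.cog/(FP.cog + TP.cog)                 # 0.38  share of false positives, cognitive
FP.mild = 250*.10                        # milder p-hacking: 25 false positives
FP.mild/(FP.mild + TP.cog)               # 0.24  share of false positives, cognitive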

Thus, to make claims about differences between social psychologists and cognitive psychologists, it is necessary to estimate beta or power (1 – beta) and because power varies across the diverse studies in the OSC project, they have to estimate average power.  Moreover, because only significant studies are relevant, they need to estimate the average power after selection for significance.  The problem is that there exists no published peer-reviewed method to do this.  The reason why no published peer-reviewed method exists is that editors have rejected our manuscripts that have evaluated four different methods of estimating average power after selection for significance and shown that z-curve is the best method.

How do the authors estimate average power after selection for significance without z-curve? They use p-curve plots and visual inspection of the plots against simulations of data with fixed power to obtain rough estimates of 50% average power for social psychology and 80% average power for cognitive psychology.

It is interesting that the authors used p-curve plots, but did not use the p-curve online app to estimate average power. The online p-curve app also provides power estimates. However, as we pointed out in the rejected manuscript, this method can severely overestimate average power. In fact, when the online p-curve app is used, it produces estimates of 96% average power for social psychology and 98% for cognitive psychology. These estimates are implausible, and this is the reason why the authors created their own ad-hoc method of power estimation rather than using the estimates provided by the p-curve app.

We used the p-curve app and also got really high power estimates that seemed implausible, so we used ballpark estimates from the Simonsohn et al. (2014) paper instead (Brent Wilson, email communication, May 7, 2018). 

 

[p-curve plots from the article]

Based on their visual inspection of the graphs, they conclude that the average power in social psychology is about 50% and the average power in cognitive psychology is about 80%.

Putting it all together 

After estimating PPV, alpha, and beta in the way described above, the authors used the formula to estimate R.

If we set PPV to .49, alpha to .05, and 1 – β (i.e., power) to .50 for the social-psychology studies and we set the corresponding values to .81, .05, and .80 for the cognitive-psychology studies, Equation 2 shows that R is .10 (odds = 1 to 10) for social psychology and .27 (odds = 1 to ~4) for cognitive psychology.

Now the authors make another mistake.  The power estimate obtained from p-curve applies to ALL p-values, including the false positive ones.  Of course, the average estimate of power is lower for a set of studies that contains more false positive results.

To end up with 50% average power with 50% false positive results,  the power of the studies that are not false positives can be computed with the following formula.

Avg.Power = FP*alpha + TP*power   <=>  power = (Avg.Power – FP*alpha)/TP

With 49% true positives (TP), 51% false positives (FP), alpha = .05, and average power = .50 for social psychology, the estimated average power of studies with an effect is 97%.

alpha = .05; avg.power = .50; TP = .49; FP = 1-TP;  (avg.power – FP*alpha)/TP

With 81% true positives and 80% average power for cognitive psychology, the estimated average power of studies with an effect in cognitive psychology  is 98%.

Thus, there is actually no difference in power between social and cognitive psychology because the percentage of false positive results alone explains the differences in the estimates of average power for all studies.

R = alpha*PPV / ((1 - beta)*(1 - PPV))

alpha = .05; PPV = .49; power = .97; alpha*PPV/(power * (1-PPV))  # R = .05 for social psychology
alpha = .05; PPV = .81; power = .98; alpha*PPV/(power * (1-PPV))  # R = .22 for cognitive psychology

With these correct estimates of power for studies with true effects, the estimate for social psychology is .05 and the estimate for cognitive psychology is .22. This means that social psychologists test 20 false hypotheses for every true hypothesis, while cognitive psychologists test 4.55 false hypotheses for every true hypothesis, assuming the authors' assumptions are correct.

Conclusion

The authors make some questionable assumptions and some errors to arrive at the conclusion that social psychologists are conducting many studies with no real effect. All of these studies are run with a high level of power. When a non-significant result is obtained, they discard the hypothesis and move on to testing another one. The significance filter keeps most of the false hypotheses out of the literature, but because there are so many false hypotheses, 50% of the published results end up being false positives. Unfortunately, social psychologists failed to conduct actual replication studies, and a large pile of false positive results accumulated in the literature until social psychologists realized in 2011 that they needed to replicate findings.

Although this is not really a flattering description of social psychology, the truth is worse.  Social psychologists have been replicating findings for a long time. However, they never reported studies that failed to replicate earlier findings and when possible they used statistical tricks to produce empirical findings that supported their conclusions with a nominal error rate of 5%, while the true error rate was much higher.  Only scandals in 2011 led to honest reporting of replication failures. However, these replication studies were conducted by independent investigators, while researchers with influential theories tried to discredit these replication failures.  Nobody is willing to admit that abnormal scientific practices may explain why many famous findings in social psychology textbooks were suddenly no longer replicable after 2011, especially when hypotheses and research protocols were preregistered and prevented the use of questionable research practices.

Ultimately, the truth will be published in peer-reviewed journals. APS does not control all journals.  When the truth becomes apparent,  APS will look bad because it did nothing to enforce normal scientific practices and it will look worse because it tried to cover up the truth.  Thank god , former APS president Susan Fiske reminded her colleagues that real scientists should welcome humiliation when their mistakes come to light because the self-correcting forces of science are more important than researchers feelings. So far, APS leaders seem to prefer repressive coping over open acknowledgment of past mistakes. I wonder what the most famous psychologists of all times would have to say about this.

Estimating Reproducibility of Psychology (No. 52): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an ideographic (study-centered) perspective.

The conclusions of these ideographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The replication crisis has split psychologists and disrupted social networks.  I respected Jerry Clore as an emotion researcher when I started my career in emotion research.  His work on appraisal theories of emotions made an important contribution and influenced my thinking about emotions.  I enjoyed participating in Jerry’s lab meetings when I was a post-doctoral student of Ed Diener at Illinois.  However, I was never a big fan of Jerry’s most famous article on the effect of mood on life-satisfaction judgments.  Working with Ed Diener convinced me that life-satisfaction judgments are more stable and more strongly based on chronically accessible information than the mood as information model suggested (Anusic & Schimmack, 2016; Schimmack & Oishi, 2005).  Nevertheless, I had a positive relationship with Jerry and I am grateful that he wrote recommendation letters for me when I was on the job market.

When researchers started doing replication studies after 2011, some of Jerry’s articles failed to replicate, and one reason for these replication failures is that the original studies used questionable research practices.  Importantly, nobody considered these practices unethical and it was not a secret that these methods were used. Books even taught students that the use of these practices is good science.  The problem is that Jerry didn’t acknowledge that questionable practices could at least partially explain replication failures.  Maybe he did it to protect students like Simone Schnall. Maybe he had other reasons.  Personally, I was disappointed by this response to replication failures, but I guess that is life.

Summary of Original Article


In five studies, the authors crossed the priming of happy and sad concepts with affective experiences. In all studies, the expected interaction was significant. Coherence between affective concepts and affective experiences led to better recall of a story than in the incoherent conditions.

Study 1

56 students were assigned to six conditions (n ~ 10) of a 2 x 3 design. Three priming conditions with a scrambled sentence task were crossed with a manipulation of flexing or extending one arm. This manipulation is supposed to create an approach or avoidance motivation (Cacioppo et al., 1993).  The expected interaction was significant, F(2, 50) = 3.50, p = .038.

Study 2

75 students participated in Study 2, which was a replication study with two changes:  the arm position manipulation was paired with the priming task and half the participants rated their mood before the measurement of the DV.  The ANOVA result was marginally significant; F(2, 69) = 2.81, p = .067.

Study 3

58 students used the same priming procedure, but used music as a mood manipulation.  The neutral priming condition was dropped (n ~ 15 per cell).  The interaction effect was marginally significant, F(1, 54) = 3.48, p = .068.

Study 4

132 students participated in Study 4.  The study changed the priming task to a subliminal priming manipulation (although the 60ms presentation time may not be fully subliminal).  Affect was manipulated by asking participants to hold a happy or sad facial expression.  The interaction was significant, F(1, 128) = 3.97, p = .048.

Study 5 

133 students participated in Study 5.  Study 5 combined the scrambled sentence priming manipulation from Studies 1-3 with the facial expression manipulation from Study 4.  The interaction effect was significant, F(1, 129) = 5.21, p = .024.

Replicability Analysis

Although all five studies showed support for the predicted two-way interaction, the p-values in the five studies are surprisingly similar (ps = .038, .067, .068, .048, .025). The probability of such small variability or even less variability in p-values is p = .002 (TIVA).  This suggests that QRPs were used to produce (marginally) significant results in five studies with low power (Schimmack, 2012).
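For readers who want to reproduce this number, here is a minimal version of the TIVA calculation as I understand it: the two-tailed p-values are converted to z-scores, and the observed variance of the z-scores is compared with the expected variance of 1 using a left-tailed chi-square test.

# TIVA sketch for the five p-values reported above.
p = c(.038, .067, .068, .048, .025)
z = qnorm(1 - p/2)                  # convert two-tailed p-values to z-scores
k = length(p)
chi2 = (k - 1)*var(z)/1             # observed variance relative to the expected variance of 1
pchisq(chi2, df = k - 1)            # left-tailed p, approximately .002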

A small set of studies provides limited information about QRPs.  It is helpful to look at these p-values in the context of other results reported in articles with Jerry Clore as co-author.

[Plot of test statistics from articles co-authored by Jerry Clore]

The plot shows a large file-drawer (missing studies with non-significant results) that is produced by a large number of just significant results.  Either many studies were run to obtain a just significant result or other QRPs were used.  This analysis supports the conclusion that QRPs contributed to the reported results in the original article.

Replication Study

The replication project attempted a replication of Study 5. However, the authors did not pick the 2 x 2 interaction as the target finding. Instead, they used the finding in a “repeated measures ANOVA with condition (coherent vs. incoherent) and story prompt (tree vs. house. vs. car) produced a significant linear trend for the interaction of Condition X Story, F(1, 131), 5.79, p < .02, η2 = .04” (Centerbar et al., 2008, p. 573). The replication study did not find this trend, F(2, 110) = .759, p = .471. However, the difference in degrees of freedom shows that the replication analysis had less power because it did not test the linear contrast. Moreover, the replication report states that the replication study showed a trend regarding the main effect of affective coherence on the percentage of causal words used, F(1, 111) = 3.172, p = .078. This makes it difficult to evaluate whether the replication study was really a failure.

I used the posted data to test the interaction for the total number of words produced. It was not significant, F(1,126) = 0.602, p = .439.

In conclusion, the reported significant interaction failed to replicate.

Conclusion

The replication study of this 2 x 2 between-subject social psychology experiment failed to replicate the original result. Bias tests suggest that the replication failure was at least partially caused by the use of questionable research practices in the original study.


Confused about Effect Sizes? Read more Cohen (and less Loken & Gelman)

*** Background. The Loken and Gelman article “Measurement Error and the Replication Crisis” created a lot of controversy in the Psychological Methods Discussion Group. I believe the article is confusing and potentially misleading. For example, the authors do not clearly distinguish between unstandardized and standardized effect size measures, although random measurement error has different consequences for one than for the other. I think a blog post by Gelman makes clear what the true purpose of the article is.

We talked about why traditional statistics are often counterproductive to research in the human sciences.

This explains why the article tries to construct one more fallacy in the use of traditional statistics, but fails to point out a simple solution to avoid this fallacy.  Moreover, I argue in the blog post that Loken and Gelman committed several fallacies on their own in an attempt to discredit t-values and significance testing.

I asked Gelman to clarify several statements that made no sense to me. 

 “It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high”  (Loken and Gelman)

Ulrich Schimmack
Would you say that there is no meaningful difference between a z-score of 2 and a z-score of 4? These z-scores are significantly different from each other. Why would we not say that a study with a z-score of 4 provides stronger evidence for an effect than a study with a z-score of 2?

  • Andrew says:

    Ulrich:

    Sure, fair enough. The z-score provides some information. I guess I’d just say it provides less information than people think.

 

I believe that the article contains many more statements that are misleading and do not inform readers how t-values and significance testing works.  Maybe the article is not as bad as I think it is, but I am pretty sure that it provides less information than people think.

In contrast, Jacob Cohen has  provided clear and instructive recommendations for psychologists to improve their science.  If psychologists had listened to him, we wouldn’t have a replication crisis.

The main points to realize about random measurement error and replicability are:

1.  Neither population nor sample mean differences (or covariances) are effect sizes. They are statistics that provide some information about effects and the magnitude of effects. The main problem in psychology has been the interpretation of mean differences in small samples as “observed effect sizes.” Effects cannot be observed.

2.  Point estimates of effect sizes vary from sample to sample. It is incorrect to interpret a point estimate as information about the size of an effect in a sample or a population. To avoid this problem, researchers should always report a confidence interval of plausible effect sizes. In small samples with just significant results these intervals are wide and often close to zero. Thus, no researcher should interpret a moderate to large point estimate when effect sizes close to zero are also consistent with the data.

3.  Random measurement error creates more uncertainty about effect sizes. It has no systematic effect on unstandardized effect sizes, but it systematically lowers standardized effect sizes (correlations, Cohen's d, the amount of explained variance).

4.  Selection for significance inflates standardized and unstandardized effect size estimates.  Replication studies may fail if original studies were selected for significance, depending on the amount of bias introduced by selection for significance (this is essentially regression to the mean).

5. As random measurement error attenuates standardized effect sizes,  selection for significance partially corrects for this attenuation.  Applying a correction formula (Spearman) to estimates after selection for significance would produce even more inflated effect size estimates.

6.  The main cause of the replication crisis is undisclosed selection for significance.  Random measurement error has nothing to do with the replication crisis because random measurement error has the same effect on original and replication studies. Thus, it cannot explain why an original study was significant and a replication study failed to be significant.

Questionable Claims in Loken and Gelman’s  Backpack article. 

If you learned that a friend had run a mile in 5 minutes, you would be respectful; if you learned she had done it while carrying a heavy backpack, you would be awed. The obvious inference is that she would have been even faster without the backpack.

This makes sense. We assume that our friend's ability is relatively fixed, that everybody is slower with a heavy backpack, that the distance is really a mile, that the clock was working properly, and that no magic potion or tricks are involved. As a result, we expect very little variability in our friend's performance and an even faster time without the backpack.

But should the same intuition always be applied to research findings? Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise?

How do we translate this analogy? Let's say running 1 mile in 5 minutes corresponds to statistical significance. Any time below 5 minutes is significant and any time longer than 5 minutes is not significant. The friend's ability is the sample size: the larger the sample size, the easier it is to get a significant result. Finally, the backpack is measurement error. Just like a heavy backpack makes it harder to run 1 mile in 5 minutes, more measurement error makes it harder to get significance.

The question is whether it follows that the “associated effects” (mean difference or regression coefficient that are used to estimate effect sizes) would have been stronger without random measurement error?

The answer is no.  This may not be obvious, but it directly follows from basic introductory statistics, like the formula for the t-statistic.

t-value = (mean difference / SD) * (sqrt(N) / 2)

where SD reflects the variability of the construct in the population plus additional variability due to measurement error. So, measurement error increases the SD in the denominator of the t-value, but it has no systematic effect on the mean difference in the numerator.
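A quick simulation makes the point (a sketch; the group size, seed, and variances are my choices, anticipating the height example used further below: true difference 1 cm, SD 10 cm, 25% reliability giving a total SD of 20 cm):

# Sketch: random measurement error inflates the SD (and shrinks t),
# but leaves the expected mean difference untouched.
set.seed(123)
n = 10000                                        # per group (large, to keep sampling error small)
g1 = rnorm(n, 0, 10);  g2 = rnorm(n, 1, 10)      # perfectly measured height, SD = 10
e1 = rnorm(n, 0, sqrt(300)); e2 = rnorm(n, 0, sqrt(300))  # error variance 300 -> total SD = 20
mean(g2) - mean(g1)                              # about 1 cm
mean(g2 + e2) - mean(g1 + e1)                    # still about 1 cm
t.test(g1, g2)$statistic                         # about -7
t.test(g1 + e1, g2 + e2)$statistic               # about -3.5, roughly half as large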

We caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger. 

With all due respect for trying to make statistics accessible, there is a trade-off between accessibility and sensibility.   First, statistical significance cannot be made stronger. A finding is either significant or it is not.  Surely a test-statistic like a t-value can be (made) stronger or weaker depending on changes in its components.  If we interpret “that which does not kill” as “obtaining a significant result with a lot of random measurement error” it is correct to expect a larger t-value and stronger evidence against the null-hypothesis in a study with a more reliable measure.  This follows directly from the effect of random error on the standard deviation in the denominator of the formula. So how can it be a fallacy to assume something that can be deduced from a mathematical formula? Maybe the authors are not talking about t-values.

It is understandable, then, that many researchers have the intuition that if they manage to achieve statistical significance under noisy conditions, the observed effect would have been even larger in the absence of noise.  As with the runner, they assume that without the burden—that is, uncontrolled variation—their effects would have been even larger.

Although this statement makes it clear that the authors are not talking about t-values, it is not clear why researchers should have the intuition that a study with a more reliable measure should produce larger effect sizes.  As shown above, random measurement error adds to the variability of observations, but it has no systematic effect on the mean difference or regression coefficient.

Now the authors introduce a second source of bias. Unlike random measurement error, this error is systematic and can lead to inflated estimates of effect sizes.

The reasoning about the runner with the backpack fails in noisy research for two reasons. First, researchers typically have so many “researcher degrees of freedom”—unacknowledged choices in how they prepare, analyze, and report their data—that statistical significance is easily found even in the absence of underlying effects and even without multiple hypothesis testing by researchers. In settings with uncontrolled researcher degrees of freedom, the attainment of statistical significance in the presence of noise is not an impressive feat.

The main reason for inferential statistics is to generalize results from a sample to a wider population. The problem with these inductive inferences is that results in a sample vary from sample to sample. This variation is called sampling error. Sampling error is separate from measurement error; even studies with perfect measures have sampling error, and sampling error is inversely related to sample size (proportional to 2/sqrt(N)). Sampling error alone is again unbiased: it can produce larger mean differences or smaller mean differences. However, if studies are split into significant and non-significant studies, the mean differences of significant results are inflated estimates, and the mean differences of non-significant results are deflated estimates, of the population mean difference. So, effect size estimates in studies that are selected for significance are inflated. This is true even in studies with reliable measures.

In a study with noisy measurements and small or moderate sample size, standard errors will be high and statistically significant estimates will therefore be large, even if the underlying effects are small.

To give an example, assume there were a height difference of 1 cm between brown-eyed and blue-eyed individuals. The standard deviation of height is 10 cm. A study with 400 participants has a sampling error of 10 cm * 2/sqrt(400) = 1 cm. To achieve significance, the observed mean difference has to be about twice as large as the sampling error (t = 2 ~ p = .05). Thus, a significant result requires a mean difference of 2 cm, which is 100% larger than the population mean difference in height.

Another researcher uses an unreliable measure (25% reliability) of height that quadruples the variance (100 cm^2 vs. 400 cm^2) and doubles the standard deviation (10cm vs. 20cm).  The sampling error also doubles to 2 cm, and now a mean difference of 4 cm is needed to achieve significance with the same t-value of 2 as in the study with the perfect measure.

The mean difference is two times larger than before and four times larger than the mean difference in the population.

The fallacy would be to look at this difference of 4 cm and to believe that an even larger difference could have been obtained with a more reliable measure. This is a fallacy, but not for the reasons the authors suggest. It is a fallacy because random measurement error does not influence the mean difference of 4 cm; it only increased the standard deviation. With a more reliable measure the standard deviation would be smaller (10 cm instead of 20 cm), the sampling error would be 1 cm, and the mean difference of 4 cm would have a t-value of 4 rather than 2, which is significantly stronger evidence for an effect.

How can the authors overlook that random measurement error has no influence on mean differences?  The reason is that they do not clearly distinguish between standardized and unstandardized estimates of effect sizes.

Spearman famously derived a formula for the attenuation of observed correlations due to unreliable measurement. 

Spearman’s formula applies to correlation coefficients and correlation coefficients are standardized measures of effect sizes because the covariance is divided by the standard deviations of both variables.  Similarly Cohen’s d is a standardized coefficient because the mean difference is divided by the pooled standard deviation of the two groups.

Random measurement error does clearly influence standardized effect size estimates because the standard deviation is used to standardize effect sizes.

The true population mean difference of 1 cm divided by the population standard deviation  of 10 cm yields a Cohen’s d = .10; that is one-tenth of a standard deviation difference.

In the example, the mean difference for a just significant result with a perfect measure was 2 cm, which yields a Cohen's d = 2 cm divided by 10 cm = .2, two-tenths of a standard deviation.

The mean difference for a just significant result with a noisy measure was 4 cm, which yields a standardized effect size of 4 cm divided by 20 cm = .20, also two-tenths of a standard deviation.

Thus, the inflation of the mean difference is proportional to the increase in the standard deviation.  As a result, the standardized effect size is the same for the perfect measure and the unreliable measure.

Compared to the true mean difference of one-tenth of a standard deviation, the standardized effect sizes are both inflated by the same amount (d = .20 vs. d = .10, 100% inflation).

This example shows the main point the authors are trying to make. Standardized effect size estimates are attenuated by random measurement error. At the same time, random measurement error increases sampling error and the mean difference has to be inflated to get significance. This inflation already corrects for the attenuation of standardized effect sizes, and any additional correction for unreliability with the Spearman formula would inflate effect size estimates rather than correct for attenuation.
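To see how much additional inflation such a double correction would produce, apply the standard disattenuation formula for a standardized mean difference, d.true = d.obs / sqrt(reliability), to the height example (this is my illustration of the point, not the authors'):

# Sketch of the double-correction problem in the height example.
d.true = 1/10          # true standardized difference: 1 cm / 10 cm = .10
d.obs  = 4/20          # just-significant result with the noisy measure: .20
d.obs/sqrt(.25)        # "corrected" for 25% reliability: .40, four times the true value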

This would have been a noteworthy observation, but the authors suggest that random measurement error can even have paradox effects on effect size estimates.

But in the small-N setting, this will not hold; the observed correlation can easily be larger in the presence of measurement error (see the figure, middle panel).

This statement is confusing because the most direct effect of measurement error on standardized effect sizes is attenuation. In the height example, any observed mean difference is divided by 20 rather than 10, reducing the standardized effect sizes by 50%. The variability of these standardized effect sizes is simply a function of sample size and therefore equal. Thus, it is not clear how a study with more measurement error can produce larger standardized effect sizes. As demonstrated above, the inflation produced by the significance filter at most compensates for the deflation due to random measurement error. There is simply no paradox by which researchers can obtain stronger evidence (larger t-values or larger standardized effect sizes) with noisier measures, even if results are selected for significance.

Our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies. If it really were true that effect sizes were always attenuated by measurement error, then it would be all the more impressive to have achieved significance.

This makes no sense. If random measurement error attenuates effect sizes, it cannot be used to justify surprisingly large mean differences.  Either we are talking about unstandardized effect sizes, which are not influenced by measurement error, or we are talking about standardized effect sizes, which are attenuated by measurement error, and in that case obtaining large mean differences remains surprising.  If the true mean difference is 1 cm and an effect of 4 cm is needed to get significance with SD = 20 cm, it is surprising to get significance because the power to do so is only about 7%.  Of course, it is only surprising if we knew that the population effect size is only 1 cm, but the main point is that we cannot use random measurement error to justify large effect sizes because random measurement error always attenuates standardized effect size estimates.

More confusing claims follow.

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

As explained above, random measurement error makes t-values weaker, not stronger. It therefore makes no sense to attribute strong t-values to random measurement error as a potential source of variance.  The most likely explanation for strong effect sizes in studies with large sampling error is selection for significance, not random measurement error.

After all of these confusing claims the authors end with a key point.

A key point for practitioners is that surprising results from small studies should not be defended by saying that they would have been even better with improved measurement.

This is true, but it is a moot point because this is neither a logical argument nor an argument researchers actually make.  The bigger problem is that researchers do not realize that the significance filter forces them to find moderate to large effects and that sampling error in small samples alone can produce these effect sizes, especially when questionable research practices are being used.  No claims about hypothetically larger effect sizes are necessary or regularly made.

Next the authors simply make random statements about significance testing that reveal their ideological bias rather than adding to the understanding of t-values.

It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high.

Of course, the t-value is a measure of the strength of evidence against the null-hypothesis, typically the hypothesis that the data were obtained without a mean difference in the population.  The larger the t-value, the less likely it is that the observed t-value could have been obtained without a population mean difference in the direction of the mean difference in the sample.  And with t-values of 4 or higher, published results also have a high probability of replicating a significant result in a replication study (Open Science Collaboration, 2015).  It can be debated whether a t-value of 2 is weak, moderate or strong evidence, but it is not debatable whether t-values provide information that can be used for inductive inferences.  Even Bayes-Factors rely on t-values.  So, the authors’ criticism of t-values makes little sense from any statistical perspective.

It is also a mistake to assume that the observed effect size would have been even larger if not for the burden of measurement error. Intuitions that are appropriate when measurements are precise are sometimes misapplied in noisy and more probabilistic settings.

Once more, these broad claims are false and misleading.  Everything else equal, estimates of standardized effect sizes are attenuated by random measurement error and would be larger if a more reliable measure had been used.  Once selection for significance is present, the resulting inflation inflates standardized effect size estimates obtained with perfect measures, whereas with unreliable measures it merely offsets the attenuation of standardized effect size estimates.

In the end, the authors try to link their discussion of random measurement error to the replication crisis.

The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin. We would all run faster without a backpack on our backs. But when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.

This is confusing. Replicability is a function of power, and power is a function of the population mean difference and the sampling error of the design of a study.  Random measurement error increases sampling error, which reduces standardized effect sizes, power, and replicability.  As a result, studies with unreliable measures are less likely to produce significant results in original studies and in replication studies.

The only reason for surprising replication failures (e.g., 100% significant original studies and 25% significant replication studies for social psychology; OSC, 2015) is the use of questionable practices that inflate the percentage of significant results in original studies.  It is irrelevant whether the original result was produced with a small population mean difference and a reliable measure or with a moderate population mean difference and an unreliable measure.  What matters is only how strong the mean difference is for the measure that was actually used.  That is, replicability is the same for a height difference of 1 cm with a perfect measure and a standard deviation of 10 cm and for a height difference of 2 cm with a noisy measure and a standard deviation of 20 cm.  However, the chance of obtaining a significant result when the mean difference is 1 cm and the SD is 20 cm is lower because the noisy measure reduces the standardized effect size to Cohen’s d = 1 cm / 20 cm = 0.05.
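A rough power calculation makes the difference visible (a sketch; the normal approximation and the group size of n = 200 are my assumptions, chosen to match the standard errors implied by the height example):

```python
# Sketch: approximate two-sided power of a two-sample t-test via the normal
# approximation, for the height example (assumed n = 200 per group).
from scipy.stats import norm

def approx_power(d, n_per_group, alpha=0.05):
    ncp = d * (n_per_group / 2) ** 0.5        # non-centrality of the test statistic
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

print(approx_power(0.10, 200))  # 1 cm with SD = 10 cm (or 2 cm with SD = 20 cm): ~.17
print(approx_power(0.05, 200))  # 1 cm with SD = 20 cm: ~.08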

Conclusion

Loken and Gelman wrote a very confusing article about measurement error.  Although confusion about statistics is the norm among social scientists, it is surprising that a statistician has problems explaining basic statistical concepts and how they relate to the outcome of original and replication studies.

The most probable explanation for the confusion is that the authors seem to believe that the combination of random measurement error and large sampling error creates a novel problem that has been overlooked.

Measurement error and selection bias thus can combine to exacerbate the replication crisis.

In the large-N scenario, adding measurement error will almost always reduce the observed correlation.  Take these scenarios and now add selection on statistical significance… for smaller N, a fraction of the observed effects exceeds the original. 

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

“Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small”

The quotes suggest that the authors believe something extraordinary is happening in studies with large random measurement error and small samples.  However, this is not the case. Random measurement error attenuates t-values, selection for significance inflates them, and these two effects are independent.  There is no evidence to suggest that random measurement error suddenly inflates effect size estimates in small samples, with or without selection for significance.

Recommendations for Researchers 

It is also disconcerting that the authors fail to give recommendations for how researchers can avoid these fallacies, although such recommendations have been made before and would easily fix the problems associated with the interpretation of effect sizes in studies with noisy measures and small samples.

The main problem in noisy studies is that point estimates of effect sizes are not a meaningful statistic.  This is not necessarily a problem. Many exploratory studies in psychology aim to examine whether there is an effect at all and whether this effect is positive or negative.  A statistically significant result only allows researchers to infer that a positive or negative effect contributed to the outcome of the study (because the extreme t-value falls into a range of values that are unlikely without an effect). So, conclusions should be limited to a discussion of the sign of the effect.

Unfortunately, psychologists have misinterpreted Jacob Cohen’s work and started to interpret standardized coefficients like correlation coefficients or Cohen’s d that they observed in their samples.  To make matters worse, these coefficients are sometimes called observed effect sizes, as in the article by Loken and Gelman.

This might have been a reasonable term for trained statisticians, but for poorly trained psychologists it suggested that this number tells them something about the magnitude of the effect they were studying.  After all, this seems a reasonable interpretation of the term “observed effect size.”  They then used Cohen’s book to interpret these values as evidence that they obtained a small, moderate, or large effect.  In small studies, the effects have to be moderate (2 groups, n = 20, p = .05 => d = .64) to reach significance, as the quick check below shows.
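A quick check of this number (my own sketch; the critical value follows directly from the t-distribution):

```python
# Minimum Cohen's d that reaches p < .05 (two-sided) with two groups of n = 20.
from scipy.stats import t

n = 20
t_crit = t.ppf(0.975, 2 * n - 2)          # critical t for alpha = .05, two-sided, df = 38
d_crit = t_crit * (2 / n) ** 0.5          # d = t * sqrt(1/n1 + 1/n2)
print(round(d_crit, 2))                   # ~0.64
```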

However, Cohen explicitly warned against this use of effect sizes. He developed standardized effect size measures to help researchers plan studies that can provide meaningful tests of hypotheses.  A small effect size requires a large sample.  If researchers think an effect is small, they shouldn’t run a study with 40 participants because the study is so noisy that it is likely to fail.  In other words, standardized effect sizes were intended to be assumptions about unobservable population parameters, not descriptions of observed results.
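The sample sizes needed to detect Cohen's benchmark effect sizes with 80% power illustrate this logic (a sketch based on the standard normal approximation; the numbers are textbook power-analysis results, not values from this post):

```python
# Approximate n per group for 80% power in a two-sample t-test (normal approximation).
from scipy.stats import norm

def n_per_group(d, power=0.80, alpha=0.05):
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * (z / d) ** 2

for d in (0.2, 0.5, 0.8):   # Cohen's small, medium, large
    print(f"d = {d}: ~{n_per_group(d):.0f} participants per group")
```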

However, psychologists ignored Cohen’s guidelines for the planning of studies. Instead, they used his standardized effect sizes to examine how strong the “observed effects” in their studies were.  This misinterpretation of Cohen is partially responsible for the replication crisis because researchers ignored the significance filter and were happy to report that they consistently observed moderate to large effect sizes.

However, they also consistently observed replication failures in their labs.  This was puzzling because moderate to large effects should be easy to replicate.  Without training in statistics, social psychologists found an explanation for this variability of observed effect sizes as well: surely, the variability in observed effect sizes (!) from study to study meant that their results were highly dependent on context.  I still remember joking with some other social psychologists that effects even depended on the color of research assistants’ shirts.  Only after reading Cohen did I understand what was really happening.  In studies with large sampling error, the “observed effect sizes” move around a lot because they are not observations of effects.  Most of the variation in mean differences from study to study is purely random sampling error.

At the end of his career, Cohen seemed to have lost faith in psychology as a science.  He wrote a dark and sarcastic article titled “The Earth Is Round (p < .05).”  In this article, he proposes a simple solution for the misinterpretation of “observed effect sizes” in small samples.  The abstract of this article is more informative and valuable than Loken and Gelman’s entire article.

Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication.

The key point is that any sample statistic like an “effect size estimate” (not an observed effect size) has to be considered in the context of the precision of the estimate.  Nobody would take a public opinion poll seriously if it were conducted with only 40 respondents and reported that 55% of respondents favour a candidate, along with the information that the 95% CI ranges from 40% to 70%.
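The arithmetic behind this example is straightforward (a sketch using the numbers in the text):

```python
# 95% confidence interval for 55% support in a poll of 40 respondents.
from math import sqrt

p_hat, n = 0.55, 40
se = sqrt(p_hat * (1 - p_hat) / n)
print(f"95% CI: {p_hat - 1.96 * se:.0%} to {p_hat + 1.96 * se:.0%}")   # ~40% to 70%
```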

The same is true for tons of articles that reported effect size estimates without confidence intervals.  For studies with just significant results this is not a problem because significance translates into a confidence interval that does not contain the value specified by the null-hypothesis, typically zero.  For a just significant result, this means that the boundary of the CI is close to zero.  So, researchers are justified in interpreting the result as evidence about the sign of an effect, but the effect size remains uncertain.  Nobody would rush to buy stock in a drug company that reported that its new drug extends life expectancy by anywhere between 1 day and 3 years.  But if we are misled into focusing on an observed effect size of 1.5 years, we might be foolish enough to invest in the company and lose some money.

In short, noisy studies with unreliable measures and wide confidence intervals cannot be used to make claims about effect sizes.   The reporting of standardized effect size measures can be useful for meta-analyses or to help future researchers plan their studies, but researchers should never interpret their point estimates as observed effect sizes.

Final Conclusion

Although mathematics and statistics are fundamental sciences for all quantitative, empirical sciences, each scientific discipline has its own history, terminology, and unique challenges.  Political science differs from psychology in many ways.  On the one hand, political science has access to large representative samples because there is a lot of interest in those kinds of data and a lot of money is spent on collecting them.  These data make it possible to obtain relatively precise estimates. The downside is that many data are unique to a historic context. The 2016 election in the United States cannot be replicated.

Psychology is different.  Research budgets and ethics often limit sample sizes.  However, within-subject designs with many repeated measures can increase power, something political scientists cannot do.  In addition, studies in psychology can be replicated because the results are less sensitive to a particular historic context (and yes, there are many replicable findings in psychology that generalize across time and culture).

Gelman knows about as much about psychology as I know about political science. Maybe his article is more useful for political scientists, but psychologists would be better off if they finally recognized the important contribution of one of their own methodologists.

To paraphrase Cohen: Sometimes reading less is more, except for Cohen.

A Quantitative Science Needs to Quantify Validity

Background

This article was published in a special issue of the European Journal of Personality.   It examines the unresolved issue of validating psychological measures from the perspective of a multi-method approach (Campbell & Fiske, 1959), using structural equation modeling.

I think it provides a reasonable alternative to the current interest in modeling residual variance in personality questionnaires (network perspective) and solves the problems of manifest personality measures that are confounded by systematic measurement error.

Although latent variable models of multi-method data have been used in structural analyses (Biesanz & West, 2004; DeYoung, 2006), these studies have rarely been used to estimate the validity of personality measures.  This article shows how this can be done and what assumptions need to be made to interpret latent factors as variance in true personality traits.

Hopefully, sharing this article openly on this blog can generate some discussion about the future of personality measurement in psychology.

=====================================================================

What Multi-Method Data Tell Us About
Construct Validity
ULRICH SCHIMMACK*
University of Toronto Mississauga, Canada

European Journal of Personality
Eur. J. Pers. 24: 241–257 (2010)
DOI: 10.1002/per.771  [for original article]

Abstract

Structural equation modelling of multi-method data has become a popular method to examine construct validity and to control for random and systematic measurement error in personality measures. I review the essential assumptions underlying causal models of multi-method data and their implications for estimating the validity of personality measures. The main conclusions are that causal models of multi-method data can be used to obtain quantitative estimates of the amount of valid variance in measures of personality dispositions, but that it is more difficult to determine the validity of personality measures of act frequencies and situation-specific dispositions.

Key words: statistical methods; personality scales and inventories; regression methods; history of psychology; construct validity; causal modelling; multi-method; measurement

INTRODUCTION

Fifty years ago, Campbell and Fiske (1959) published the groundbreaking article Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. With close to 5000 citations (Web of Science, February 1, 2010), it is the most cited article in Psychological Bulletin. The major contribution of this article was to outline an empirical procedure for testing the validity of personality measures. It is difficult to overestimate the importance of this contribution because it is impossible to test personality theories empirically without valid measures of personality.

Despite its high citation count, Campbell and Fiske’s work is often neglected in introductory textbooks, presumably because validation is considered to be an obscure and complicated process (Borsboom, 2006). Undergraduate students of personality psychology learn little more than the definition of a valid measure as a measure that measures what it is supposed to measure.

However, they are not taught how personality psychologists validate their measures. One might hope that aspiring personality researchers learn about Campbell and Fiske’s multi-method approach during graduate school. Unfortunately, even handbooks dedicated to research methods in personality psychology pay relatively little attention to Campbell and Fiske’s (1959) seminal contribution (John & Soto, 2007; Simms & Watson, 2007). More importantly, construct validity is often introduced in qualitative terms.

In contrast, when Cronbach and Meehl (1955) introduced the concept of construct validity, they proposed a quantitative definition of construct validity as the proportion of construct-related variance in the observed variance of a personality measure. Although the authors noted that it would be difficult to obtain precise estimates of construct validity coefficients (CVCs), they stressed the importance of estimating ‘as definitely as possible the degree of validity the test is presumed to have’ (p. 290).

Campbell and Fiske’s (1959) multi-method approach paved the way to do so. Although Campbell and Fiske’s article examined construct validity qualitatively, subsequent developments in psychometrics allowed researchers to obtain quantitative estimates of construct validity based on causal models of multi-method data (Eid, Lischetzke, Nussbeck, & Trierweiler, 2003; Kenny & Kashy, 1992). Research articles in leading personality journals routinely report these estimates (Biesanz & West, 2004; DeYoung, 2006; Diener, Smith, & Fujita, 1995), but a systematic and accessible introduction to causal models of multi-method data is lacking.

The main purpose of this paper is to explain how causal models of multi-method data can be used to obtain quantitative estimates of construct validity and which assumptions these models make to yield accurate estimates.

I prefer the term causal model to the more commonly used term structural equation model because I interpret latent variables in these models as unobserved, yet real causal forces that produce variation in observed measures (Borsboom, Mellenbergh, & van Heerden, 2003). I make the case below that this realistic interpretation of latent factors is necessary to use multi-method data for construct validation research because the assumption of causality is crucial for the identification of latent variables with construct variance (CV).

Campbell and Fiske (1959) distinguished absolute and relative (construct) validity. To examine relative construct validity it is necessary to measure multiple traits and to look for evidence of convergent and discriminant validity in a multi-trait-multi-method matrix (Simms & Watson, 2007). However, to examine construct validity in an absolute sense, it is only necessary to measure one construct with multiple methods.

In this paper, I focus on convergent validity across multiple measures of a single construct because causal models of multi-method data rely on convergent validity alone to examine construct validity.

As discussed in more detail below, causal models of multi-method data estimate construct validity quantitatively with the factor loadings of observed personality measures on a latent factor (i.e. an unobserved variable) that represents the valid variance of a construct. The amount of valid variance in a personality measure can be obtained by squaring its factor loading on this latent factor. In this paper, I use the term construct validity coefficient (CVC) to refer to the factor loading and the term construct variance (CV) for the amount of valid variance in a personality measure.

Validity

A measure is valid if it measures what it was designed to measure. For example, a thermometer is a valid measure of temperature in part because the recorded values covary with humans’ sensory perceptions of temperature (Cronbach & Meehl, 1955). A modern thermometer is a more valid measure of temperature than humans’ sensory perceptions, but the correlation between scores on a thermometer and humans’ sensory perceptions is necessary to demonstrate that a thermometer measures temperature. It would be odd to claim that highly reliable scores recorded by an expensive and complicated instrument measure temperature if these scores were unrelated to humans’ everyday perceptions of temperature.

The definition of validity as a property of a measure has important implications for empirical tests of validity. Namely, researchers first need a clearly defined construct before they can validate a potential measure of the construct. For example, to evaluate a measure of anxiety researchers first need to define anxiety and then examine the validity of a measure as a measure of anxiety. Although the importance of clear definitions for construct validation research may seem obvious, validation research often seems to work in the opposite direction; that is, after a measure has been created psychologists examine what it measures.

For example, the widely used Positive Affect and Negative Affect Schedule (PANAS) has two scales named Positive Affect (PA) and Negative Affect (NA). These scales are based on exploratory factor analyses of mood ratings (Watson, Clark, & Tellegen, 1988). As a result, Positive Affect and Negative Affect are merely labels for the first two VARIMAX rotated principal components that emerged in these analyses. Thus, it is meaningless to examine whether the PANAS scales are valid measures of PA and NA. They are valid measures of PA and NA by definition because PA and NA are mere labels of the two VARIMAX rotated principal components that emerge in factor analyses of mood ratings.

A construct validation study would have to start with an a priori definition of Positive Affect and Negative Affect that does not refer to the specific measurement procedure that was used to create the PANAS scales. For example, some researchers have defined Positive Affect and Negative Affect as the valence of affective experiences and have pointed out problems of the PANAS scales as measures of pleasant and unpleasant affective experiences (see Schimmack, 2007, for a review).

However, the authors of the PANAS do not view their measure as a measure of hedonic valence. To clarify their position, they proposed to change the labels of their scales from Positive Affect and Negative Affect to Positive Activation and Negative Activation (Watson, Wiese, Vaidya, & Tellegen, 1999). The willingness to change labels indicates that PANAS scales do not measure a priori defined constructs and as a result there is no criterion to evaluate the construct validity of the PANAS scales.

The previous example illustrates how personality measures assume a life of their own and implicitly become the construct; that is, a construct is operationally defined by the method that is used to measure it (Borsboom, 2006). A main contribution of Campbell and Fiske’s (1959) article was to argue forcefully against operationalism and for a separation of constructs and methods. This separation is essential for validation research because validation research has to allow for the possibility that some of the observed variance is invalid.

Other sciences clearly follow this approach. For example, physics has clearly defined concepts such as time or temperature. Over the past centuries, physicists have developed increasingly precise ways of measuring these concepts, but the concepts have remained the same. Modern physics would be impossible without these advances in measurement. However, psychologists do not follow this model of more advanced sciences. Typically, a measure becomes popular and after it becomes popular it is equated with the construct. As a result, researchers continue to use old measures and rarely attempt to create better measures of the same construct. Indeed, it is hard to find an example in which one measure of a construct has replaced another measure of the same construct based on an empirical comparison of the construct validity of competing measures of the same construct (Grucza & Goldberg, 2007).

One reason for the lack of progress in the measurement of personality constructs could be the belief that it is impossible to quantify the validity of a measure. If it were impossible to quantify the validity of a measure, then it also would be impossible to say which of two measures is more valid. However, causal models of multi-method data produce quantitative estimates of validity that allow comparisons of the validity of different measures.

One potential obstacle for construct validation research is the need to define psychological constructs a priori without reference to empirical data. This can be difficult for constructs that make reference to cognitive processes (e.g. working memory capacity) or unconscious motives (implicit need for power). However, the need for a priori definitions is not a major problem in personality psychology. The reason is that everyday language provides thousands of relatively well-defined personality constructs (Allport & Odbert, 1936). In fact, all measures in personality psychology that are based on the lexical hypothesis assume that everyday concepts such as helpful or sociable are meaningful personality constructs. At least with regard to these relatively simple constructs, it is possible to test the construct validity of personality measures. For example, it is possible to examine whether a sociability scale really measures sociability and whether a measure of helpfulness really measures helpfulness.

Convergent validity

I start with a simple example to illustrate how psychologists can evaluate the validity of a personality measure. The concept is people’s weight. Weight can be defined as ‘the vertical force exerted by a mass as a result of gravity’ (wordnet.princeton.edu). In the present case, only the mass of human adults is of interest. The main goal, which has real practical significance in health psychology (Kroh, 2005), is to examine the validity of self-report measures of weight because it is more economical to use self-reports than to weigh people with scales.

To examine the validity of self-reported weight as a measure of actual weight, it is possible to obtain self-reports of weight and an objective measure of weight from the same individuals. If self-reports of weight are valid, they should be highly correlated with the objective measure of weight. In one study, participants first reported their weight before their weight was objectively measured with a scale several weeks later (Rowland, 1990). The correlation in this study was r (N = 11,284) = .98. The implications of this finding for the validity of self-reports of weight depend on the causal processes that underlie this correlation, which can be examined by means of causal modelling of correlational data.

It is well known that a simple correlation does not reveal the underlying causal process, but that some causal process must explain why a correlation was observed (Chaplin, 2007). Broadly speaking, a correlation is determined by the strength of four causal effects, namely, the effect of observed variable A on observed variable B, the effect of observed variable B on observed variable A, and the effects of an unobserved variable C on observed variable A and on observed variable B.

In the present example, the observed variables are the self-reported weights and those recorded by a scale. To make inferences about the validity of self-reports of weight it is necessary to make assumptions about the causal processes that produce a correlation between these two methods. Fortunately, it is relatively easy to do so in this example. First, it is fairly certain that the values recorded by a scale are not influenced by individuals’ self-reports. No matter how much individuals insist that the scale is wrong, it will not change its score. Thus, it is clear that the causal effect of self-reports on the objective measure is zero. It is also clear that self-reports of weight were not influenced by the objective measurement of weight in this study because self-reports were obtained weeks before the actual weight was measured. Thus, the causal effect of the objectively recorded scores on self-ratings is also zero. It follows that the correlation of r = .98 must have been produced by a causal effect of an unobserved third variable. A plausible third variable is individuals’ actual mass. It is their actual mass that causes the scale to record a higher or lower value, and their actual mass also caused them to report a specific weight. The latter causal effect is probably mediated by prior objective measurements with other scales, and the validity of these scales would influence the validity of self-reports among other factors (e.g. socially desirable responding).

In combination, the causal effects of actual mass on self-reports and on the scale produce the observed correlation of r = .98. This correlation is not sufficient to determine how strong the effects of weight on the two measures are. It is possible that the scale was a perfect measure of weight. In this case, the correlation between weight and the values recorded by the scale is 1. It follows that the size of the effect of weight on self-reports of weight (or the factor loading of self-reported weight on the weight factor) has to be r = .98 to produce the observed correlation of r = .98 (1 × .98 = .98). In this case, the CVC of the self-report measure of weight would be .98. However, it is also possible that the scale is a slightly imperfect measure of weight. For example, participants may not have removed their shoes before stepping on the scale, and differences in the weight of shoes (e.g. boots versus sandals) could have produced measurement error in the objective measure of individuals’ true weight. It is also possible that changes in weight over time reduce the validity of objective scores as a validation criterion for self-ratings obtained several weeks earlier. In this case, the estimate underestimates the validity of self-ratings.

In the present context, the reasons for the lack of perfect convergent validity are irrelevant. The main point of this example was to illustrate how the correlation between two independent measures of the same construct can be used to obtain quantitative estimates of the validity of a personality measure. In this example, a conservative estimate of the CVC of self-reported weight as a measure of weight is .98 and the estimated amount of CV in the self-report measure is 96% (.98^2 = .96).
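The arithmetic can be written out as a short sketch (the observed correlation is from Rowland, 1990; the alternative loadings for the objective scale are purely illustrative assumptions):

```python
# CVC of self-reported weight implied by r = .98, for different assumed validities of
# the objective scale (under the causal model above, observed r = scale CVC * self-report CVC).
r_observed = 0.98

for scale_cvc in (1.00, 0.99, 0.98):                 # assumed CVC of the objective scale
    self_report_cvc = r_observed / scale_cvc
    print(f"scale CVC = {scale_cvc:.2f} -> self-report CVC = {self_report_cvc:.2f}, "
          f"construct variance = {self_report_cvc ** 2:.0%}")
```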
The example of self-reported weight was used to establish four important points about construct validity. First, the example shows that convergent validity is sufficient to examine construct validity. The question of how self-reports of weight are related to measures of other constructs (e.g. height, socially desirable responding) can be useful to examine sources of measurement error, but correlations with measures of other constructs are not needed to estimate CVCs. Second, empirical tests of construct validity do not have to be an endless process without clear results (Borsboom, 2006). At least for some self-report measures it is possible to provide a meaningful answer to the question of their validity. Third, validity is a quantitative construct. Qualitative conclusions that a measure is valid because validity is not zero (CVC > 0, p < .05) or that a measure is invalid because validity is not perfect (CVC < 1.0, p < .05) are not very helpful because most measures are both valid and invalid (0 < CVC < 1). As a result, qualitative reviews of validity studies are often the source of fruitless controversies (Schimmack & Oishi, 2005). The validity of personality measures should be estimated quantitatively like other psychometric properties such as reliability coefficients, which are routinely reported in research articles (Schmidt & Hunter, 1996).

Validity is more important than reliability because reliable and invalid measures are potentially more dangerous than unreliable measures (Blanton & Jaccard, 2006). Moreover, it is possible that a less reliable measure is more valid than a more reliable measure if the latter measure is more strongly contaminated by systematic measurement error (John & Soto, 2007). A likely explanation for the emphasis on reliability is the common tendency to equate constructs with measures. If a construct is equated with a measure, only random error can undermine the validity of a measure. The main contribution of Campbell and Fiske (1959) was to point out that systematic measurement error can also threaten the validity of personality measures. As a result, high reliability is insufficient evidence for the validity of a personality measure (Borsboom & Mellenbergh, 2002).

The fourth point illustrated in this example is that tests of convergent validity require independent measures. Campbell and Fiske (1959) emphasized the importance of independent measures when they defined convergent validity as the correlation between ‘maximally different methods’ (p. 83). In a causal model of multi-method data the independence assumption implies that the only causal effects that produce a correlation between two measures of the same construct are the causal effects of the construct on the two measures. This assumption implies that all the other potential causal effects that can produce correlations among observed measures have an effect size of zero. If this assumption is correct, the shared variance across independent methods represents CV. It is then possible to estimate the proportion of the shared variance relative to the total observed variance of a personality measure as an estimate of the amount of CV in this measure. For example, in the previous example I assumed that actual mass was the only causal force that contributed to the correlation between self-reports of weight and objective scale scores. This assumption would be violated if self-ratings were based on previous measurements with objective scales (which is likely) and objective scales shared method variance that does not reflect actual weight (which is unlikely). Thus, even validation studies with objective measures implicitly make assumptions about the causal model underlying these correlations.

In sum, the weight example illustrated how a causal model of the convergent validity between two measures of the same construct can be used to obtain quantitative estimates of the construct validity of a self-report measure of a personality characteristic. The following example shows how the same approach can be used to examine the construct validity of measures that aim to assess personality traits without the help of an objective measure that relies on well-established measurement procedures for physical characteristics like weight.

CONVERGENT VALIDITY OF PERSONALITY MEASURES

A Hypothetical Example

I use helpfulness as an example. Helpfulness is relatively easy to define as ‘providing assistance or serving a useful function’ (wordnetweb.princeton.edu/perl/webwn). Helpful can be used to describe a single act or an individual. If helpful is used to describe a single act, helpful is not only a characteristic of a person because helping behaviour is also influenced by situational factors and interactions between personality and situational factors. Thus, it is still necessary to provide a clearer definition of helpfulness as a personality characteristic before it is possible to examine the validity of a personality measure of helpfulness.

Personality psychologists use trait concepts like helpful in two different ways. The most common approach is to define helpful as an internal disposition. This definition implies causality. There are some causal factors within an individual that make it more likely for this individual to act in a helpful manner than other individuals. The alternative approach is to define helpfulness as the frequency with which individuals act in a helpful manner. An individual is helpful if he or she acted in a helpful manner more often than other people. This approach is known as the act frequency approach. The broader theoretical differences between these two approaches are well known and have been discussed elsewhere (Block, 1989; Funder, 1991; McCrae & Costa, 1995). However, the implications of these two definitions of personality traits for the interpretation of multi-method data have not been discussed. Ironically, it is easier to examine the validity of personality measures that aim to assess internal dispositions that are not directly observable than to do so for personality measures that aim to assess frequencies of observable acts. This is ironic because intuitively it seems to be easier to count the frequency of observable acts than to measure unobservable internal dispositions. In fact, not too long ago some psychologists doubted that internal dispositions even exist (cf. Goldberg, 1992).

The measurement problem of the act frequency approach is that it is quite difficult to observe individuals’ actual behaviours in the real world. For example, it is no trivial task to establish how often John was helpful in the past month. In comparison it is relatively easy to use correlations among multiple imperfect measures of observable behaviours to make inferences about the influence of unobserved internal dispositions on behaviour.

Figure 1. Theoretical model of multi-method data. Note. T = trait (general disposition); AF-c, AF-f, AF-s = act frequencies with colleague, friend and spouse; S-c, S-f, S-s = situational and person × situation interaction effects on act frequencies; R-c, R-f, R-s = reports by colleague, friend and spouse; E-c, E-f, E-s = errors in reports by colleague, friend and spouse.

Figure 1 illustrates how a causal model of multi-method data can be used for this purpose. In Figure 1, an unobserved general disposition to be helpful influences three observed measures of helpfulness. In this example, the three observed measures are informant ratings of helpfulness by a friend, a co-worker and a spouse. Unlike actual informant ratings in personality research, informants in this hypothetical example are only asked to report how often the target helped them in the past month. According to Figure 1, each informant report is influenced by two independent factors, namely, the actual frequency of helpful acts towards the informant and (systematic and random) measurement error in the reported frequencies of helpful acts towards the informant. The actual frequency of helpful acts is also influenced by two independent factors. One factor represents the general disposition to be helpful that influences helpful behaviours across situations. The other factor represents situational factors and person-situation interaction effects. To fully estimate all coefficients in this model (i.e. effect sizes of the postulated causal effects), it would be necessary to separate measurement error and valid variance in act frequencies.

This is impossible if, as in Figure 1, each act frequency is measured with a single method, namely, one informant report. In contrast, the influence of the general disposition is reflected in all three informant reports. As a result, it is possible to separate the variance due to the general disposition from all other variance components such as random error, systematic rating biases, situation effects and person-situation interaction effects. It is then possible to determine the validity of informant ratings as measures of the general disposition, but it is impossible to (precisely) estimate the validity of informant ratings as measures of act frequencies because the model cannot distinguish reporting errors from situational influences on helping behaviour.

The causal model in Figure 1 makes numerous independence assumptions that specify Campbell and Fiske’s (1959) requirement that traits should be assessed with independent methods. First, the model assumes that biases in ratings by one rater are independent of biases in ratings by other raters. Second, it assumes that situational factors and person-by-situation interaction effects that influence helping one informant are independent of the situational and person-situation factors that influence helping other informants. Third, it assumes that rating biases are independent of situation and person-by-situation interaction effects for the same rater and across raters. Finally, it assumes that rating biases and situation effects are independent of the global disposition. In total, this amounts to 21 independence assumptions (i.e. Figure 1 includes seven exogenous variables, that is, variables that do not have an arrow pointing at them, which implies 21 (7×6/2) relationships that the model assumes to be zero). If these independence assumptions are correct, the correlations among the three informant ratings can be used to determine the variation in the unobserved personality disposition to be helpful with perfect validity. This variance can then be used like the objective measure of weight in the previous example as the validation criterion for personality measures of the general disposition to be helpful (e.g. self-ratings of general helpfulness). In sum, Figure 1 illustrates that a specific pattern of correlations among independent measures of the same construct can be used to obtain precise estimates of the amount of valid variance in a single measure.
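For readers who want to see the algebra, the loadings in such a single-factor model are identified by the three pairwise correlations among the informant reports. The following sketch uses made-up correlations purely for illustration (the analyses reported below were run in MPLUS):

```python
# Loadings (CVCs) of three independent methods on a single latent disposition,
# identified from their pairwise correlations: loading_c = sqrt(r_cf * r_cs / r_fs).
from math import sqrt

r_cf, r_cs, r_fs = 0.30, 0.24, 0.40   # colleague-friend, colleague-spouse, friend-spouse (hypothetical)

cvc_c = sqrt(r_cf * r_cs / r_fs)
cvc_f = sqrt(r_cf * r_fs / r_cs)
cvc_s = sqrt(r_cs * r_fs / r_cf)

for name, cvc in (("colleague", cvc_c), ("friend", cvc_f), ("spouse", cvc_s)):
    print(f"{name}: CVC = {cvc:.2f}, construct variance = {cvc ** 2:.0%}")
```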

The main challenge for actual empirical studies is to ensure that the methods in a multi-method model fulfill the independence assumptions. The following examples demonstrate the importance of the neglected independence assumption for the correct interpretation of causal models of multi-method data. I also show how researchers can partially test the independence assumption if sufficient methods are available and how researchers can estimate the validity of personality measures that aggregate scores from independent methods. Before I proceed, I should clarify that strict independence of methods is unlikely, just like other null-hypotheses are likely to be false. However, small violations of the independence assumption will only introduce small biases in estimates of CVCs.

Example 1: Multiple response formats

The first example is a widely cited study of the relation between Positive Affect and Negative Affect (Green, Goldman, & Salovey, 1993). I chose this paper because the authors emphasized the importance of a multi-method approach for the measurement of affect, while neglecting Campbell and Fiske’s requirement that the methods should be maximally different. A major problem for any empirical multi-method study is to find multiple independent measures of the same construct. The authors used four self-report measures with different response formats for this purpose. However, the variation of response formats can only be considered a multi-method study if one assumes that responses on one response format are independent of responses on the other response formats, so that correlations across response formats can only be explained by a common causal effect of actual momentary affective experiences on each response format. However, the validity of all self-report measures depends on the ability and willingness of respondents to report their experiences accurately. Violations of this basic assumption introduce shared method variance among self-ratings on different response formats. For example, socially desirable responding can inflate ratings of positive experiences across response formats. Thus, Green et al.’s (1993) study assumed rather than tested the validity of self-ratings of momentary affective experiences. At best, their study was able to examine the contribution of stylistic tendencies in the use of specific response formats to variance in mood ratings, but these effects are known to be small (Schimmack, Bockenholt, & Reisenzein, 2002). In sum, Green et al.’s (1993) article illustrates the importance of critically examining the similarity of methods in a multi-method study. Studies that use multiple self-report measures that vary response formats, scales, or measurement occasions should not be considered multi-method studies that can be used to examine construct validity.

Example 2: Three different measures

The second example of a multi-method study also examined the relation between Positive Affect and Negative Affect (Diener et al., 1995). However, it differs from the previous example in two important ways. First, the authors used more dissimilar methods that are less likely to violate the independence assumption, namely, self-reports of affect in the past month, averaged daily affect ratings over a 6-week period and averaged ratings of general affect by multiple informants. Although these are different methods, it is possible that these methods are not strictly independent. For example, Diener et al. (1995) acknowledge that all three measures could be influenced by impression management. That is, retrospective and daily self-ratings could be influenced by socially desirable responding, and informant ratings could be influenced by targets’ motivation to hide negative emotions from others. A common influence of impression management on all three methods would inflate validity estimates of all three methods.

For this paper, I used Diener et al.’s (1995) multi-method data to estimate CVCs for the three methods as measures of general dispositions that influence people’s positive and negative affective experiences. I used the data from Diener et al.’s (1995) Table 15 that are reproduced in Table 1. I used MPLUS 5.1 for these analyses and all subsequent analyses (Muthen & Muthen, 2008). I fitted a simple model with a single latent variable that represents a general disposition that has causal effects on the three measures. Model fit was perfect because a model with three variables and three parameters has zero degrees of freedom and can perfectly reproduce the observed pattern of correlations. The perfect fit implies that CVC estimates are unbiased if the model assumptions are correct, but it also implies that the data are unable to test model assumptions.

These results suggest impressive validity of self-ratings of affect (Table 2). In contrast, CVC estimates of informant ratings are considerably lower, despite the fact that informant ratings are based on averages of several informants. The non-overlapping confidence intervals for self-ratings and informant ratings indicate that this difference is statistically significant. There are two interpretations of this pattern. On the one hand, it is possible that informants are less knowledgeable about targets’ affective experiences. After all, they do not have access to information that is only available introspectively. However, this privileged information does not guarantee that self-ratings are more valid because individuals only have privileged information about their momentary feelings in specific situations rather than the internal dispositions that influence these feelings. On the other hand, it is possible that retrospective and daily self-ratings share method variance and do not fulfill the independence assumption. In this case, the causal model would provide inflated estimates of the validity of self-ratings because it assumes that stronger correlations between retrospective and daily self-ratings reveal higher validity of these methods, when in reality the higher correlation is caused by shared method effects. A study with three methods is unable to test these alternative explanations.

Example 3: Informants as multiple methods

One limitation of Diener et al.’s (1995) study was the aggregation of informant ratings. Although aggregated informant ratings provide more valid information than ratings by a single informant, the aggregation of informant ratings destroys valuable information about the correlations among informant ratings. The example in Figure 1 illustrated that ratings by multiple informants provide one of the easiest ways to measure dispositions with multiple methods because informants are more likely to base their ratings on different situations, which is necessary to reveal the influence of internal dispositions.

Example 3 shows how ratings by multiple informants can be used in construct validation research. The data for this example are based on multi-method data from the Riverside Accuracy Project (Funder, 1995; Schimmack, Oishi, Furr, & Funder, 2004). To make the CVC estimates comparable to those based on the previous example, I used scores on the depression and cheerfulness facets of the NEO-PI-R (Costa & McCrae, 1992). These facets are designed to measure affective dispositions. The multi-method model used self-ratings and informant ratings by parents, college friends and hometown friends as different methods.

Table 3 shows the correlation matrices for cheerfulness and depression. I first fitted a causal model that assumed independence of all methods to the data. The model also included sum scores of observed measures to examine the validity of aggregated informant ratings and an aggregated measure of all four raters (Figure 2). Model fit was evaluated using standard criteria of model fit, namely, comparative fit index (CFI) > .95, root mean square error of approximation (RMSEA) < .06 and standardized root mean square residual (SRMR) < .08.

Neither cheerfulness, chi2 (df = 2, N = 222) = 11.30, p < .01, CFI = .860, RMSEA = .182, SRMR = .066, nor depression, chi2 (df = 2, N = 222) = 8.31, p = .02, CFI = .915, RMSEA = .150, SRMR = .052, had acceptable CFI and RMSEA values.

One possible explanation for this finding is that self-ratings are not independent of informant ratings because self-ratings and informant ratings could be partially based on overlapping situations. For example, self-ratings of cheerfulness could be heavily influenced by the same situations that are also used by college friends to rate cheerfulness (e.g. parties). In this case, some of the agreement between self-ratings and informant ratings by college friends would reflect the specific situational factors of overlapping situations, which leads to shared variance between these ratings that does not reflect the general disposition. In contrast, it is more likely that informant ratings are independent of each other because informants are less likely to rely on the same situations (Funder, 1995). For example, college friends may rely on different situations than parents.

To examine this possibility, I fitted a model that included additional relations between self-ratings and informant ratings (dotted lines in Figure 2). For cheerfulness, an additional relation between self-ratings and ratings by college friends was sufficient to achieve acceptable model fit, chi2 (df = 1, N = 222) = 0.08, p = .78, CFI = 1.00, RMSEA = .000, SRMR = .005. For depression, additional relations of self-ratings to ratings by college friends and parents were necessary to achieve acceptable model fit. Model fit of this model was perfect because it has zero degrees of freedom. In these models, CVC can no longer be estimated by factor loadings alone because some of the valid variance in self-ratings is also shared with informant ratings. In this case, CVC estimates represent the combined total effect of the direct effect of the latent disposition factor on self-ratings and the indirect effects that are mediated by informant ratings.

I used the model indirect option of MPLUS 5.1 to estimate the total effects in a model that also included sum scores with equal weights for the three informant ratings and all four ratings. Table 4 lists the CVC estimates for the four ratings and the two measures based on aggregated ratings.

The CVC estimates of self-ratings are considerably lower than those based on Diener et al.’s (1995) data. Moreover, the results suggest that in this study aggregated informant ratings are more valid than self-ratings, although the confidence intervals overlap. The results for the aggregated measure of all four raters show that adding self-ratings to informant ratings did not increase validity above and beyond the validity obtained by aggregating informant ratings.

These results should not be taken too seriously because they are based on a single, relatively small sample. Moreover, it is important to emphasize that these CVC estimates depend on the assumption that informant ratings do not share method variance. Violation of this assumption would lead to an underestimation of the validity of self-ratings. For example, an alternative assumption would be that personality changes. As a result, parent ratings and ratings by hometown friends may share variance because they are based in part on situations before personality changed, whereas college friends’ ratings are based on more recent situations. This model fits the data equally well and leads to much higher estimates of CV in self-ratings. To test these competing models it would be necessary to include additional measures. For example, standardized laboratory tasks and biological measures could be added to the design to separate valid variance from shared rating biases by informants.

These inconsistent findings, which produced widely divergent quantitative estimates of construct validity, might suggest that quantitative validation research is futile. However, the same problem arises in other research areas, and it can be addressed by designing better studies that test assumptions that cannot be tested in existing data sets. In fact, I believe that the publication of conflicting validity estimates will stimulate research on construct validity, whereas the view of construct validation research as an obscure process without clear results has obscured the lack of knowledge about the validity of personality measures.

IMPLICATIONS

I used two multi-method datasets to illustrate how causal models of multi-method data can be used to estimate the validity of personality measures. The studies produced different results. It is not the purpose of this paper to examine the sources of this disagreement. The results merely show that it is difficult to make general claims about the validity of commonly used personality measures. Until more precise estimates become available, the results suggest that about 30–70% of the variance in self-ratings and single informant ratings is CV, and I suggest 50 +/- 20% as a rough estimate of the construct validity of personality ratings.

I suggest the verbal labels low validity for measures with less than 30% CV (e.g., implicit measures of well-being; Walker & Schimmack, 2008), moderate validity for measures with 30–70% CV (e.g., most self-report measures of personality traits), and high validity for measures with more than 70% CV (e.g., self-ratings of height and weight). In the following sections, I briefly discuss the practical implications of using self-report measures with moderate validity to study the causes and consequences of personality dispositions.
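
The cutoffs above can be expressed as a trivial helper (a sketch only; the function name and interface are mine, and only the 30% and 70% boundaries come from the text):

def validity_label(cv_proportion):
    """Map a proportion of construct-valid variance (0-1) to the verbal labels suggested above."""
    if cv_proportion < 0.30:
        return "low validity"       # e.g., implicit measures of well-being
    elif cv_proportion <= 0.70:
        return "moderate validity"  # e.g., most self-report measures of personality traits
    else:
        return "high validity"      # e.g., self-ratings of height and weight

print(validity_label(0.50))   # moderate validity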

Correction for invalidity

Measurement error is nearly unavoidable, especially in the measurement of complex
constructs such as personality dispositions. Schmidt and Hunter (1996) provided
26 examples of how the failure to correct for measurement error can bias substantive
conclusions. One limitation of their important article was the focus on random
measurement error. The main reason is probably that information about random
measurement error is readily available. However, invalid variance due to systematic
measurement error is another factor that can distort research findings. Moreover, given
the moderate amount of valid variance in personality measures, corrections for invalidity are likely to have more dramatic practical implications than corrections for unreliability. The following examples illustrate this point.
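
Before turning to these examples, a minimal sketch may help to show why corrections for invalidity are more consequential than the familiar correction for unreliability. The observed correlation, reliability, and valid-variance values below are hypothetical (chosen to lie in the range of typical personality scales and of the validity estimates suggested above), and the sketch simplifies by treating the criterion as perfectly measured.

import math

r_observed = 0.30       # hypothetical observed correlation between a personality measure and a criterion
reliability = 0.80      # typical reliability of a multi-item personality scale (assumed)
valid_variance = 0.50   # rough construct-validity estimate for personality ratings suggested above

# The classical correction for attenuation divides by the square root of the reliability;
# a correction for invalidity divides by the square root of the valid variance instead.
r_corrected_for_unreliability = r_observed / math.sqrt(reliability)   # about .34
r_corrected_for_invalidity = r_observed / math.sqrt(valid_variance)   # about .42

print(round(r_corrected_for_unreliability, 2), round(r_corrected_for_invalidity, 2))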

Hundreds of twin studies have examined the similarity between MZ and DZ twins to examine the heritability of personality characteristics. A common finding in these studies is moderate to large MZ correlations (r =.3–.5) and small to moderate DZ correlations (r =.1–.3). This finding has led to the conclusion that approximately 40% of the variance is heritable and 60% of the variance is caused by environmental factors. However, this interpretation of twin data fails to take measurement error into account. As it turns out, MZ correlations approach, if not exceed, the amount of valid variance in personality measures as estimated by multi-method data. In other words, self-ratings by two different individuals (MZ twins) tend to correlate about as highly with each other as two ratings of a single individual by different methods (self-ratings and informant ratings of a single target). This finding suggests that heritability estimates based on mono-method studies severely underestimate the heritability of personality dispositions (Riemann, Angleitner, & Strelau, 1997). A correction for invalidity would suggest that most of the valid variance is heritable (Lykken & Tellegen, 1996). However, it is problematic to apply a direct correction for invalidity to twin data because this correction relies on the assumption that methods are independent. It is better to combine a multi-method assessment with a twin design (Riemann et al., 1997). It is also important to realize that multi-method models focus on internal dispositions rather than act frequencies. It makes sense that heritability estimates of internal dispositions are higher than heritability estimates of act frequencies because act frequencies are also influenced by situational factors.
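
Purely to illustrate the magnitude of such a correction, and not as a recommended analysis given the problems with the independence assumption noted above, the following sketch rescales a conventional heritability estimate by an assumed proportion of valid variance. All three input values are hypothetical numbers within the ranges discussed in the text.

# Hypothetical twin correlations for a self-report personality measure.
r_mz = 0.45             # monozygotic twin correlation
r_dz = 0.20             # dizygotic twin correlation
valid_variance = 0.50   # assumed proportion of construct-valid variance in the measure

h2_observed = 2 * (r_mz - r_dz)                      # Falconer's formula on observed scores (about .50)
h2_of_valid_variance = h2_observed / valid_variance  # share of the valid variance that is heritable (about 1.0)

print(round(h2_observed, 2), round(h2_of_valid_variance, 2))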

Stability of personality dispositions

The study of stability of personality has a long history in personality psychology (Conley,
1984). However, empirical conclusions about the actual stability of personality are
hampered by the lack of good data. Most studies have relied on self-report data to examine this question. Given the moderate validity of self-ratings, it is likely that studies based on self-ratings underestimate the true stability of personality. Even corrections for unreliability alone are sufficient to achieve impressive stability estimates of r =.98 over a 1-year interval (Anusic & Schimmack, 2016; Conley, 1984). The evidence for stability of personality from multi-method studies is even more impressive. For example, one study reported a retest correlation of r =.46 over a 26-year interval for a self-report measure of neuroticism (Conley, 1985). It seems possible that personality could change considerably over such a long time period. However, the study also included informant ratings of personality. Self-informant agreement on the same occasion was also r =.46. Under the assumption that self-ratings and informant ratings are independent methods and that there is no stability in method variance, this pattern of correlations would imply that variation in neuroticism did not change at all over this 26-year period (.46/.46 = 1.00). However, this conclusion rests on the validity of the assumption that method variance is not stable. Given the availability of longitudinal multi-method data, it is possible to test this assumption. The relevant information is contained in the cross-informant, cross-occasion correlations. If method variance were unstable, these correlations should also be r =.46. In contrast, the actual correlations are lower, r =.32. This finding indicates that (a) personality dispositions changed and (b) there is some stability in the method variance. However, the actual stability of personality dispositions is still considerably higher (r =.32/.46 =.70) than one would have inferred from the observed retest correlation of r =.46 for self-ratings alone. A retest correlation of r =.70 over a 26-year interval is consistent with other estimates that the stability of personality dispositions is about r =.90 over a 10-year period and r =.98 over a 1-year period (Conley, 1984; Terracciano, Costa, & McCrae, 2006) and that the majority of the variance is due to stable traits that never change (Anusic & Schimmack, 2016). The failure to realize that observed retest correlations underestimate the stability of personality dispositions can be costly because it gives personality researchers a false impression about the likelihood of finding empirical evidence for personality change. Given the true stability of personality, it is necessary to wait a long time or to use large sample sizes, and probably best to do both (Mroczek, 2007).
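
The arithmetic in this example can be laid out explicitly. The following sketch uses the correlations quoted above for Conley's (1985) data; the extrapolation to annual and 10-year stabilities assumes a simple autoregressive process, which is only a rough approximation to the cited models.

# Correlations quoted above for Conley's (1985) multi-method, multi-occasion data.
r_self_retest_26y = 0.46           # self-report neuroticism, 26-year retest correlation
r_self_informant_same_time = 0.46  # self-informant agreement on the same occasion
r_cross_method_cross_time = 0.32   # cross-informant, cross-occasion correlation

# If method variance were completely unstable, the cross-method, cross-occasion correlation
# would equal the same-occasion agreement; because it is lower, part of the retest correlation
# reflects stable method variance, and trait stability is estimated by the ratio:
stability_26y = r_cross_method_cross_time / r_self_informant_same_time   # about .70

# Rough conversion to shorter intervals under an autoregressive assumption.
annual_stability = stability_26y ** (1 / 26)   # about .99 per year
ten_year_stability = annual_stability ** 10    # about .87, close to the ~.90 cited above

print(round(r_self_retest_26y, 2), round(stability_26y, 2), round(annual_stability, 3), round(ten_year_stability, 2))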

Prediction of behaviour and life outcomes

During the person-situation debate, it was proposed that a single personality trait predicts less than 10% of the variance in actual behaviours. However, most of these studies relied on self-ratings to measure personality. Given the moderate validity of self-ratings, the observed correlations severely underestimate the actual effects of personality traits on behaviour. For example, a recent meta-analysis reported an effect size of conscientiousness on GPA of r =.24 (Noftle & Robins, 2007). Ozer (2007) points out that, strictly speaking, the correlation between self-reported conscientiousness and GPA does not represent the magnitude of a causal effect.

Assuming 40% valid variance in self-report measures of conscientiousness (DeYoung, 2006), the true effect size of a conscientious disposition on GPA is r =.38 (.24/sqrt(.40)). As a result, the amount of explained variance in GPA increases from 6% to 14%. Once more, failure to correct for invalidity in personality measures can be costly. For example, a personality researcher might identify seven causal factors that each independently produce an observed effect size estimate of r =.24, which suggests that these seven factors explain less than half of the variance in GPA (7 * .24^2 ≈ 40%). Yet decades of further research might fail to uncover additional predictors of GPA. The reason could be that the true amount of explained variance is already close to 100% and that the apparently unexplained variance is due to invalid variance in the personality measures (7 * .38^2 ≈ 100%).
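
A short sketch of this arithmetic, using the values given above (the meta-analytic correlation of r =.24, the 40% valid-variance assumption, and the hypothetical scenario of seven equally strong independent predictors):

import math

r_observed = 0.24       # meta-analytic correlation between self-reported conscientiousness and GPA
valid_variance = 0.40   # assumed proportion of construct-valid variance (DeYoung, 2006)

r_corrected = r_observed / math.sqrt(valid_variance)    # about .38
explained_observed = r_observed ** 2                    # about 6% of GPA variance
explained_corrected = r_corrected ** 2                  # about 14% of GPA variance

# Hypothetical scenario: seven independent causal factors of this strength.
total_observed = 7 * explained_observed     # about 40%: much variance appears unexplained
total_corrected = 7 * explained_corrected   # about 100%: the "missing" variance may be invalid variance

print(round(r_corrected, 2), round(total_observed, 2), round(total_corrected, 2))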

CONCLUSION

This paper provided an introduction to the logic of a multi-method study of construct
validity. I showed how causal models of multi-method data can be used to obtain
quantitative estimates of the construct validity of personality measures. I showed that
accurate estimates of construct validity depend on the validity of the assumptions
underlying a causal model of multi-method data such as the assumption that methods are independent. I also showed that multi-method studies of construct validity require
postulating a causal construct that influences independent methods and thereby produces covariances among them. Multi-method studies for other constructs such as actual behaviours or act frequencies are more problematic because act frequencies do not predict a specific pattern of correlations across methods. Finally, I presented some preliminary evidence that commonly used self-ratings of personality are likely to have a moderate amount of valid variance that falls broadly in a range from 30% to 70% of the total variance. This estimate is consistent with meta-analyses of self-informant agreement (Connolly, Kavanagh, & Viswesvaran, 2007; Schneider & Schimmack, 2009). However, the existing evidence is limited and more rigorous tests of construct validity are needed. Moreover, studies with large, representative samples are needed to obtain more precise estimates of construct validity (Zou, Schimmack, & Gere, 2013). Hopefully, this paper will stimulate more research in this fundamental area of personality psychology by challenging the description of construct validity research as a Kafkaesque pursuit of an elusive goal that can never be reached (cf. Borsboom, 2006). Instead, empirical studies of construct validity are a viable and important scientific enterprise that faces the same challenges as other studies in personality psychology that try
to make sense of correlational data.

REFERENCES

Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1), 1–171.

Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766–781.

Biesanz, J. C., & West, S. G. (2004). Towards understanding assessments of the Big Five: Multitrait-multimethod analyses of convergent and discriminant validity across measurement occasion and type of observer. Journal of Personality, 72(4), 845–876.

Blanton, H., & Jaccard, J. (2006). Arbitrary metrics redux. American Psychologist, 61(1), 62–71.

Block, J. (1989). Critique of the act frequency approach to personality. Journal of Personality and Social Psychology, 56(2), 234–245.

Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.

Borsboom, D., & Mellenbergh, G. J. (2002). True scores, latent variables, and constructs: A comment on Schmidt and Hunter. Intelligence, 30(6), 505–514.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Chaplin, W. F. (2007). Moderator and mediator models in personality research: A basic introduction. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 602–632). New York, NY: Guilford Press.

Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5(1), 11–25.

Conley, J. J. (1985). Longitudinal stability of personality traits: A multitrait-multimethod-multioccasion analysis. Journal of Personality and Social Psychology, 49(5), 1266–1282.

Connolly, J. J., Kavanagh, E. J., & Viswesvaran, C. (2007). The convergent validity between self and observer ratings of personality: A meta-analytic review. International Journal of Selection and Assessment, 15(1), 110–117.

Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

DeYoung, C. G. (2006). Higher-order factors of the Big Five in a multi-informant sample. Journal of Personality and Social Psychology, 91(6), 1138–1151.

Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69(1), 130–141.

Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8(1), 38–60.
[Assumes a gold-standard method without systematic measurement error (e.g., that an objective measure of height or weight is available).]

Funder, D. C. (1991). Global traits—a Neo-Allportian approach to personality. Psychological Science, 2(1), 31–39.

Funder, D. C. (1995). On the accuracy of personality judgment—a realistic approach. Psychological Review, 102(4), 652–670.

Goldberg, L. R. (1992). The social psychology of personality. Psychological Inquiry, 3, 89–94.

Green, D. P., Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity in affect ratings. Journal of Personality and Social Psychology, 64(6), 1029–1041.

Grucza, R. A., & Goldberg, L. R. (2007). The comparative validity of 11 modern personality
inventories: Predictions of behavioral acts, informant reports, and clinical indicators. Journal of Personality Assessment, 89(2), 167–187.

John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 461–494). New York, NY: Guilford Press.

Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112(1), 165–172.

Kroh, M. (2005). Effects of interviews during body weight checks in general population surveys. Gesundheitswesen, 67(8–9), 646–655.

Lykken, D., & Tellegen, A. (1996). Happiness is a stochastic phenomenon. Psychological Science, 7(3), 186–189.

McCrae, R. R., & Costa, P. T. (1995). Trait explanations in personality psychology. European Journal of Personality, 9(4), 231–252.

Mroczek, D. K. (2007). The analysis of longitudinal data in personality research. In R.W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 543–556). New York, NY, US: Guilford Press.

Muthen, L. K., & Muthen, B. O. (2008). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthen & Muthen. 

Noftle, E. E., & Robins, R. W. (2007). Personality predictors of academic outcomes: Big five
correlates of GPA and SAT scores. Journal of Personality and Social Psychology, 93(1), 116–130.

Ozer, D. J. (2007). Evaluating effect size in personality research. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology. New York, NY: Guilford Press.

Riemann, R., Angleitner, A., & Strelau, J. (1997). Genetic and environmental influences on personality: A study of twins reared together using the self- and peer-report NEO-FFI scales. Journal of Personality, 65(3), 449–475.

Robins, R. W., & Beer, J. S. (2001). Positive illusions about the self: Short-term benefits and long-term costs. Journal of Personality and Social Psychology, 80(2), 340–352.

Rowland, M. L. (1990). Self-reported weight and height. American Journal of Clinical Nutrition, 52(6), 1125–1133.

Schimmack, U. (2007). The structure of subjective well-being. In M. Eid, & R. J. Larsen (Eds.), The science of subjective well-being (pp. 97–123). New York: Guilford.

Schimmack, U., Bockenholt, U., & Reisenzein, R. (2002). Response styles in affect ratings: Making a mountain out of a molehill. Journal of Personality Assessment, 78(3), 461–483.

Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible
information on life satisfaction judgments. Journal of Personality and Social Psychology,
89(3), 395–406.

Schimmack, U., Oishi, S., Furr, R. M., & Funder, D. C. (2004). Personality and life satisfaction: A facet-level analysis. Personality and Social Psychology Bulletin, 30(8), 1062–1075.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199–223.

Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376.

Simms, L. J., & Watson, D. (2007). The construct validation approach to personality scale construction. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 240–258). New York, NY: Guilford Press.

Terracciano, A., Costa, P. T., Jr., & McCrae, R. R. (2006). Personality plasticity after age 30. Personality and Social Psychology Bulletin, 32, 999–1009.

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness Implicit Association Test as a measure of subjective well-being. Journal of Research in Personality, 42(2), 490–497.

Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS Scales. Journal of Personality and Social Psychology, 54(6), 1063–1070.

Watson, D., Wiese, D., Vaidya, J., & Tellegen, A. (1999). The two general activation systems of affect: Structural findings, evolutionary considerations, and psychobiological evidence. Journal of Personality and Social Psychology, 76(5), 820–838.

Zou, C., Schimmack, U., & Gere, J. (2013). The validity of well-being measures: A multiple-indicator–multiple-rater model. Psychological Assessment, 25, 1247–1254.