All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Confused about statistics? Read more Cohen (and less Loken & Gelman)

*** Background.  The Loken and Gelman article “Measurement Error and the Replication Crisis” created a lot of controversy in the Psychological Methods Discussion Group. I believe the article is confusing and potentially misleading. For example, the authors do not clearly distinguish between unstandardized and standardized effect size measures, although random measurement error has different consequences for each.  I think a blog post by Gelman makes clear what the true purpose of the article is.

We talked about why traditional statistics are often counterproductive to research in the human sciences.

This explains why the article tries to construct one more fallacy in the use of traditional statistics, but fails to point out a simple solution for avoiding this fallacy.  Moreover, I argue in the blog post that Loken and Gelman committed several fallacies of their own in an attempt to discredit t-values and significance testing.

I asked Gelman to clarify several statements that made no sense to me. 

 “It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high”  (Loken and Gelman)

Ulrich Schimmack
Would you say that there is no meaningful difference between a z-score of 2 and a z-score of 4? These z-scores are significantly different from each other. Why would we not say that a study with a z-score of 4 provides stronger evidence for an effect than a study with a z-score of 2?

  • Andrew says:

    Ulrich:

    Sure, fair enough. The z-score provides some information. I guess I’d just say it provides less information than people think.

 

I believe that the article contains many more statements that are misleading and do not inform readers how t-values and significance testing work.  Maybe the article is not as bad as I think it is, but I am pretty sure that it provides less information than people think.

In contrast, Jacob Cohen has  provided clear and instructive recommendations for psychologists to improve their science.  If psychologists had listened to him, we wouldn’t have a replication crisis.

The main points to realize about random measurement error and replicability are:

1.  Neither population nor sample mean differences (or covariances) are effect sizes. They are statistics that provide some information about effects and the magnitude of effects.  The main problem in psychology has been the interpretation of mean differences in small samples as “observed effect sizes.”  Effects cannot be observed.

2.  Point estimates of effect sizes vary from sample to sample.  It is incorrect to interpret a point estimate as information about the size of an effect in a sample or a population. To avoid this problem, researchers should always report a confidence interval of plausible effect sizes. In small samples with just significant results, these intervals are wide and often extend close to zero.  Thus, no researcher should interpret a moderate to large point estimate when effect sizes close to zero are also consistent with the data.

3.  Random measurement error creates more uncertainty about effect sizes.  It has no systematic effect on unstandardized effect sizes, but it systematically lowers standardized effect sizes (correlations, Cohen’s d, amount of explained variance).

4.  Selection for significance inflates standardized and unstandardized effect size estimates.  Replication studies may fail if original studies were selected for significance, depending on the amount of bias introduced by selection for significance (this is essentially regression to the mean).

5. Because random measurement error attenuates standardized effect sizes, the inflation produced by selection for significance partially compensates for this attenuation.  Applying a correction formula (Spearman) to estimates after selection for significance would therefore produce even more inflated effect size estimates.

6.  The main cause of the replication crisis is undisclosed selection for significance.  Random measurement error has nothing to do with the replication crisis because random measurement error has the same effect on original and replication studies. Thus, it cannot explain why an original study was significant and a replication study failed to be significant.

Questionable Claims in Loken and Gelman’s Backpack Article

If you learned that a friend had run a mile in 5 minutes, you would be respectful; if you learned she had done it while carrying a heavy backpack, you would be awed. The obvious inference is that she would have been even faster without the backpack.

This makes sense. We assume that our friend’s running ability is relatively fixed, that everybody is slower with a heavy backpack, that the distance is really a mile, that the clock was working properly, and that no magic potion or tricks are involved.  As a result, we expect very little variability in our friend’s performance and an even faster time without the backpack.

But should the same intuition always be applied to research findings? Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise?

How do we translate this analogy?  Let’s say running 1 mile in 5 minutes corresponds to statistical significance. Any time below 5 minutes is significant and any time above 5 minutes is not significant.  The friend’s ability is the sample size: the larger the sample size, the easier it is to get a significant result.  Finally, the backpack is measurement error.  Just like a heavy backpack makes it harder to run 1 mile in 5 minutes, more measurement error makes it harder to get significance.

The question is whether it follows that the “associated effects” (the mean differences or regression coefficients that are used to estimate effect sizes) would have been stronger without random measurement error.

The answer is no.  This may not be obvious, but it directly follows from basic introductory statistics, like the formula for the t-statistic.

t-value  =  (mean difference / SD) * sqrt(N)/2     (for a two-group comparison with equal group sizes and total sample size N)

where SD reflects the variability of the construct in the population plus additional variability due to measurement error.  So, measurement error increases the SD in the denominator of the t-value, but it has no effect on the mean difference in the numerator (the unstandardized effect size estimate).
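
To make this concrete, here is a minimal simulation sketch (the sample size and SDs are illustrative choices, not taken from the article) showing that adding random measurement error leaves the average mean difference unchanged while shrinking the average t-value:

```python
# Minimal simulation sketch (illustrative numbers): random measurement error
# leaves the average mean difference unchanged, but inflates the SD in the
# denominator and therefore shrinks the expected t-value.
import numpy as np

rng = np.random.default_rng(1)
n = 200                     # participants per group
true_diff = 1.0             # population mean difference on the construct
sd_construct = 10.0         # SD of the construct
sd_error = np.sqrt(300.0)   # error SD chosen so the total SD doubles to 20

diffs, ts = [], []
for _ in range(10_000):
    g1 = rng.normal(0.0, sd_construct, n) + rng.normal(0.0, sd_error, n)
    g2 = rng.normal(true_diff, sd_construct, n) + rng.normal(0.0, sd_error, n)
    diff = g2.mean() - g1.mean()
    pooled_sd = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
    diffs.append(diff)
    ts.append(diff / (pooled_sd * np.sqrt(2 / n)))

print(np.mean(diffs))   # ~1.0: the mean difference is not biased by measurement error
print(np.mean(ts))      # ~0.5: half the expected t-value of ~1.0 with a perfect measure
```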

We caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger. 

With all due respect for trying to make statistics accessible, there is a trade-off between accessibility and sensibility.   First, statistical significance cannot be made stronger; a finding is either significant or it is not.  A test statistic like a t-value, however, can be made stronger or weaker depending on changes in its components.  If we interpret “that which does not kill” as “obtaining a significant result with a lot of random measurement error,” it is correct to expect a larger t-value, and stronger evidence against the null-hypothesis, in a study with a more reliable measure.  This follows directly from the effect of random error on the standard deviation in the denominator of the formula. So how can it be a fallacy to assume something that can be deduced from a mathematical formula? Maybe the authors are not talking about t-values.

It is understandable, then, that many researchers have the intuition that if they manage to achieve statistical significance under noisy conditions, the observed effect would have been even larger in the absence of noise.  As with the runner, they assume that without the burden—that is, uncontrolled variation—their effects would have been even larger.

Although this statement makes it clear that the authors are not talking about t-values, it is not clear why researchers should have the intuition that a study with a more reliable measure should produce larger effect sizes.  As shown above, random measurement error adds to the variability of observations, but it has no systematic effect on the mean difference or regression coefficient.

Now the authors introduce a second source of bias. Unlike random measurement error, this error is systematic and can lead to inflated estimates of effect sizes.

The reasoning about the runner with the backpack fails in noisy research for two reasons. First, researchers typically have so many “researcher degrees of freedom”—unacknowledged choices in how they prepare, analyze, and report their data—that statistical significance is easily found even in the absence of underlying effects and even without multiple hypothesis testing by researchers. In settings with uncontrolled researcher degrees of freedom, the attainment of statistical significance in the presence of noise is not an impressive feat.

The main reason for inferential statistics is to generalize results from a sample to a wider population.  The problem with these inductive inferences is that results vary from sample to sample. This variation is called sampling error.  Sampling error is separate from measurement error: even studies with perfect measures have sampling error, and sampling error is inversely related to sample size (2/sqrt(N) for a standardized mean difference).  Sampling error alone is again unbiased; it can produce larger or smaller mean differences.  However, if studies are split into significant and non-significant studies, the mean differences of significant results are inflated estimates, and the mean differences of non-significant results are deflated estimates, of the population mean difference.  So, effect size estimates in studies that are selected for significance are inflated. This is true even in studies with perfectly reliable measures.
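
A quick simulation sketch of this significance filter (hypothetical numbers, perfectly reliable measure assumed) illustrates how conditioning on p < .05 inflates mean differences:

```python
# Simulation sketch of the significance filter (hypothetical numbers,
# perfectly reliable measure): before selection the mean difference is
# unbiased; among significant results it is strongly inflated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, true_diff, sd = 20, 2.0, 10.0          # small groups, small true difference

diffs, pvals = [], []
for _ in range(20_000):
    g1 = rng.normal(0.0, sd, n)
    g2 = rng.normal(true_diff, sd, n)
    diffs.append(g2.mean() - g1.mean())
    pvals.append(stats.ttest_ind(g2, g1).pvalue)

diffs, pvals = np.array(diffs), np.array(pvals)
sig = (pvals < .05) & (diffs > 0)
print(diffs.mean())        # ~2.0: unbiased before selection
print(diffs[sig].mean())   # ~7-8: strongly inflated after selection for significance
```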

In a study with noisy measurements and small or moderate sample size, standard errors will be high and statistically significant estimates will therefore be large, even if the underlying effects are small.

To give an example, assume there were a height difference of 1 cm between brown-eyed and blue-eyed individuals.  The standard deviation of height is 10 cm.  A study with 400 participants has a sampling error of 2 * 10 cm / sqrt(400) = 1 cm.  To achieve significance, the mean difference has to be about twice as large as the sampling error (t = 2 ~ p = .05).  Thus, a significant result requires a mean difference of 2 cm, which is 100% larger than the population mean difference in height.

Another researcher uses an unreliable measure of height (25% reliability) that quadruples the variance (100 cm^2 vs. 400 cm^2) and doubles the standard deviation (10 cm vs. 20 cm).  The sampling error also doubles to 2 cm, and now a mean difference of 4 cm is needed to achieve significance with the same t-value of 2 as in the study with the perfect measure.

The mean difference is two times larger than before and four times larger than the mean difference in the population.

The fallacy would be to look at this difference of 4 cm and to believe that an even larger difference could have been obtained with a more reliable measure.   This is a fallacy, but not for the reasons the authors suggest.  The fallacy is to assume that random measurement error in the measure of height reduced the estimate of 4 cm and that an even bigger difference would be obtained with a more reliable measure.  This is a fallacy because random measurement error does not influence the mean difference of 4 cm.  Instead, it increased the standard deviation; with a more reliable measure the standard deviation would be smaller (10 cm), the sampling error would be 1 cm, and the mean difference of 4 cm would have a t-value of 4 rather than 2, which is significantly stronger evidence for an effect.
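
A short arithmetic check of this hypothetical example, assuming a two-group design with 200 participants per group (N = 400):

```python
# Arithmetic check of the hypothetical height example: doubling the SD doubles
# the standard error and the minimum significant difference, while a fixed
# 4 cm difference corresponds to t = 4 with the reliable measure and t = 2
# with the noisy one.
import math

n_total = 400
for sd in (10.0, 20.0):
    se = 2 * sd / math.sqrt(n_total)   # standard error of the mean difference
    print(sd, se, 2 * se, 4.0 / se)    # SD, SE, min. significant difference (t = 2), t for a 4 cm difference
# 10.0 1.0 2.0 4.0
# 20.0 2.0 4.0 2.0
```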

How can the authors overlook that random measurement error has no influence on mean differences?  The reason is that they do not clearly distinguish between standardized and unstandardized estimates of effect sizes.

Spearman famously derived a formula for the attenuation of observed correlations due to unreliable measurement. 

Spearman’s formula applies to correlation coefficients, and correlation coefficients are standardized measures of effect size because the covariance is divided by the standard deviations of both variables.  Similarly, Cohen’s d is a standardized coefficient because the mean difference is divided by the pooled standard deviation of the two groups.
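
For reference, here is a minimal sketch of Spearman's attenuation formula and its inverse (the disattenuation correction), assuming the reliabilities of both measures are known; the numbers are illustrative:

```python
# Minimal sketch of Spearman's attenuation formula and the corresponding
# disattenuation correction, assuming known reliabilities.
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation after adding random measurement error."""
    return true_r * math.sqrt(rel_x * rel_y)

def disattenuated_r(observed_r, rel_x, rel_y):
    """Spearman's correction: estimate of the correlation between true scores."""
    return observed_r / math.sqrt(rel_x * rel_y)

# Example: a measure with 25% reliability (as in the height example above)
# paired with a perfectly reliable variable halves the observed correlation.
print(attenuated_r(0.30, 0.25, 1.0))      # 0.15
print(disattenuated_r(0.15, 0.25, 1.0))   # back to 0.30
```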

Random measurement error clearly influences standardized effect size estimates because the standard deviation is used to standardize effect sizes.

The true population mean difference of 1 cm divided by the population standard deviation of 10 cm yields a Cohen’s d = .10; that is, one-tenth of a standard deviation.

In the example, the mean difference for a just significant result with a perfect measure was 2 cm, which yields a Cohen’s d = 2 cm divided by 10 cm = .20, two-tenths of a standard deviation.

The mean difference for a just significant result with a noisy measure was 4 cm, which yields a standardized effect size of 4 cm divided by 20 cm = .20, also two-tenths of a standard deviation.

Thus, the inflation of the mean difference is proportional to the increase in the standard deviation.  As a result, the standardized effect size is the same for the perfect measure and the unreliable measure.

Compared to the true mean difference of one-tenth of a standard deviation, the standardized effect sizes are both inflated by the same amount (d = .20 vs. d = .10, 100% inflation).

This example shows the main point the authors are trying to make.  Standardized effect size estimates are attenuated by random measurement error. At the same time, random measurement error increases sampling error, and the mean difference has to be inflated to get significance.  This inflation already compensates for the attenuation of standardized effect sizes, and any additional correction for unreliability with the Spearman formula would inflate effect size estimates rather than correct for attenuation.
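
A simulation sketch of this point, using the hypothetical height example (true difference of 1 cm, total SD of 10 cm vs. 20 cm, 200 participants per group), suggests that the average significant Cohen's d is indeed roughly the same for the reliable and the unreliable measure:

```python
# Simulation sketch: after selecting for statistical significance, the average
# significant Cohen's d is roughly the same with a perfect measure (SD = 10)
# and an unreliable measure (SD = 20), because the inflation from the
# significance filter roughly offsets the attenuation from measurement error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, true_diff = 200, 1.0                        # per group; true difference of 1 cm

for sd_error in (0.0, np.sqrt(300.0)):         # total SD of 10 cm or 20 cm
    ds, ps = [], []
    for _ in range(20_000):
        g1 = rng.normal(0.0, 10.0, n) + rng.normal(0.0, sd_error, n)
        g2 = rng.normal(true_diff, 10.0, n) + rng.normal(0.0, sd_error, n)
        pooled_sd = np.sqrt((g1.var(ddof=1) + g2.var(ddof=1)) / 2)
        ds.append((g2.mean() - g1.mean()) / pooled_sd)
        ps.append(stats.ttest_ind(g2, g1).pvalue)
    ds, ps = np.array(ds), np.array(ps)
    sig = (ps < .05) & (ds > 0)
    print(round(ds[sig].mean(), 2))            # ~0.25 and ~0.24
```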

This would have been a noteworthy observation, but the authors suggest that random measurement error can even have paradoxical effects on effect size estimates.

But in the small-N setting, this will not hold; the observed correlation can easily be larger in the presence of measurement error (see the figure, middle panel).

This statement is confusing because the most direct effect of measurement error on standardized effect sizes is attenuation.  In the height example, any observed mean difference is divided by 20 rather than 10, reducing the standardized effect size by 50%. The sampling variability of these standardized effect sizes is simply a function of sample size and is therefore the same for both measures.  Thus, it is not clear how a study with more measurement error can produce larger standardized effect sizes.  As demonstrated above, the inflation produced by the significance filter at most compensates for the attenuation due to random measurement error.  There is no paradox by which researchers obtain stronger evidence (larger t-values or larger standardized effect sizes) with noisier measures, even if results are selected for significance.

Our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies. If it really were true that effect sizes were always attenuated by measurement error, then it would be all the more impressive to have achieved significance.

This makes no sense. If random measurement error attenuates effect sizes, it cannot be used to justify surprisingly large mean differences.  Either we are talking about unstandardized effect sizes, which are not influenced by measurement error, or we are talking about standardized effect sizes, which are attenuated by measurement error, so obtaining large mean differences remains surprising.  If the true mean difference is 1 cm and an effect of 4 cm is needed to get significance with SD = 20 cm, it is surprising to get significance because the power to do so is only about 8%.  Of course, it is only surprising if we know that the population effect size is only 1 cm, but the main point is that we cannot use random measurement error to justify large effect sizes because random measurement error always attenuates standardized effect size estimates.
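
A quick check of this power figure, assuming a two-sample t-test with 200 participants per group, a true mean difference of 1 cm, and an SD of 20 cm:

```python
# Quick check of the power figure above (two-sample t-test, two-tailed alpha = .05).
from scipy import stats

n = 200                                    # per group
se = 20.0 * (2 / n) ** 0.5                 # standard error of the mean difference = 2 cm
ncp = 1.0 / se                             # noncentrality parameter = 0.5
df = 2 * n - 2
t_crit = stats.t.ppf(0.975, df)

power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(round(power, 2))                     # ~0.08
```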

More confusing claims follow.

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

As explained above, random measurement error makes t-values weaker not stronger. It therefore makes no sense to attribute strong t-values to random measurement error as a potential source of variance.  The most likely explanation for strong effect sizes in studies with large sampling error is selection for significance, not random measurement error.

After all of these confusing claims the authors end with a key point.

A key point for practitioners is that surprising results from small studies should not be defended by saying that they would have been even better with improved measurement.

This is true, but it is neither a logical argument nor an argument researchers actually make.  The bigger problem is that researchers do not realize that the significance filter makes it necessary to find moderate to large effect size estimates and that sampling error in small samples alone can produce these estimates, especially when questionable research practices are being used.  No claims about hypothetically larger effect sizes are necessary or regularly made.

Next the authors simply make random statements about significance testing that reveal their ideological bias rather than adding to the understanding of t-values.

It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high.

Of course, the t-value is a measure of the strength of evidence against the null-hypothesis, typically the hypothesis that the data were obtained without a mean difference in the population.  The larger the t-value, the less likely it is that the observed t-value could have been obtained without a population mean difference in the direction of the mean difference in the sample.  And with t-values of 4 or higher, published results also have a high probability of replicating a significant result in a replication study (Open Science Collaboration, 2015).  It can be debated whether a t-value of 2 is weak, moderate or strong evidence, but it is not debatable whether t-values provide information that can be used for inductive inferences.  Even Bayes-Factors rely on t-values.  So, the authors’ criticism of t-values makes little sense from any statistical perspective.
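
A rough sketch of this replicability claim, under the strong (and usually optimistic) assumption that the observed t-value of 4 equals the true noncentrality of an exact replication with the same design:

```python
# Rough sketch: expected replication power if the observed t = 4 equals the
# true noncentrality of an exact replication (a strong assumption).
from scipy import stats

df, ncp = 398, 4.0                         # df for a study with N = 400; any large df gives a similar answer
t_crit = stats.t.ppf(0.975, df)
power = (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)
print(round(power, 2))                     # ~0.98
```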

It is also a mistake to assume that the observed effect size would have been even larger if not for the burden of measurement error. Intuitions that are appropriate when measurements are precise are sometimes misapplied in noisy and more probabilistic settings.

Once more, these broad claims are false and misleading.  Everything else being equal, estimates of standardized effect sizes are attenuated by random measurement error and would be larger if a more reliable measure had been used.  Once selection for significance is present, it inflates standardized effect size estimates obtained with perfect measures, and it partially counteracts the attenuation of standardized effect size estimates obtained with unreliable measures.

In the end, the authors try to link their discussion of random measurement error to the replication crisis.

The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin. We would all run faster without a backpack on our backs. But when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.

This is confusing. Replicability is a function of power, and power is a function of the population mean difference and the sampling error of the design of a study.  Random measurement error increases the standard deviation and thereby the sampling error, which reduces standardized effect sizes, power, and replicability.  As a result, studies with unreliable measures are less likely to produce significant results in original studies and in replication studies.

The only reason for surprising replication failures (e.g., 100% significant original studies and 25% significant replication studies in social psychology; OSC, 2015) is the use of questionable practices that inflate the percentage of significant results in original studies.  It is irrelevant whether the original result was produced with a small population mean difference and a reliable measure or with a moderate population mean difference and an unreliable measure.  What matters is only how large the standardized mean difference is for the measure that was actually used.  That is, replicability is the same for a height difference of 1 cm with a perfect measure and a standard deviation of 10 cm as for a height difference of 2 cm with a noisy measure and a standard deviation of 20 cm.  However, the chance of obtaining a significant result if the mean difference is 1 cm and the SD is 20 cm is lower because the noisy measure reduces the standardized effect size to Cohen’s d = 1 cm / 20 cm = 0.05.
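
A short sketch of this comparison, assuming a two-group design with 200 participants per group (N = 400) and a two-tailed alpha of .05:

```python
# Sketch of the replicability comparison above: power is identical for
# d = 1/10 and d = 2/20 (both 0.10), but lower for d = 1/20 (0.05).
from scipy import stats

def power_two_group(d, n_per_group, alpha=0.05):
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

print(round(power_two_group(0.10, 200), 2))   # d = 1/10 or 2/20: ~0.17
print(round(power_two_group(0.05, 200), 2))   # d = 1/20: ~0.08
```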

Conclusion

Loken and Gelman wrote a very confusing article about measurement error.  Although confusion about statistics is the norm among social scientists, it is surprising that a statistician has problems explaining basic statistical concepts and how they relate to the outcomes of original and replication studies.

The most probable explanation for the confusion is that the authors seem to believe that the combination of random measurement error and large sampling error creates a novel problem that has been overlooked.

Measurement error and selection bias thus can combine to exacerbate the replication crisis.

In the large-N scenario, adding measurement error will almost always reduce the observed correlation.  Take these scenarios and now add selection on statistical significance… for smaller N, a fraction of the observed effects exceeds the original. 

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

“Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small”

The quotes suggest that the authors believe something extraordinary is happening in studies with large random measurement error and small samples.  However, this is not the case. Random measurement error attenuates t-values, selection for significance inflates them, and these two effects are independent.  There is no evidence that random measurement error suddenly inflates effect size estimates in small samples, with or without selection for significance.

Recommendations for Researchers 

It is also disconcerting that the authors fail to give recommendations for how researchers can avoid these fallacies, even though such recommendations have been made before and would easily fix the problems associated with the interpretation of effect sizes in studies with noisy measures and small samples.

The main problem in noisy studies is that point estimates of effect sizes are not a meaningful statistic.   This is not necessarily a problem. Many exploratory studies in psychology aim to examine whether there is an effect at all and whether this effect is positive or negative.  A statistically significant result only allows researchers to infer that a positive or negative effect contributed to the outcome of the study (because the extreme t-value falls into a range of values that are unlikely without an effect). So, conclusions should be limited to a discussion of the sign of the effect.

Unfortunately, psychologists have misinterpreted Jacob Cohen’s work and started to interpret the standardized coefficients (correlation coefficients or Cohen’s d) that they observed in their samples as effect sizes.  To make matters worse, these coefficients are sometimes called observed effect sizes, as in the article by Loken and Gelman.

This might have been a reasonable term for trained statisticians, but for poorly trained psychologists it suggested that this number tells them something about the magnitude of the effect they were studying.  After all, this seems a reasonable interpretation of the term “observed effect size.”  They then used Cohen’s book to interpret these values as evidence that they obtained a small, moderate, or large effect.  In small studies, the effect size estimates have to be moderate (2 groups, n = 20, p = .05 => d = .64) to reach significance.
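
A quick check of the d = .64 figure, assuming a two-tailed two-sample t-test with n = 20 per group:

```python
# Smallest observed Cohen's d that reaches p < .05 with n = 20 per group.
from scipy import stats

n = 20
df = 2 * n - 2
t_crit = stats.t.ppf(0.975, df)
d_min = t_crit * (2 / n) ** 0.5
print(round(d_min, 2))   # ~0.64
```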

However, Cohen explicitly warned against this use of effect sizes. He developed standardized effect size measures to help researchers plan studies that can provide meaningful tests of hypotheses: a small effect size requires a large sample.  If researchers think an effect is small, they shouldn’t run a study with 40 participants, because such a study is so noisy that it is likely to fail.  So, standardized effect sizes were intended to be assumptions about unobservable population parameters, not descriptions of observed results.

However, psychologists ignored Cohen’s guidelines for the planning of studies. Instead, they used his standardized effect sizes to examine how strong the “observed effects” in their studies were.  This misinterpretation of Cohen is partially responsible for the replication crisis because researchers ignored the significance filter and were happy to report that they consistently observed moderate to large effect sizes.

However, they also consistently observed replication failures in their own labs.  This was puzzling because moderate to large effects should be easy to replicate.  Without training in statistics, social psychologists found an explanation for this variability of observed effect sizes as well: surely, the variability in observed effect sizes (!) from study to study meant that their results were highly dependent on context.  I still remember joking with other social psychologists that effects even depended on the color of research assistants’ shirts.  Only after reading Cohen did I understand what was really happening.  In studies with large sampling error, the “observed effect sizes” move around a lot because they are not observations of effects.  Most of the variation in mean differences from study to study is purely random sampling error.

At the end of his career, Cohen seemed to have lost faith in psychology as a science.  He wrote a dark and sarcastic article titled “The Earth Is Round (p < .05).”  In this article, he proposes simple remedies for the misinterpretation of “observed effect sizes” in small samples.  The abstract of this article is more informative and valuable than Loken and Gelman’s entire article.

Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication.

The key point is that any sample statistic, like an “effect size estimate” (not an observed effect size), has to be considered in the context of the precision of the estimate.  Nobody would take a public opinion poll seriously if it were conducted with 40 respondents and reported that 55% of respondents favor a candidate, when the accompanying 95% confidence interval ranges from 40% to 70%.
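
A quick check of the poll example, assuming a simple normal-approximation confidence interval for a proportion of .55 based on 40 respondents:

```python
# Normal-approximation 95% CI for a proportion of .55 with n = 40.
import math

p_hat, n = 0.55, 40
se = math.sqrt(p_hat * (1 - p_hat) / n)
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(round(lo, 2), round(hi, 2))   # ~0.40 and ~0.70
```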

The same is true for the many articles that report effect size estimates without confidence intervals.  For studies with just significant results, significance at least translates into a confidence interval that does not contain the value specified by the null-hypothesis, typically zero.  For a just significant result this means that the boundary of the confidence interval is close to zero.  So, researchers are justified in interpreting the result as evidence about the sign of an effect, but the effect size remains uncertain.  Nobody would rush to buy stock in a drug company if it reported that its new drug extends life expectancy by somewhere between 1 day and 3 years.  But if we are misled into focusing on an observed effect size of 1.5 years, we might be foolish enough to invest in the company and lose money.

In short, noisy studies with unreliable measures and wide confidence intervals cannot be used to make claims about effect sizes.   The reporting of standardized effect size measures can be useful for meta-analyses or to help future researchers plan their studies, but researchers should never interpret their point estimates as observed effect sizes.

Final Conclusion

Although mathematics and statistics are fundamental to all quantitative, empirical sciences, each scientific discipline has its own history, terminology, and unique challenges.  Political science differs from psychology in many ways.  On the one hand, political science has access to large representative samples because there is a lot of interest in these kinds of data and a lot of money is spent on collecting them.  These data make it possible to obtain relatively precise estimates. The downside is that many data are unique to a historic context. The 2016 election in the United States cannot be replicated.

Psychology is different.  Research budgets and ethics often limit sample sizes.  However, psychologists can increase power with within-subject designs and many repeated measurements, something political scientists often cannot do.  In addition, studies in psychology can be replicated because the results are less sensitive to a particular historic context (and yes, there are many replicable findings in psychology that generalize across time and culture).

Gelman knows about as much about psychology as I know about political science. Maybe his article is more useful for political scientists, but psychologists would be better off if they finally recognized the important contributions of one of their own methodologists.

To paraphrase Cohen: Sometimes reading less is more, except for Cohen.


A Quantitative Science Needs to Quantify Validity

Background

This article was published in a special issue of the European Journal of Personality.   It examines the unresolved issue of validating psychological measures from the perspective of a multi-method approach (Campbell & Fiske, 1959), using structural equation modeling.

I think it provides a reasonable alternative to the current interest in modeling residual variance in personality questionnaires (network perspective) and solves the problems of manifest personality measures that are confounded by systematic measurement error.

Although latent variable models of multi-method data have been used in structural analyses (Biesanz & West, 2004; DeYoung, 2006), these studies have rarely been used to estimate the validity of personality measures.  This article shows how this can be done and what assumptions need to be made to interpret latent factors as variance in true personality traits.

Hopefully, sharing this article openly on this blog can generate some discussion about the future of personality measurement in psychology.

=====================================================================

What Multi-Method Data Tell Us About Construct Validity
ULRICH SCHIMMACK*
University of Toronto Mississauga, Canada

European Journal of Personality
Eur. J. Pers. 24: 241–257 (2010)
DOI: 10.1002/per.771  [for original article]

Abstract

Structural equation modelling of multi-method data has become a popular method to examine construct validity and to control for random and systematic measurement error in personality measures. I review the essential assumptions underlying causal models of multi-method data and their implications for estimating the validity of personality measures. The main conclusions are that causal models of multi-method data can be used to obtain quantitative estimates of the amount of valid variance in measures of personality dispositions, but that it is more difficult to determine the validity of personality measures of act frequencies and situation-specific dispositions.

Key words: statistical methods; personality scales and inventories; regression methods; history of psychology; construct validity; causal modelling; multi-method; measurement

INTRODUCTION

Fifty years ago, Campbell and Fiske (1959) published the groundbreaking article Convergent and Discriminant Validation by the Multitrait-Multimethod Matrix. With close to 5000 citations (Web of Science, February 1, 2010), it is the most cited article in Psychological Bulletin. The major contribution of this article was to outline an empirical procedure for testing the validity of personality measures. It is difficult to overestimate the importance of this contribution because it is impossible to test personality theories empirically without valid measures of personality.

Despite its high citation count, Campbell and Fiske’s work is often neglected in introductory textbooks, presumably because validation is considered to be an obscure and complicated process (Borsboom, 2006). Undergraduate students of personality psychology learn little more than the definition of a valid measure as a measure that measures what it is supposed to measure.

However, they are not taught how personality psychologists validate their measures. One might hope that aspiring personality researchers learn about Campbell and Fiske’s multi-method approach during graduate school. Unfortunately, even handbooks dedicated to research methods in personality psychology pay relatively little attention to Campbell and Fiske’s (1959) seminal contribution (John & Soto, 2007; Simms & Watson, 2007). More importantly, construct validity is often introduced in qualitative terms.

In contrast, when Cronbach and Meehl (1955) introduced the concept of construct validity, they proposed a quantitative definition of construct validity as the proportion of construct-related variance in the observed variance of a personality measure. Although the authors noted that it would be difficult to obtain precise estimates of construct validity coefficients (CVCs), they stressed the importance of estimating ‘as definitely as possible the degree of validity the test is presumed to have’ (p. 290).

Campbell and Fiske’s (1959) multi-method approach paved the way to do so. Although Campbell and Fiske’s article examined construct validity qualitatively, subsequent developments in psychometrics allowed researchers to obtain quantitative estimates of construct validity based on causal models of multi-method data (Eid, Lischetzke, Nussbeck, & Trierweiler, 2003; Kenny & Kashy, 1992). Research articles in leading personality journals routinely report these estimates (Biesanz & West, 2004; DeYoung, 2006; Diener, Smith, & Fujita, 1995), but a systematic and accessible introduction to causal models of multi-method data is lacking.

The main purpose of this paper is to explain how causal models of multi-method data can be used to obtain quantitative estimates of construct validity and which assumptions these models make to yield accurate estimates.

I prefer the term causal model to the more commonly used term structural equation model because I interpret latent variables in these models as unobserved, yet real causal forces that produce variation in observed measures (Borsboom, Mellenbergh, & van Heerden, 2003). I make the case below that this realistic interpretation of latent factors is necessary to use multi-method data for construct validation research because the assumption of causality is crucial for the identification of latent variables with construct variance (CV).

Campbell and Fiske (1959) distinguished absolute and relative (construct) validity. To examine relative construct validity it is necessary to measure multiple traits and to look for evidence of convergent and discriminant validity in a multi-trait-multi-method matrix (Simms & Watson, 2007). However, to examine construct validity in an absolute sense, it is only necessary to measure one construct with multiple methods.

In this paper, I focus on convergent validity across multiple measures of a single construct because causal models of multi-method data rely on convergent validity alone to examine construct validity.

As discussed in more detail below, causal models of multi-method data estimate construct validity quantitatively with the factor loadings of observed personality measures on a latent factor (i.e. an unobserved variable) that represents the valid variance of a construct. The amount of valid variance in a personality measure can be obtained by squaring its factor loading on this latent factor. In this paper, I use the term construct validity coefficient (CVC) to refer to the factor loading and the term construct variance (CV) for the amount of valid variance in a personality measure.

Validity

A measure is valid if it measures what it was designed to measure. For example, a thermometer is a valid measure of temperature in part because the recorded values covary with humans’ sensory perceptions of temperature (Cronbach & Meehl, 1955). A modern thermometer is a more valid measure of temperature than humans’ sensory perceptions, but the correlation between scores on a thermometer and humans’ sensory perceptions is necessary to demonstrate that a thermometer measures temperature. It would be odd to claim that highly reliable scores recorded by an expensive and complicated instrument measure temperature if these scores were unrelated to humans’ everyday perceptions of temperature.

The definition of validity as a property of a measure has important implications for empirical tests of validity. Namely, researchers first need a clearly defined construct before they can validate a potential measure of the construct. For example, to evaluate a measure of anxiety researchers first need to define anxiety and then examine the validity of a measure as a measure of anxiety. Although the importance of clear definitions for construct validation research may seem obvious, validation research often seems to work in the opposite direction; that is, after a measure has been created psychologists examine what it measures.

For example, the widely used Positive Affect and Negative Affect Schedule (PANAS) has two scales named Positive Affect (PA) and Negative Affect (NA). These scales are based on exploratory factor analyses of mood ratings (Watson, Clark, & Tellegen, 1988). As a result, Positive Affect and Negative Affect are merely labels for the first two VARIMAX rotated principal components that emerged in these analyses. Thus, it is meaningless to examine whether the PANAS scales are valid measures of PA and NA. They are valid measures of PA and NA by definition because PA and NA are mere labels of the two VARIMAX rotated principal components that emerge in factor analyses of mood ratings.

A construct validation study would have to start with an a priori definition of Positive Affect and Negative Affect that does not refer to the specific measurement procedure that was used to create the PANAS scales. For example, some researchers have defined Positive Affect and Negative Affect as the valence of affective experiences and have pointed out problems of the PANAS scales as measures of pleasant and unpleasant affective experiences (see Schimmack, 2007, for a review).

However, the authors of the PANAS do not view their measure as a measure of hedonic valence. To clarify their position, they proposed to change the labels of their scales from Positive Affect and Negative Affect to Positive Activation and Negative Activation (Watson, Wiese, Vaidya, & Tellegen, 1999). The willingness to change labels indicates that the PANAS scales do not measure a priori defined constructs, and as a result there is no criterion to evaluate the construct validity of the PANAS scales.

The previous example illustrates how personality measures assume a life of their own and implicitly become the construct; that is, a construct is operationally defined by the method that is used to measure it (Borsboom, 2006). A main contribution of Campbell and Fiske’s (1959) article was to argue forcefully against operationalism and for a separation of constructs and methods. This separation is essential for validation research because validation research has to allow for the possibility that some of the observed variance is invalid.

Other sciences clearly follow this approach. For example, physics has clearly defined concepts such as time or temperature. Over the past centuries, physicists have developed increasingly precise ways of measuring these concepts, but the concepts have remained the same. Modern physics would be impossible without these advances in measurement. However, psychologists do not follow this model of more advanced sciences. Typically, a measure becomes popular and after it becomes popular it is equated with the construct. As a result, researchers continue to use old measures and rarely attempt to create better measures of the same construct. Indeed, it is hard to find an example in which one measure of a construct has replaced another measure of the same construct based on an empirical comparison of the construct validity of competing measures of the same construct (Grucza & Goldberg, 2007).

One reason for the lack of progress in the measurement of personality constructs could be the belief that it is impossible to quantify the validity of a measure. If it were impossible to quantify the validity of a measure, then it also would be impossible to say which of two measures is more valid. However, causal models of multi-method data produce quantitative estimates of validity that allow comparisons of the validity of different measures.

One potential obstacle for construct validation research is the need to define psychological constructs a priori without reference to empirical data. This can be difficult for constructs that make reference to cognitive processes (e.g. working memory capacity) or unconscious motives (implicit need for power). However, the need for a priori definitions is not a major problem in personality psychology. The reason is that everyday language provides thousands of relatively well-defined personality constructs (Allport & Odbert, 1936). In fact, all measures in personality psychology that are based on the lexical hypothesis assume that everyday concepts such as helpful or sociable are meaningful personality constructs. At least with regard to these relatively simple constructs, it is possible to test the construct validity of personality measures. For example, it is possible to examine whether a sociability scale really measures sociability and whether a measure of helpfulness really measures helpfulness.

Convergent validity

I start with a simple example to illustrate how psychologists can evaluate the validity of a personality measure. The concept is people’s weight. Weight can be defined as ‘the vertical force exerted by a mass as a result of gravity’ (wordnet.princeton.edu). In the present case, only the mass of human adults is of interest. The main question, which has real practical significance in health psychology (Kroh, 2005), is to examine the validity of self-report measures of weight because it is more economical to use self-reports than to weigh people with scales.

To examine the validity of self-reported weight as a measure of actual weight, it is possible to obtain self-reports of weight and an objective measure of weight from the same individuals. If self-reports of weight are valid, they should be highly correlated with the objective measure of weight. In one study, participants first reported their weight before their weight was objectively measured with a scale several weeks later (Rowland, 1990). The correlation in this study was r (N = 11,284) = .98. The implications of this finding for the validity of self-reports of weight depend on the causal processes that underlie this correlation, which can be examined by means of causal modelling of correlational data.

It is well known that a simple correlation does not reveal the underlying causal process, but that some causal process must explain why a correlation was observed (Chaplin, 2007). Broadly speaking, a correlation is determined by the strength of four causal effects, namely, the effect of observed variable A on observed variable B, the effect of observed variable B on observed variable A, and the effects of an unobserved variable C on observed variable A and on observed variable B.

In the present example, the observed variables are the self-reported weights and those recorded by a scale. To make inferences about the validity of self-reports of weight it is necessary to make assumptions about the causal processes that produce a correlation between these two methods. Fortunately, it is relatively easy to do so in this example. First, it is fairly certain that the values recorded by a scale are not influenced by individuals’ self-reports. No matter how much individuals insist that the scale is wrong, it will not change its score. Thus, it is clear that the causal effect of self-reports on the objective measure is zero. It is also clear that self-reports of weight in this study were not influenced by the objective measurement of weight in this study because self-reports were obtained weeks before the actual weight was measured. Thus, the causal effect of the objectively recorded scores on self-ratings is also zero. It follows that the correlation of r = .98 must have been produced by a causal effect of an unobserved third variable. A plausible third variable is individuals’ actual mass. It is their actual mass that causes the scale to record a higher or lower value, and their actual mass also caused them to report a specific weight. The latter causal effect is probably mediated by prior objective measurements with other scales, and the validity of these scales would influence the validity of self-reports among other factors (e.g. socially desirable responding). In combination, the causal effects of actual mass on self-reports and on the scale produce the observed correlation of r = .98. This correlation is not sufficient to determine how strong the effects of weight on the two measures are. It is possible that the scale was a perfect measure of weight. In this case, the correlation between weight and the values recorded by the scale is 1. It follows that the size of the effect of weight on self-reports of weight (or the factor loading of self-reported weight on the weight factor) has to be r = .98 to produce the observed correlation of r = .98 (1 * .98 = .98). In this case, the CVC of the self-report measure of weight would be .98. However, it is also possible that the scale is a slightly imperfect measure of weight. For example, participants may not have removed their shoes before stepping on the scale and differences in the weight of shoes (e.g. boots versus sandals) could have produced measurement error in the objective measure of individuals’ true weight. It is also possible that changes in weight over time reduce the validity of objective scores as a validation criterion for self-ratings obtained several weeks earlier. In this case, the estimate underestimates the validity of self-ratings.

In the present context, the reasons for the lack of perfect convergent validity are irrelevant. The main point of this example was to illustrate how the correlation between two independent measures of the same construct can be used to obtain quantitative estimates of the validity of a personality measure. In this example, a conservative estimate of the CVC of self-reported weight as a measure of weight is .98 and the estimated amount of CV in the self-report measure is 96% (.98^2 = .96).
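
An illustrative simulation sketch of this logic (hypothetical data; the scale is assumed to be a perfect measure and the self-report is assumed to load .98 on true weight) reproduces the observed correlation and the CV estimate:

```python
# Illustrative sketch of the weight example with hypothetical data.
import numpy as np

rng = np.random.default_rng(0)
n = 11_284
true_weight = rng.normal(70.0, 10.0, n)            # hypothetical true weights in kg

scale = true_weight                                # perfect objective measure (loading 1.0)
z = (true_weight - true_weight.mean()) / true_weight.std()
self_report = 0.98 * z + np.sqrt(1 - 0.98 ** 2) * rng.normal(0.0, 1.0, n)

r = np.corrcoef(self_report, scale)[0, 1]
print(round(r, 2), round(r ** 2, 2))               # ~0.98 (CVC) and ~0.96 (CV)
```
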
The example of self-reported weight was used to establish four important points about construct validity. First, the example shows that convergent validity is sufficient to examine construct validity. The question of how self-reports of weight are related to measures of other constructs (e.g. height, socially desirable responding) can be useful to examine sources of measurement error, but correlations with measures of other constructs are not needed to estimate CVCs. Second, empirical tests of construct validity do not have to be an endless process without clear results (Borsboom, 2006). At least for some self-report measures it is possible to provide a meaningful answer to the question of their validity. Third, validity is a quantitative construct. Qualitative conclusions that a measure is valid because validity is not zero (CVC > 0, p < .05) or that a measure is invalid because validity is not perfect (CVC < 1.0, p < .05) are not very helpful because most measures are both valid and invalid (0 < CVC < 1). As a result, qualitative reviews of validity studies are often the source of fruitless controversies (Schimmack & Oishi, 2005). The validity of personality measures should be estimated quantitatively like other psychometric properties such as reliability coefficients, which are routinely reported in research articles (Schmidt & Hunter, 1996).

Validity is more important than reliability because reliable and invalid measures are potentially more dangerous than unreliable measures (Blanton & Jaccard, 2006). Moreover, it is possible that a less reliable measure is more valid than a more reliable measure if the latter measure is more strongly contaminated by systematic measurement error (John & Soto, 2007). A likely explanation for the emphasis on reliability is the common tendency to equate constructs with measures. If a construct is equated with a measure, only random error can undermine the validity of a measure. The main contribution of Campbell and Fiske (1959) was to point out that systematic measurement error can also threaten the validity of personality measures. As a result, high reliability is insufficient evidence for the validity of a personality measure (Borsboom & Mellenbergh, 2002).

The fourth point illustrated in this example is that tests of convergent validity require independent measures. Campbell and Fiske (1959) emphasized the importance of independent measures when they defined convergent validity as the correlation between ‘maximally different methods’ (p. 83). In a causal model of multi-method data the independence assumption implies that the only causal effects that produce a correlation between two measures of the same construct are the causal effects of the construct on the two measures. This assumption implies that all the other potential causal effects that can produce correlations among observed measures have an effect size of zero. If this assumption is correct, the shared variance across independent methods represents CV. It is then possible to estimate the proportion of the shared variance relative to the total observed variance of a personality measure as an estimate of the amount of CV in this measure. For example, in the previous example I assumed that actual mass was the only causal force that contributed to the correlation between self-reports of weight and objective scale scores. This assumption would be violated if self-ratings were based on previous measurements with objective scales (which is likely) and objective scales share method variance that does not reflect actual weight (which is unlikely). Thus, even validation studies with objective measures implicitly make assumptions about the causal model underlying these correlations.

In sum, the weight example illustrated how a causal model of the convergent validity between two measures of the same construct can be used to obtain quantitative estimates of the construct validity of a self-report measure of a personality characteristic. The following example shows how the same approach can be used to examine the construct validity of measures that aim to assess personality traits without the help of an objective measure that relies on well-established measurement procedures for physical characteristics like weight.

CONVERGENT VALIDITY OF PERSONALITY MEASURES

A Hypothetical Example

I use helpfulness as an example. Helpfulness is relatively easy to define as ‘providing assistance or serving a useful function’ (wordnetweb.princeton.edu/perl/webwn). Helpful can be used to describe a single act or an individual. If helpful is used to describe a single act, helpful is not only a characteristic of a person because helping behaviour is also influenced by situational factors and interactions between personality and situational factors. Thus, it is still necessary to provide a clearer definition of helpfulness as a personality characteristic before it is possible to examine the validity of a personality measure of helpfulness.

Personality psychologists use trait concepts like helpful in two different ways. The most common approach is to define helpful as an internal disposition. This definition implies causality. There are some causal factors within an individual that make it more likely for this individual to act in a helpful manner than other individuals. The alternative approach is to define helpfulness as the frequency with which individuals act in a helpful manner. An individual is helpful if he or she acted in a helpful manner more often than other people. This approach is known as the act frequency approach. The broader theoretical differences between these two approaches are well known and have been discussed elsewhere (Block, 1989; Funder, 1991; McCrae & Costa, 1995). However, the implications of these two definitions of personality traits for the interpretation of multi-method data have not been discussed. Ironically, it is easier to examine the validity of personality measures that aim to assess internal dispositions that are not directly observable than to do so for personality measures that aim to assess frequencies of observable acts. This is ironic because intuitively it seems to be easier to count the frequency of observable acts than to measure unobservable internal dispositions. In fact, not too long ago some psychologists doubted that internal dispositions even exist (cf. Goldberg, 1992).

The measurement problem of the act frequency approach is that it is quite difficult to observe individuals’ actual behaviours in the real world. For example, it is no trivial task to establish how often John was helpful in the past month. In comparison, it is relatively easy to use correlations among multiple imperfect measures of observable behaviours to make inferences about the influence of unobserved internal dispositions on behaviour.

Figure 1. Theoretical model of multi-method data. Note: T = trait (general disposition); AF-c, AF-f, AF-s = act frequencies with colleague, friend and spouse; S-c, S-f, S-s = situational and person x situation interaction effects on act frequencies; R-c, R-f, R-s = reports by colleague, friend and spouse; E-c, E-f, E-s = errors in reports by colleague, friend and spouse.

Figure 1 illustrates how a causal model of multi-method data can be used for this purpose. In Figure 1, an unobserved general disposition to be helpful influences three observed measures of helpfulness. In this example, the three observed measures are informant ratings of helpfulness by a friend, a co-worker and a spouse. Unlike actual informant ratings in personality research, informants in this hypothetical example are only asked to report how often the target helped them in the past month. According to Figure 1, each informant report is influenced by two independent factors, namely, the actual frequency of helpful acts towards the informant and (systematic and random) measurement error in the reported frequencies of helpful acts towards the informant. The actual frequency of helpful acts is also influenced by two independent factors. One factor represents the general disposition to be helpful that influences helpful behaviours across situations. The other factor represents situational factors and person-situation interaction effects. To fully estimate all coefficients in this model (i.e. effect sizes of the postulated causal effects), it would be necessary to separate measurement error and valid variance in act frequencies.

This is impossible if, as in Figure 1, each act frequency is measured with a single method,
namely, one informant report. In contrast, the influence of the general disposition is
reflected in all three informant reports. As a result, it is possible to separate the variance due to the general disposition from all other variance components such as random error,
systematic rating biases, situation effects and person-situation interaction effects. It is
then possible to determine the validity of informant ratings as measures of the general
disposition, but it is impossible to (precisely) estimate the validity of informant ratings as
measures of act frequencies because the model cannot distinguish reporting errors from
situational influences on helping behaviour.

The causal model in Figure 1 makes numerous independence assumptions that specify
Campbell and Fiske’s (1959) requirement that traits should be assessed with independent
methods. First, the model assumes that biases in ratings by one rater are independent of
biases in ratings by other raters. Second, it assumes that situational factors and
person by situation interaction effects that influence helping one informant are independent of the situational and person-situation factors that influence helping other informants. Third, it assumes that rating biases are independent of situation and person by situation interaction effects for the same rater and across raters. Finally, it assumes that rating biases and situation effects are independent of the global disposition. In total, this amounts to 21 independence assumptions (i.e. Figure 1 includes seven exogenous variables, that is, variables that do not have an arrow pointing at them, which implies 21 (7×6/2) relationships that the model assumes to be zero). If these independence assumptions are correct, the correlations among the three informant ratings can be used to determine the variation in the unobserved personality disposition to be helpful with perfect validity. This variance can then be used like the objective measure of weight in the previous example as the validation criterion for personality measures of the general
disposition to be helpful (e.g. self-ratings of general helpfulness). In sum, Figure 1
illustrates that a specific pattern of correlations among independent measures of the same construct can be used to obtain precise estimates of the amount of valid variance in a single measure.

The main challenge for actual empirical studies is to ensure that the methods in a multi-method model fulfill the independence assumptions. The following examples demonstrate the importance of the neglected independence assumption for the correct interpretation of causal models of multi-method data. I also show how researchers can partially test the independence assumption if sufficient methods are available and how researchers can estimate the validity of personality measures that aggregate scores from independent methods. Before I proceed, I should clarify that strict independence of methods is unlikely, just like other null-hypotheses are likely to be false. However, small violations of the independence assumption will only introduce small biases in estimates of CVCs.

Example 1: Multiple response formats

The first example is a widely cited study of the relation between Positive Affect and
Negative Affect (Green, Goldman, & Salovey, 1993). I chose this paper because the authors
emphasized the importance of a multi-method approach for the measurement of affect,
while neglecting Campbell and Fiske’s requirement that the methods should be maximally different. A major problem for any empirical multi-method study is to find multiple independent measures of the same construct. The authors used four self-report measures with different response formats for this purpose. However, the variation of response formats can only be considered a multi-method study if one assumes that responses on one response format are independent of responses on the other response formats so that correlations across response formats can only be explained by a common causal effect of actual momentary affective experiences on each response format. Yet the validity of all self-report measures depends on the ability and willingness of respondents to report their experiences accurately. Violations of this basic assumption introduce shared method variance among self-ratings on different response formats. For example, socially desirable responding can inflate ratings of positive experiences across response formats. Thus, Green et al.’s (1993) study assumed rather than tested the validity of self-ratings of momentary affective experiences. At best, their study was able to examine the contribution of stylistic tendencies in the use of specific response formats to variance in mood ratings, but these effects are known to be small (Schimmack, Bockenholt, & Reisenzein, 2002). In sum, Green et al.’s (1993) article illustrates the importance of critically examining the similarity of methods in a multi-method study. Studies that use multiple self-report measures that vary response formats, scales, or measurement occasions should not be considered multi-method studies that can be used to examine construct validity.

Example 2: Three different measures

The second example of a multi-method study also examined the relation between Positive Affect and Negative Affect (Diener et al., 1995). However, it differs from the previous example in two important ways. First, the authors used more dissimilar methods that are less likely to violate the independence assumption, namely, self-reports of affect in the past month, averaged daily affect ratings over a 6-week period and averaged ratings of general affect by multiple informants. Although these are different methods, it is possible that these methods are not strictly independent. For example, Diener et al. (1995) acknowledge that all three measures could be influenced by impression management. That is, retrospective and daily self-ratings could be influenced by socially desirable responding, and informant ratings could be influenced by targets’ motivation to hide negative emotions from others. A common influence of impression management on all three methods would inflate validity estimates of all three methods.

For this paper, I used Diener et al.’s (1995) multi-method data to estimate CVCs for the
three methods as measures of general dispositions that influence people’s positive and
negative affective experiences. I used the data from Diener et al.’s (1995) Table 15 that are reproduced in Table 1. I used Mplus 5.1 for these analyses and all subsequent analyses (Muthen & Muthen, 2008). I fitted a simple model with a single latent variable that represents a general disposition that has causal effects on the three measures. Model fit was perfect because a model with three observed correlations and three free parameters (the factor loadings) has zero degrees of freedom and can perfectly reproduce the observed pattern of correlations. The perfect fit implies that CVC estimates are unbiased if the model assumptions are correct, but it also implies that the data are unable to test model assumptions.
These results suggest impressive validity of self-ratings of affect (Table 2). In contrast,
CVC estimates of informant ratings are considerably lower, despite the fact that informant ratings are based on averages of several informants. The non-overlapping confidence intervals for self-ratings and informant ratings indicate that this difference is statistically significant. There are two interpretations of this pattern. On the one hand, it is possible that informants are less knowledgeable about targets’ affective experiences. After all, they do not have access to information that is only available introspectively. However, this privileged information does not guarantee that self-ratings are more valid because individuals only have privileged information about their momentary feelings in specific situations rather than the internal dispositions that influence these feelings. On the other hand, it is possible that retrospective and daily self-ratings share method variance and do not fulfill the independence assumption. In this case, the causal model would provide inflated estimates of the validity of self-ratings because it assumes that stronger correlations between retrospective and daily self-ratings reveal higher validity of these methods, when in reality the higher correlation is caused by shared method effects. A study with three methods is unable to test these alternative explanations.
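To make the logic concrete, here is a minimal sketch of the closed-form solution for a one-factor model with three indicators. The correlations are hypothetical placeholders rather than the values from Diener et al.’s (1995) Table 15, and the sketch relies on the same independence assumptions as the model described above.

```python
# Minimal sketch: one-factor model with three indicators (zero degrees of freedom).
# Under the model, r_ij = l_i * l_j, so each loading has a closed-form solution.
# The correlations below are hypothetical placeholders, not Diener et al.'s data.
import numpy as np

r12, r13, r23 = 0.60, 0.45, 0.40  # hypothetical correlations among the three methods

l1 = np.sqrt(r12 * r13 / r23)  # loading (CVC) of method 1 on the latent disposition
l2 = np.sqrt(r12 * r23 / r13)  # loading (CVC) of method 2
l3 = np.sqrt(r13 * r23 / r12)  # loading (CVC) of method 3

for name, loading in zip(["method 1", "method 2", "method 3"], [l1, l2, l3]):
    print(f"{name}: CVC = {loading:.2f}, valid variance = {loading**2:.2f}")
```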

Example 3: Informants as multiple methods

One limitation of Diener et al.’s (1995) study was the aggregation of informant ratings.
Although aggregated informant ratings provide more valid information than ratings by a
single informant, the aggregation of informant ratings destroys valuable information about the correlations among informant ratings. The example in Figure 1 illustrated that ratings by multiple informants provide one of the easiest ways to measure dispositions with multiple methods because informants are more likely to base their ratings on different situations, which is necessary to reveal the influence of internal dispositions.

Example 3 shows how ratings by multiple informants can be used in construct validation research. The data for this example are based on multi-method data from the Riverside Accuracy Project (Funder, 1995; Schimmack, Oishi, Furr, & Funder, 2004). To make the CVC estimates comparable to those based on the previous example, I used scores on the depression and cheerfulness facets of the NEO-PI-R (Costa & McCrae, 1992). These facets are designed to measure affective dispositions. The multi-method model used self-ratings and informant ratings by parents, college friends and hometown friends as different methods.

Table 3 shows the correlation matrices for cheerfulness and depression. I first fitted to the data a causal model that assumed independence of all methods. The model also included sum scores of observed measures to examine the validity of aggregated informant ratings and an aggregated measure of all four raters (Figure 2). Model fit was evaluated using standard criteria of model fit, namely, comparative fit index (CFI) > .95, root mean square error of approximation (RMSEA) < .06 and standardized root mean square residual (SRMR) < .08.

Neither cheerfulness, chi2 (df = 2, N = 222) = 11.30, p < .01, CFI = .860, RMSEA = .182, SRMR = .066, nor depression, chi2 (df = 2, N = 222) = 8.31, p = .02, CFI = .915, RMSEA = .150, SRMR = .052, had acceptable CFI and RMSEA values.

One possible explanation for this finding is that self-ratings are not independent of informant ratings because self-ratings and informant ratings could be partially based on overlapping situations. For example, self-ratings of cheerfulness could be heavily influenced by the same situations that are also used by college friends to rate cheerfulness (e.g. parties). In this case, some of the agreement between self-ratings and informant ratings by college friends would reflect the specific situational factors of
overlapping situations, which leads to shared variance between these ratings that does not reflect the general disposition. In contrast, it is more likely that informant ratings are independent of each other because informants are less likely to rely on the same situations (Funder, 1995). For example, college friends may rely on different situations than parents.

To examine this possibility, I fitted a model that included additional relations between self-ratings and informant ratings (dotted lines in Figure 2). For cheerfulness, an additional relation between self-ratings and ratings by college friends was sufficient to achieve acceptable model fit, chi2 (df = 1, N = 222) = 0.08, p = .78, CFI = 1.00, RMSEA = .000, SRMR = .005. For depression, additional relations of self-ratings to ratings by college
friends and parents were necessary to achieve acceptable model fit. Model fit of this model was perfect because it has zero degrees of freedom. In these models, CVC can no longer be estimated by factor loadings alone because some of the valid variance in self-ratings is also shared with informant ratings. In this case, CVC estimates represent the combined total effect of the direct effect of the latent disposition factor on self-ratings and the indirect effects that are mediated by informant ratings.

I used the MODEL INDIRECT option of Mplus 5.1 to estimate the total effects in a model that also included sum scores with equal weights for the three informant ratings and all four ratings. Table 4 lists the CVC estimates for the four ratings and the two measures based on aggregated ratings.

The CVC estimates of self-ratings are considerably lower than those based on Diener
et al.’s (1995) data. Moreover, the results suggest that in this study aggregated informant
ratings are more valid than self-ratings, although the confidence intervals overlap. The
results for the aggregated measure of all four raters show that adding self-ratings to
informant ratings did not increase validity above and beyond the validity obtained by
aggregating informant ratings.

These results should not be taken too seriously because they are based on a single,
relatively small sample. Moreover, it is important to emphasize that these CVC estimates
depend on the assumption that informant ratings do not share method variance. Violation of this assumption would lead to an underestimation of the validity of self-ratings. For example, an alternative assumption would be that personality changes. As a result, parent ratings and ratings by hometown friends may share variance because they are based in part on situations before personality changed, whereas college friends’ ratings are based on more recent situations. This model fits the data equally well and leads to much higher estimates of CV in self-ratings. To test these competing models it would be necessary to include additional measures. For example, standardized laboratory tasks and biological measures could be added to the design to separate valid variance from shared rating biases by informants.

These inconsistent findings might suggest that it is futile to seek quantitative estimates of construct validity when the estimates diverge so widely. However, the same problem arises in other research areas, and it can be addressed by designing better studies that test assumptions that cannot be tested in existing data sets. In fact, I believe that the publication of conflicting validity estimates will stimulate research on construct validity, whereas the view of construct validation research as an obscure process without clear results has merely hidden the lack of knowledge about the validity of personality measures.

IMPLICATIONS

I used two multi-method datasets to illustrate how causal models of multi-method data can be used to estimate the validity of personality measures. The studies produced different results. It is not the purpose of this paper to examine the sources of disagreement. The results merely show that it is difficult to make general claims about the validity of commonly used personality measures. The available evidence suggests that about 30–70% of the variance in self-ratings and single informant ratings is CV. Until more precise estimates become available, I suggest 50 +/- 20% as a rough estimate of the construct validity of personality ratings.

I suggest the verbal labels low validity for measures with less than 30% CV (e.g. implicit measures of well-being, Walker & Schimmack, 2008), moderate validity for measures with 30–70% CV (most self-report measures of personality traits) and high validity for measures with more than 70% CV (self-ratings of height and weight). Subsequently, I briefly discuss the practical implications of using self-report measures with moderate validity to study the causes and consequences of personality dispositions.

Correction for invalidity

Measurement error is nearly unavoidable, especially in the measurement of complex
constructs such as personality dispositions. Schmidt and Hunter (1996) provided
26 examples of how the failure to correct for measurement error can bias substantive
conclusions. One limitation of their important article was the focus on random
measurement error. The main reason is probably that information about random
measurement error is readily available. However, invalid variance due to systematic
measurement error is another factor that can distort research findings. Moreover, given
the moderate amount of valid variance in personality measures, corrections for invalidity are likely to have more dramatic practical implications than corrections for unreliability. The following examples illustrate this point.

Hundreds of twin studies have examined the similarity between MZ and DZ twins to
examine the heritability of personality characteristics. A common finding in these studies is moderate to large MZ correlations (r = .3–.5) and small to moderate (r = .1–.3) DZ correlations. This finding has led to the conclusion that approximately 40% of the variance is heritable and 60% of the variance is caused by environmental factors. However, this interpretation of twin data fails to take measurement error into account. As it turns out, MZ correlations approach, if not exceed, the amount of valid variance in personality measures as estimated from multi-method data. In other words, self-ratings by MZ twins (ratings by two different individuals of two different individuals) tend to correlate as highly with each other as self-ratings and informant ratings of a single target (ratings by two different individuals of the same individual). This finding suggests that heritability estimates based on mono-method studies severely underestimate the heritability of personality dispositions (Riemann, Angleitner, & Strelau, 1997). A correction for invalidity would suggest that most of the valid variance is heritable (Lykken & Tellegen, 1996). However, it is problematic to apply a direct correction for invalidity to twin data because this correction assumes that the independence assumption is valid. It is better to combine a multi-method assessment with a twin design (Riemann et al., 1997). It is also important to realize that multi-method models focus on internal dispositions rather than act frequencies. It makes sense that heritability estimates of internal dispositions are higher than heritability estimates of act frequencies because act frequencies are also influenced by situational factors.
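To make the arithmetic concrete, the following sketch applies Falconer's classic formula to hypothetical twin correlations and then rescales the estimate by an assumed proportion of valid variance. The numbers are illustrative, and the direct correction carries the caveat noted above that it treats invalid variance as unshared between twins.

```python
# Illustrative correction of a heritability estimate for invalid variance.
# All numbers are hypothetical; the direct correction assumes that invalid
# variance is not shared between twins (the independence assumption).
r_mz, r_dz = 0.45, 0.20   # hypothetical MZ and DZ twin correlations (self-ratings)
valid_variance = 0.55     # assumed proportion of valid variance in the measure

h2_observed = 2 * (r_mz - r_dz)              # Falconer's formula on observed scores
h2_corrected = h2_observed / valid_variance  # heritability of the valid variance only

print(f"observed h2 = {h2_observed:.2f}, corrected h2 = {h2_corrected:.2f}")
```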

Stability of personality dispositions

The study of stability of personality has a long history in personality psychology (Conley,
1984). However, empirical conclusions about the actual stability of personality are
hampered by the lack of good data. Most studies have relied on self-report data to examine this question. Given the moderate validity of self-ratings, it is likely that studies based on self-ratings underestimate the true stability of personality. Even corrections for unreliability alone are sufficient to achieve impressive stability estimates of r = .98 over a 1-year interval (Anusic & Schimmack, 2016; Conley, 1984). The evidence for stability of personality from multi-method studies is even more impressive. For example, one study reported a retest correlation of r = .46 over a 26-year interval for a self-report measure of neuroticism (Conley, 1985). It seems possible that personality could change considerably over such a long time period. However, the study also included informant ratings of personality. Self-informant agreement on the same occasion was also r = .46. Under the assumption that self-ratings and informant ratings are independent methods and that there is no stability in method variance, this pattern of correlations would imply that variation in neuroticism did not change at all over this 26-year period (.46/.46 = 1.00). However, this conclusion rests on the validity of the assumption that method variance is not stable. Given the availability of longitudinal multi-method data, it is possible to test this assumption. The relevant information is contained in the cross-informant, cross-occasion correlations. If method variance were unstable, these correlations should also be r = .46. In contrast, the actual correlations are lower, r = .32. This finding indicates that (a) personality dispositions changed and (b) there is some stability in the method variance. However, the actual stability of personality dispositions is still considerably higher (r = .32/.46 = .70) than one would have inferred from the observed retest correlation of r = .46 for self-ratings alone. A retest correlation of r = .70 over a 26-year interval is consistent with other estimates that the stability of personality dispositions is about r = .90 over a 10-year period and r = .98 over a 1-year period (Conley, 1984; Terracciano, Costa, & McCrae, 2006) and that the majority of the variance is due to stable traits that never change (Anusic & Schimmack, 2016). The failure to realize that observed retest correlations underestimate the stability of personality dispositions can be costly because it gives personality researchers a false impression about the likelihood of finding empirical evidence for personality change. Given the true stability of personality, it is necessary to wait a long time or to use large sample sizes, and it is probably best to do both (Mroczek, 2007).
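The following sketch simply reproduces the arithmetic of this correction with the correlations reported above (Conley, 1985), under the stated assumption that self-ratings and informant ratings are independent methods.

```python
# Sketch of the stability correction described above, using the reported values.
r_retest_self = 0.46             # self-rating retest correlation over 26 years
r_self_informant = 0.46          # self-informant agreement on the same occasion
r_cross_rater_cross_time = 0.32  # cross-informant, cross-occasion correlation

# Naive correction: assumes no stable method variance in self-ratings.
stability_naive = r_retest_self / r_self_informant                 # 1.00

# Correction that allows for stable method variance in self-ratings.
stability_corrected = r_cross_rater_cross_time / r_self_informant  # ~ .70

print(f"naive stability = {stability_naive:.2f}")
print(f"corrected stability = {stability_corrected:.2f}")
```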

Prediction of behaviour and life outcomes

During the person-situation debate, it was proposed that a single personality trait predicts less than 10% of the variance in actual behaviours. However, most of these studies relied on self-ratings of personality to measure personality. Given the moderate validity of self-ratings, the observed correlation severely underestimates the actual effect of personality traits on behaviour. For example, a recent meta-analysis reported an effect size of conscientiousness on GPA of r =.24 (Noftle & Robins, 2007). Ozer (2007) points out
that strictly speaking the correlation between self-reported conscientiousness and GPA
does not represent the magnitude of a causal effect.

Assuming 40% valid variance in self-report measures of conscientiousness (DeYoung, 2006), the true effect size of a conscientious disposition on GPA is r = .38 (.24/sqrt(.40)). As a result, the amount of explained variance in GPA increases from 6% to 14%. Once more, failure to correct for invalidity in personality measures can be costly. For example, a personality researcher might identify seven causal factors that independently produce observed effect size estimates of r = .24, which suggests that these seven factors explain less than half of the variance in GPA (7 * .24^2 = 40%). Imagine, however, that decades of future research fail to uncover additional predictors of GPA. The reason could be that the true amount of explained variance is nearly 100% and that the unexplained variance is due to invalid variance in personality measures (7 * .38^2 = 100%).
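The sketch below reproduces this correction for invalidity with the values from the text; the seven-predictor scenario is the hypothetical one described above.

```python
# Correction of an observed validity coefficient for invalid variance in the
# predictor, using the values discussed in the text.
import math

r_observed = 0.24      # observed conscientiousness-GPA correlation (Noftle & Robins, 2007)
valid_variance = 0.40  # assumed proportion of valid variance in self-rated conscientiousness

r_corrected = r_observed / math.sqrt(valid_variance)  # ~ .38

print(f"corrected r = {r_corrected:.2f}")
print(f"explained variance: observed = {r_observed**2:.0%}, corrected = {r_corrected**2:.0%}")

# Hypothetical scenario: seven independent predictors of this size.
print(f"7 predictors, observed r:  {7 * r_observed**2:.0%} of GPA variance")
print(f"7 predictors, corrected r: {7 * r_corrected**2:.0%} of GPA variance")
```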

CONCLUSION

This paper provided an introduction to the logic of a multi-method study of construct
validity. I showed how causal models of multi-method data can be used to obtain
quantitative estimates of the construct validity of personality measures. I showed that
accurate estimates of construct validity depend on the validity of the assumptions
underlying a causal model of multi-method data such as the assumption that methods are independent. I also showed that multi-method studies of construct validity require
postulating a causal construct that can influence and produce covariances among
independent methods. Multi-method studies for other constructs such as actual behaviours or act frequencies are more problematic because act frequencies do not predict a specific pattern of correlations across methods. Finally, I presented some preliminary evidence that commonly used self-ratings of personality are likely to have a moderate amount of valid variance that falls broadly in a range from 30% to 70% of the total variance. This estimate is consistent with meta-analyses of self-informant agreement (Connolly, Kavanagh, & Viswesvaran, 2007; Schneider & Schimmack, 2009). However, the existing evidence is limited and more rigorous tests of construct validity are needed. Moreover studies with large, representative samples are needed to obtain more precise estimates of construct validity (Zou, Schimmack, & Gere, 2013). Hopefully, this paper will stimulate more research in this fundamental area of personality psychology by challenging the description of construct validity research as a Kafkaesque pursuit of an elusive goal that can never be reached (cf. Borsboom, 2006). Instead empirical studies of construct validity are a viable and important scientific enterprise that faces the same challenges as other studies in personality psychology that try
to make sense of correlational data.

REFERENCES

Allport, G. W., & Odbert, H. S. (1936). Trait-names: A psycho-lexical study. Psychological Monographs, 47(1), 1–171.

Anusic, I., & Schimmack, U. (2016). Stability and change of personality traits, self-esteem, and well-being: Introducing the meta-analytic stability and change model of retest correlations. Journal of Personality and Social Psychology, 110(5), 766–781.

Biesanz, J. C., & West, S. G. (2004). Towards understanding assessments of the Big Five: Multitrait-multimethod analyses of convergent and discriminant validity across measurement occasion and type of observer. Journal of Personality, 72(4), 845–876.

Blanton, H., & Jaccard, J. (2006). Arbitrary metrics redux. American Psychologist, 61(1), 62–71.

Block, J. (1989). Critique of the act frequency approach to personality. Journal of Personality and Social Psychology, 56(2), 234–245.

Borsboom, D. (2006). The attack of the psychometricians. Psychometrika, 71(3), 425–440.

Borsboom, D., & Mellenbergh, G. J. (2002). True scores, latent variables, and constructs: A comment on Schmidt and Hunter. Intelligence, 30(6), 505–514.

Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2003). The theoretical status of latent variables. Psychological Review, 110(2), 203–219.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Chaplin, W. F. (2007). Moderator and mediator models in personality research: A basic introduction. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 602–632). New York, NY: Guilford Press.

Conley, J. J. (1984). The hierarchy of consistency: A review and model of longitudinal findings on adult individual differences in intelligence, personality and self-opinion. Personality and Individual Differences, 5(1), 11–25.

Conley, J. J. (1985). Longitudinal stability of personality traits: A multitrait-multimethod-multioccasion analysis. Journal of Personality and Social Psychology, 49(5), 1266–1282.

Connolly, J. J., Kavanagh, E. J., & Viswesvaran, C. (2007). The convergent validity between self and observer ratings of personality: A meta-analytic review. International Journal of Selection and Assessment, 15(1), 110–117.

Costa, J. P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and Five Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological Assessment Resources.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

DeYoung, C. G. (2006). Higher-order factors of the Big Five in a multi-informant sample. Journal of Personality and Social Psychology, 91(6), 1138–1151.

Diener, E., Smith, H., & Fujita, F. (1995). The personality structure of affect. Journal of Personality and Social Psychology, 69(1), 130–141.

Eid, M., Lischetzke, T., Nussbeck, F. W., & Trierweiler, L. I. (2003). Separating trait effects from trait-specific method effects in multitrait-multimethod models: A multiple-indicator CT-C(M-1) model. Psychological Methods, 8(1), 38–60.
[Assumes a gold-standard method without systematic measurement error (e.g. an objective measure of height or weight is available).]

Funder, D. C. (1991). Global traits—a Neo-Allportian approach to personality. Psychological Science, 2(1), 31–39.

Funder, D. C. (1995). On the accuracy of personality judgment—a realistic approach. Psychological Review, 102(4), 652–670.

Goldberg, L. R. (1992). The social psychology of personality. Psychological Inquiry, 3, 89–94.

Green, D. P., Goldman, S. L., & Salovey, P. (1993). Measurement error masks bipolarity in affect ratings. Journal of Personality and Social Psychology, 64(6), 1029–1041.

Grucza, R. A., & Goldberg, L. R. (2007). The comparative validity of 11 modern personality
inventories: Predictions of behavioral acts, informant reports, and clinical indicators. Journal of Personality Assessment, 89(2), 167–187.

John, O. P., & Soto, C. J. (2007). The importance of being valid: Reliability and the process of construct validation. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 461–494). New York, NY: Guilford Press.

Kenny, D. A., & Kashy, D. A. (1992). Analysis of the multitrait-multimethod matrix by confirmatory factor analysis. Psychological Bulletin, 112(1), 165–172.

Kroh, M. (2005). Effects of interviews during body weight checks in general population surveys. Gesundheitswesen, 67(8–9), 646–655.

Lykken, D., & Tellegen, A. (1996). Happiness is a stochastic phenomenon. Psychological Science, 7(3), 186–189.

McCrae, R. R., & Costa, P. T. (1995). Trait explanations in personality psychology. European Journal of Personality, 9(4), 231–252.

Mroczek, D. K. (2007). The analysis of longitudinal data in personality research. In R.W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 543–556). New York, NY, US: Guilford Press.

Muthen, L. K., & Muthen, B. O. (2008). Mplus user’s guide (5th ed.). Los Angeles, CA: Muthen & Muthen. 

Noftle, E. E., & Robins, R. W. (2007). Personality predictors of academic outcomes: Big five
correlates of GPA and SAT scores. Journal of Personality and Social Psychology, 93(1), 116–130.

Ozer, D. J. (2007). Evaluating effect size in personality research. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology. New York, NY: Guilford Press.

Riemann, R., Angleitner, A., & Strelau, J. (1997). Genetic and environmental influences on personality: A study of twins reared together using the self- and peer-report NEO-FFI scales. Journal of Personality, 65(3), 449–475.

Robins, R. W., & Beer, J. S. (2001). Positive illusions about the self: Short-term benefits and long-term costs. Journal of Personality and Social Psychology, 80(2), 340–352.

Rowland, M. L. (1990). Self-reported weight and height. American Journal of Clinical Nutrition, 52(6), 1125–1133.

Schimmack, U. (2007). The structure of subjective well-being. In M. Eid, & R. J. Larsen (Eds.), The science of subjective well-being (pp. 97–123). New York: Guilford.

Schimmack, U., Bockenholt, U., & Reisenzein, R. (2002). Response styles in affect ratings: Making a mountain out of a molehill. Journal of Personality Assessment, 78(3), 461–483.

Schimmack, U., & Oishi, S. (2005). The influence of chronically and temporarily accessible
information on life satisfaction judgments. Journal of Personality and Social Psychology,
89(3), 395–406.

Schimmack, U., Oishi, S., Furr, R. M., & Funder, D. C. (2004). Personality and life satisfaction: A facet-level analysis. Personality and Social Psychology Bulletin, 30(8), 1062–1075.

Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1(2), 199–223.

Schneider, L., & Schimmack, U. (2009). Self-informant agreement in well-being ratings: A meta-analysis. Social Indicators Research, 94, 363–376.

Simms, L. J., & Watson, D. (2007). The construct validation approach to personality scale construction. In R. W. Robins, C. R. Fraley, & R. F. Krueger (Eds.), Handbook of research methods in personality psychology (pp. 240–258). New York, NY: Guilford Press.

Terracciano, A., Costa, J. P. T., & McCrae, R. R. (2006). Personality plasticity after age 30.
Personality and Social Psychology Bulletin, 32, 999–1009.

Walker, S. S., & Schimmack, U. (2008). Validity of a happiness Implicit Association Test as a measure of subjective well-being. Journal of Research in Personality, 42(2), 490–497.

Watson, D., Clark, L. A., & Tellegen, A. (1988). Development and validation of brief measures of positive and negative affect: The PANAS Scales. Journal of Personality and Social Psychology, 54(6), 1063–1070.

Watson, D., Wiese, D., Vaidya, J., & Tellegen, A. (1999). The two general activation systems of affect: Structural findings, evolutionary considerations, and psychobiological evidence. Journal of Personality and Social Psychology, 76(5), 820–838.

Zou, C., Schimmack, U., & Gere, J. (2013).  The validity of well-being measures: A multiple-indicator–multiple-rater model.  Psychological Assessment, 25, 1247-1254.

Fritz Strack asks “Have I done something wrong?”

Since 2011, experimental social psychology has been in crisis mode. It has become evident that social psychologists violated some implicit or explicit norms about science. Readers of scientific journals expect that the methods and results sections of a scientific article provide an objective description of the study, data analysis, and results. However, it is now clear that this is rarely the case. Experimental social psychologists have used various questionable research practices to report mostly results that supported their theories. As a result, it is now unclear which published results are replicable and which results are not.

In response to this crisis of confidence, a new generation of social psychologists has started to conduct replication studies. The most informative replication studies are published in a new type of article called a registered replication report (RRR).

What makes RRRs so powerful is that they are not simple replication studies. An RRR is a collective effort to replicate an original study in multiple labs. This makes it possible to examine the generalizability of results across different populations and to combine the data in a meta-analysis. The pooling of data across multiple replication studies reduces sampling error, and it becomes possible to obtain fairly precise effect size estimates that can be used to provide positive evidence for the absence of an effect. If the effect size estimate is within a reasonably small interval around zero, the results suggest that the population effect size is so close to zero that it is theoretically irrelevant. In this way, an RRR can have three possible results: (a) it replicates an original result in most of the individual studies (e.g., with 80% power, it would replicate the result in 80% of the replication attempts); (b) it fails to replicate the result in most of the replication attempts (e.g., it replicates the result in only 20% of replication studies), but the effect size in the meta-analysis is significant; or (c) it fails to replicate the original result in most studies and the meta-analytic effect size estimate suggests the effect does not exist.
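As an illustration of the pooling idea (not the specific analysis used in any particular RRR), the sketch below computes an inverse-variance weighted average of standardized mean differences; the effect sizes and sample sizes are hypothetical.

```python
# Illustration of inverse-variance (fixed-effect) pooling of standardized mean
# differences from several replication studies. All numbers are hypothetical.
import numpy as np

d = np.array([0.05, 0.20, -0.10, 0.15])  # hypothetical Cohen's d per lab
n = np.array([120, 95, 150, 110])        # total N per lab (two equal groups assumed)

var_d = 4 / n + d**2 / (2 * n)  # approximate sampling variance of d
w = 1 / var_d                   # inverse-variance weights

d_pooled = np.sum(w * d) / np.sum(w)
se_pooled = np.sqrt(1 / np.sum(w))
lo, hi = d_pooled - 1.96 * se_pooled, d_pooled + 1.96 * se_pooled

print(f"pooled d = {d_pooled:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```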

Another feature of RRRs is that original authors get an opportunity to publish a response. This blog post is about Fritz Strack’s response to the RRR of Strack et al.’s facial feedback study. Strack et al. (1988) reported two studies that suggested that incidental movement of facial muscles influences ratings of amusement in response to cartoons. The article is the second most cited article by Strack and his most cited empirical article. It is therefore likely that Strack cared about the outcome of the replication study.


So, it may have elicited some negative feelings when the results showed that none of the replication studies produced a significant result and the meta-analysis suggested that the original result was a false positive; that is, the population effect size is close to zero and the results of the original studies were merely statistical flukes.

[Figure: meta-analysis of the facial feedback RRR, showing effect size estimates and 95% confidence intervals for the original and replication studies]

Strack’s Response to the RRR

Strack’s first response to the RRR results was surprise, because numerous replication studies of this fairly famous study had been published before and they typically, but not always, reported successful replications. Any naive reader of the literature, review articles, or textbooks is likely to have the same response. If an article has over 600 citations, it suggests that it made a solid contribution to the literature.

However, social psychology is not a normal psychological science. Even more famous effects like the ego-depletion effect or elderly priming have failed to replicate. A major replication project published in 2015 showed that only 25% of social psychological studies could be replicated (OSC, 2015), and Strack had commented on this result. Thus, I was a bit surprised by Strack’s surprise, because the failure to replicate his results was in line with many other replication failures since 2011.

Despite concerns about the replicability of social psychology, Strack expected a positive result because he had conducted a meta-analysis of 20 studies that had been published in the past five years.

If 20 previous studies successfully replicated the effect and the 17 studies of the RRR all failed to replicate the effect, it suggests the presence of a moderator; that is, some variation between these two sets of studies that explains why the non-RRR studies found the effect and the RRR studies did not.

Moderator 1

First, the authors have pointed out that the original study is “commonly discussed in introductory psychology courses and textbooks” (p. 918). Thus, a majority of psychology students was assumed to be familiar with the pen study and its findings.

As the study used deception, it makes sense that the study does not work if students know about the study. However, this hypothesis assumes that all 17 samples in the RRR were recruited from universities in which the facial feedback hypothesis was taught before students participated in the study. Moreover, it assumes that none of the samples in the successful non-RRR studies had the same problem.

However, Strack does not compare non-RRR studies to RRR studies. Instead, he focuses on three RRR samples that did not use student samples (Holmes, Lynott, and Wagenmakers). None of the three samples individually shows a significant result. Thus, none of these studies replicated the original findings. Strack conducts a meta-analysis of these three studies and finds an average mean difference of d = .16.

Table 1

Dataset        N    M-teeth   M-lips   SD-teeth   SD-lips   Cohen’s d
Holmes        130     4.94     4.79      1.14      1.30       0.12
Lynott         99     4.91     4.71      1.49      1.31       0.14
Wagenmakers   126     4.54     4.18      1.42      1.73       0.23
Pooled        355     4.79     4.55      1.81      2.16       0.12

The standardized effect size of d = .16 is the unweighted average of the three d-values in Table 1. The weighted average is d = .17. However, if the three studies are first combined into a single dataset and the standardized mean difference is computed from the combined dataset, the standardized mean difference is only d = .12.

More importantly, the standard error for the pooled data is 2 / sqrt(355) = 0.106, which means the 95% confidence interval around any of these point estimates extends 0.106 * 1.96 = .21 standard deviations on each side of the point estimate. Even with d = .17, the 95%CI (-.04 to .38) includes 0. At the same time, the effect size in the original study was approximately d ~ .5, suggesting that the original results were extreme outliers or that additional moderating factors account for the discrepancy.
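The sketch below reproduces these calculations from the values in Table 1, using the same rough 2/sqrt(N) approximation of the standard error as the text; small discrepancies with the reported values reflect rounding of the tabled d-values.

```python
# Reproduces the simple calculations above from the values in Table 1,
# using the same rough standard error approximation (2 / sqrt(N)) as the text.
import numpy as np

d = np.array([0.12, 0.14, 0.23])  # Holmes, Lynott, Wagenmakers
n = np.array([130, 99, 126])

d_unweighted = d.mean()               # ~ .16
d_weighted = np.sum(n * d) / n.sum()  # ~ .16-.17, depending on rounding of the tabled d-values

se = 2 / np.sqrt(n.sum())             # ~ .106
lo, hi = d_weighted - 1.96 * se, d_weighted + 1.96 * se

print(f"unweighted d = {d_unweighted:.2f}, weighted d = {d_weighted:.2f}")
print(f"SE = {se:.3f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```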

Strack does something different. He tests the three proper studies against the average effect size of the “improper” replication studies with student samples. These studies have an average effect size of d = -0.03. This analysis shows a significant difference (proper d = .16 and improper d = -.03), t(15) = 2.35, p = .033.

This is an interesting pattern of results. The significant moderation effect suggests that facial feedback effects were stronger in the 3 studies identified by Strack than in the remaining 14 studies. At the same time, the average effect size of the three proper replication studies is still not significant, despite a pooled sample size that is three times larger than the sample size in the original study.

One problem for this moderator hypothesis is that the research protocol made it clear that the study had to be conducted before the original study was covered in a course (Simons’ response to Strack). However, maybe student samples fail to show the effect for another reason.

The best conclusion that can be drawn from these results is that the effect may be greater than zero, but that the effect size in the original studies was probably inflated.

Cartoons were not funny

Strack’s second concern is weaker, and it violates a basic social psychological insight about argument strength: adding weak arguments increases persuasiveness if the audience is not very attentive, but weak arguments backfire with an attentive audience.

Second, and despite the obtained ratings of funniness, it must be asked if Gary Larson’s The Far Side cartoons that were iconic for the zeitgeist of the 1980s instantiated similar psychological conditions 30 years later. It is indicative that one of the four exclusion criteria was participants’ failure to understand the cartoons.  

One of the exclusion criteria was failure to understand the cartoons, but how many participants were excluded because of this criterion? Without this information, the argument carries little weight, and if participants who did not understand the cartoons were excluded and only participants who understood the cartoons were analysed, it is not clear how a lack of understanding could explain the replication failure. Moreover, Strack just tried to claim that the proper studies did show the effect, which they could not have shown if the cartoons were not funny. Finally, the means clearly show that participants reported being amused by the cartoons.

Weak arguments like this undermine more reasonable arguments like the previous one.

Using A Camera 

Third, it should be noted that to record their way of holding the pen, the RRR labs deviated from the original study by directing a camera on the participants. Based on results from research on objective self-awareness, a camera induces a subjective self-focus that may interfere with internal experiences and suppress emotional responses.

This is a serious problem, and in hindsight the authors of the RRR probably regret the decision to use cameras. A recent article actually manipulated the presence or absence of a camera and found stronger effects without a camera, although the predicted interaction effect was not significant. Nevertheless, the study suggests that creating self-awareness with a camera could be a moderator and might explain why the RRR studies failed to replicate the original effect.

Reverse p-hacking 

Strack also noticed a statistical anomaly in the data. When he correlated the standardized mean differences (d-values) with the sample sizes, a non-significant positive correlation emerged, r(17) = .45, p = .069.   This correlation shows a statistical trend for larger samples to produce larger effect sizes.  Even if this correlation were significant, it is not clear what conclusions should be drawn from this observation.  Moreover, earlier Strack distinguished student samples and non-student samples as a potential moderator.  It makes sense to include this moderator in the analysis because it could be confounded with sample size.

A regression analysis shows that the effect of proper sample is no longer significant, t(14) = 1.75, and the effect of sampling error (1/sqrt(N)) is also not significant, t(14) = 0.59.

This new analysis suggests that sample size is not a predictor of effect sizes, which makes sense because there is no reasonable explanation for such a positive correlation.

However, now Strack makes a mistake by implying that the weaker effect sizes in smaller samples could be a sign of  “reverse p-hacking.”

Without insinuating the possibility of a reverse p hacking, the current anomaly
needs to be further explored.

The rhetorical vehicle of “Without insinuating” can be used to backtrack from the insinuation that was never made, but Strack is well aware of priming research and the ironic effect of instructions “not to think about a white bear”  (you probably didn’t think about one until now and nobody was thinking about reverse p-hacking until Strack mentioned it).

Everybody understood what he was implying (reverse p-hacking = intentional manipulations of the data to make significance disappear and discredit original researchers in order to become famous with failures to replicate famous studies) (Crystal Prison Zone blog; No Hesitations blog).


The main mistake that Strack makes is that a negative correlation between sample size and effect size can suggest that researchers were p-hacking (inflating effect sizes in small samples to get significance), but a positive correlation does not imply reverse p-hacking (making significant results disappear). Reverse p-hacking also implies a negative trend line, in which researchers like Wagenmakers with larger samples (the 2nd largest sample in the set), who love to find non-significant results that Bayesian statistics treat as evidence for the absence of an effect, would have to report smaller effect sizes to avoid significance or Bayes factors in favor of an effect.

So, here is Strack’s fundamental error.   He correctly assumes that p-hacking results in a negative correlation, but he falsely assumes that reverse p-hacking would produce a positive correlation and then treats a positive correlation as evidence for reverse p-hacking. This is absurd and the only way to backtrack from this faulty argument is to use the “without insinuating” hedge (“I never said that this correlation implies reverse p-hacking, in fact I made clear that I am not insinuating any of this.”)

Questionable Research Practices as Moderator 

Although Strack mentions two plausible moderators (student samples, camera), there are other possible moderators that could explain the discrepancies between his original results and the RRR results.  One plausible moderator that Strack does not mention is that the original results were obtained with questionable research practices.

Questionable research practices (QRPs) is a broad term for a variety of practices that undermine the credibility of published results (John et al., 2012), including fraud.

To be clear, I am not insinuating that Strack fabricated data, and I have said so in public before. Let me be absolutely clear, because the phrasing I used in the previous sentence is a stab at Strack’s reverse p-hacking quote, which may be understood as implying exactly what is not being insinuated. I positively do not believe that Fritz Strack faked data for the 1988 article or any other article.

One reason for my belief is that I don’t think anybody would fake data that produce only marginally significant results that some editors might reject as insufficient evidence. If you fake your data, why fake p = .06, if you can fake p = .04?

If you take a look at the Figure above, you see that the 95%CI of the original study includes zero. This means that the original study did not produce a statistically significant result by the conventional .05 (two-tailed) standard. However, Strack et al. (1988) used a more liberal criterion of .10 (two-tailed) or .05 (one-tailed) to test significance, and with this criterion the results were significant.

The problem is that it is unlikely for two independent studies to produce two marginally significant results in a row.  This is either an unlikely fluke or some other questionable research practices were used to get these results. So, yes, I am insinuating that questionable research practices may have inflated the effect sizes in Strack et al.’s studies and that this explains at least partially why the replication study failed.
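To give a sense of how unlikely two marginally significant results in a row are under honest reporting, the sketch below computes the probability that a z-statistic lands in the marginal zone (between the one-tailed and two-tailed .05 thresholds) for a few assumed true effects; the assumed expected z-values are illustrative.

```python
# Probability of landing in the "marginal" zone (1.65 < z < 1.96) in one study,
# and in two independent studies in a row, for several assumed true effects.
# The expected z-values (noncentrality) are illustrative assumptions.
from scipy import stats

for expected_z in (0.0, 1.0, 1.96, 2.8):  # null effect, low power, 50% power, 80% power
    p_marginal = stats.norm.cdf(1.96, loc=expected_z) - stats.norm.cdf(1.65, loc=expected_z)
    print(f"expected z = {expected_z:.2f}: one study = {p_marginal:.3f}, "
          f"two in a row = {p_marginal**2:.4f}")
```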

It is important to point out that in 1988, questionable research practices were not considered questionable. In fact, experimental social psychologists were trained, and trained their students, to use these practices to get significance (Stangor, 2012). Questionable research practices also explain why the reproducibility project could only replicate 25% of published results in social and personality psychology (OSC, 2015). Thus, it is plausible that QRPs also contributed to the discrepancy between Strack et al.’s original studies and the RRR results.

The OSC reproducibility project estimated that QRPs inflate effect sizes by 100%. Thus, an inflated effect size of d = .5 in Strack et al. (1988) might correspond to a true effect size of d = .25 (.25 real + .25 inflation = .50 observed). Moreover, the inflation increases as p-values get closer to .05. Thus, for a marginally significant result, inflation is likely to be greater than 100% and the true effect size is likely to be even smaller than .25. This suggests that the true effect size could be around d = .2, which is close to the effect size estimate of the “proper” RRR studies identified by Strack.

A meta-analysis of facial feedback studies also produced an average effect size estimate of d = .2, but this estimate is not corrected for publication bias, even though the meta-analysis showed evidence of publication bias. Thus, the true average effect size is likely to be lower than .2 standard deviations. Given the heterogeneity of studies in this meta-analysis, it is difficult to know which specific paradigms are responsible for this average effect size and could be successfully replicated (Coles & Larsen, 2017). The reason is that existing studies are severely underpowered to reliably detect effects of this magnitude (Replicability Report of Facial Feedback studies).

The existing evidence suggests that effect sizes of facial feedback studies are somewhere between 0 and .5, but it is impossible to say whether it is just slightly above zero with no practical significance or whether it is of sufficient magnitude so that a proper set of studies can reliably replicate the effect.  In short, 30 years and over 600 citations later, it is still unclear whether facial feedback effects exist and under which conditions this effect can be observed in the laboratory.

Did Fritz Strack Use Questionable Research Practices?

It is difficult to demonstrate conclusively that a researcher used QRPs based on a couple of statistical results, but it is possible to examine this question with larger sets of data. Brunner & Schimmack (2018) developed a method, z-curve, that can reveal the use of QRPs for large sets of statistical tests. To obtain a large set of studies, I used automatic text extraction of test statistics from articles co-authored by Fritz Strack. I then applied z-curve to the data. I limited the analysis to studies published before 2010, when social psychologists did not consider QRPs problematic, but rather a form of creativity and exploration. Of course, I view QRPs differently, but the question whether QRPs are wrong or not is independent of the question whether QRPs were used.

[Figure: z-curve of test statistics (converted to z-scores) from Fritz Strack’s articles]

This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t and F-values converted into z-scores) in Strack’s articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed). The high bar of z-scores between 1.8 and 2 shows marginally significant results, as does the next bar with z-scores between 1.6 and 1.8 (z = 1.65 ~ p = .05, one-tailed). The drop from 2 to 1.6 is too steep to be compatible with sampling error.
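The conversion underlying this figure is straightforward: each reported test statistic is transformed into the absolute z-score that corresponds to its two-tailed p-value. A minimal sketch with made-up test statistics:

```python
# Convert reported test statistics into absolute z-scores via their two-tailed
# p-values, as done before fitting z-curve. The example values are made up.
from scipy import stats

tests = [("t", 2.10, 28), ("t", 1.75, 60), ("F", 4.50, (1, 45))]

for kind, value, df in tests:
    if kind == "t":
        p = 2 * stats.t.sf(abs(value), df)  # two-tailed p-value of the t-test
    else:
        p = stats.f.sf(value, *df)          # p-value of the F-test
    z = stats.norm.isf(p / 2)               # absolute z with the same two-tailed p
    print(f"{kind}-test, df = {df}: statistic = {value}, p = {p:.3f}, z = {z:.2f}")
```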

The grey line provides a vague estimate of the expected proportion of non-significant results. The so-called file drawer (non-significant results that are not reported) is very large, and it is unlikely that so many studies were attempted and failed. For example, it is unlikely that Strack conducted 10 more studies with N = 100 participants and did not report the results of these studies because they produced non-significant results. Thus, it is likely that other QRPs were used that helped to produce significance. It is impossible to say how significant results were produced, but the distribution of z-scores strongly suggests that QRPs were used.

The z-curve method provides an estimate of the average power of significant results. The estimate is 63%. This value means that a randomly drawn significant result from one of Strack’s articles has a 63% probability of producing a significant result again in an exact replication study with the same sample size.

This value can be compared with estimates for social psychology using the same method. For example, the average for the Journal of Experimental Social Psychology from 2010-2015 is 59%.  Thus, the estimate for Strack is consistent with general research practices in his field.

One caveat with the z-curve estimate of 63% is that the dataset includes theoretically important and less important tests.  Often the theoretically important tests have lower z-scores and the average power estimates for so-called focal tests are about 20-30 percentage points lower than the estimate for all tests.

A second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies.  The average for studies with just significant results (z = 2 to 2.5) is only 45%.  The estimate for marginally significant results is even lower.

As 50% power corresponds to the criterion for significance, it is possible to use z-curve as a way to adjust p-values for the inflation that is introduced by QRPs. Accordingly, only z-scores greater than 2.5 (~ p = .01) would be significant at the 5% level after correcting for QRPs. However, once QRPs are being used, bias-corrected values are just estimates, and validation with actual replication studies is needed. Using this correction, Strack’s original results would not even meet the weak standard of marginal significance.

Final Words

In conclusion, 30 years of research have failed to provide conclusive evidence for or against the facial feedback hypothesis. The lack of progress is largely due to a flawed use of the scientific method. As journals published only successful outcomes, the published studies could not provide evidence for or against a hypothesis derived from a theory. Social psychologists only recognized this problem in 2011, when Bem was able to provide evidence for an incredible phenomenon of erotic time travel.

There is no simple answer to Fritz Strack’s question “Have I done something wrong?” because there is no objective standard to answer this question.

Did Fritz Strack fail to produce empirical evidence for the facial feedback hypothesis because he used the scientific method wrong?  I believe the answer is yes.

Did Fritz Strack do something morally wrong in doing so? I think the answer is no. He was trained and trained students in the use of a faulty method.

A more difficult question is whether Fritz Strack did something wrong in his response to the replication crisis and the results of the RRR.

We can all make honest mistakes, and it is possible that I made some honest mistakes when I wrote this blog post. Science is hard and it is unavoidable to make some mistakes. As the German saying goes, “Wo gehobelt wird, fallen Späne” (where wood is planed, shavings fall), which is equivalent to saying “you can’t make an omelette without breaking some eggs.”

Germans also have a saying that translates to “those who work a lot make a lot of mistakes; those who do nothing make no mistakes.” Clearly, Fritz Strack did a lot for social psychology, so it is only natural that he also made mistakes. The question is how scientists respond to criticism and the discovery of mistakes by other scientists.

The very famous social psychologist Susan Fiske (2015) encouraged her colleagues to welcome humiliation. This seems a bit much, especially for Germans, who love perfection and hate making mistakes. However, the danger of a perfectionistic ideal is that criticism can be interpreted as a personal attack, with unhealthy consequences. Nobody is perfect, and the best way to deal with mistakes is to admit them.

Unfortunately, many eminent social psychologists seem to be unable to admit that they used QRPs and that replication failures of some of their famous findings are to be expected.  It doesn’t require rocket science to realize that p-hacked results do not replicate without p-hacking. So, why is it so hard to admit the truth that everybody knows anyways?

It seems to be human nature to cover up mistakes. Maybe this is an embodied reaction to shame, like trying to cover up when a stranger sees us naked.  However, this natural response typically makes things worse, and it is better to override it to avoid even more severe consequences. A good example is Donald Trump. Surely, having sex with a pornstar is questionable behavior for a married man, but this is no longer the problem for Donald Trump.  His presidency may end early not because he had sex with Stormy Daniels, but because he lied about it.  As the saying goes, the cover-up is often worse than the crime. Maybe there is even a social psychological experiment to prove it, p = .04.

P.S.  There is also a difference between not doing something wrong and doing something right.  Fritz, you can still do the right thing and retract your questionable statement about reverse p-hacking and help new generations to avoid some of the mistakes of the past.

Charles Stangor’s Failed Attempt to Predict the Future

Background

It is 2018, and 2012 is a faint memory.  So much has happened in the world and in psychology over the past six years.

Two events rocked Experimental Social Psychology (ESP) in the year 2011 and everybody was talking about the implications of these events for the future of ESP.

First, Daryl Bem had published an incredible article that seemed to suggest humans, or at least extraverts, have the ability to anticipate random future events (e.g., where an erotic picture would be displayed).

Second, it was discovered that Diederik Stapel had fabricated data for several articles. Several years later, over 50 of his articles have been retracted.

Opinions were divided about the significance of these two events for experimental social psychology.  Some psychologists suggested that these events were symptomatic of a bigger crisis in social psychology.  Others considered them exceptions with few consequences for the future of experimental social psychology.

In February 2012, Charles Stangor tried to predict how these events would shape the future of experimental social psychology in an essay titled "Rethinking my Science."

How will social and personality psychologists look back on 2011? With pride at having continued the hard work of unraveling the mysteries of human behavior, or with concern that the only thing that is unraveling is their discipline?

Stangor’s answer is clear.

“Although these two events are significant and certainly deserve our attention, they are flukes rather than game-changers.”

He describes Bem’s article as a “freak event” and Stapel’s behavior as a “fluke.”

“Some of us probably do fabricate data, but I imagine the numbers are relatively few.”

Stangor is confident that experimental social psychology is not really affected by these two events.

As shocking as they are, neither of these events create real problems for social psychologists

In a radical turn, Stangor then suggests that experimental social psychology will change, not in response to these events, but in response to three other articles.

But three other papers published over the past two years must completely change how we think about our field and how we must conduct our research within it. And each is particularly important for me, personally, because each has challenged a fundamental assumption that was part of my training as a social psychologist.

Student Samples

The first article is a criticism of experimental social psychology for relying too much on first-year college students as participants (Henrich, Heine, & Norenzayan, 2010).  Looking back, there is no evidence that US American psychologists have become more global in their research interests. One reason is that social phenomena are sensitive to the cultural context, and for Americans it is more interesting to study how online dating is changing relationships than to study arranged marriages in more traditional cultures. There is nothing wrong with a focus on a particular culture.  It is not even clear that research articles on prejudice against African Americans were supposed to generalize to the world (how would this research apply to African countries where the vast majority of citizens are black?).

The only change that occurred was not in response to Henrich et al.'s (2010) article, but in response to technological changes that made it easier to conduct research and pay participants online.  Many social psychologists now use the online service MTurk to recruit participants.

Thus, I don’t think this article significantly changed experimental social psychology.

Decline Effect 

The second article, titled "The Truth Wears Off," was published in the weekly magazine The New Yorker.  It made the ridiculous claim that true effects become weaker or may even disappear over time.

The basic phenomenon is that observed findings in the social and biological sciences weaken with time. Effects that are easily replicable at first become less so every day. Drugs stop working over time the same way that social psychological phenomena become more and more elusive. The “the decline effect” or “the truth wears off effect,” is not easy to dismiss, although perhaps the strength of the decline effect will itself decline over time.

The assumption that the decline effect applies to real effects is no more credible than Bem's claims of time-reversed causality.   I am still waiting for the effect of eating cheesecake on my weight (a biological effect) to wear off. My bathroom scale tells me it has not.

Why would Stangor believe in such a ridiculous idea?  The answer is that he observed it many times in his own work.

Frankly I have difficulty getting my head around this idea (I’m guessing others do too) but it is nevertheless exceedingly troubling. I know that I need to replicate my effects, but am often unable to do it. And perhaps this is part of the reason. Given the difficulty of replication, will we continue to even bother? And what becomes of our research if we do even less replicating than we do now? This is indeed a problem that does not seem likely to go away soon. 

In hindsight, it is puzzling that Stangor misses the connection between Bem's (2011) article and the decline effect.   Bem published 9 successful results with p < .05.  This is not a fluke: the probability of getting lucky 9 times in a row, with a probability of just 5% for a single event, is vanishingly small (less than 1 in a billion attempts).  Bem also did not fabricate data like Stapel, but he falsified data to present results that are too good to be true (Definitions of Research Misconduct).  Not surprisingly, neither he nor others can replicate these results in transparent studies that prevent the use of QRPs (just like paranormal phenomena such as spoon bending cannot be replicated in transparent experiments that prevent fraud).
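
To make the arithmetic concrete, here is a one-line check (a sketch in Python; the only input is the nominal 5% significance criterion):

```python
# If each study had only a 5% chance of producing p < .05 (i.e., no true effect),
# the chance of 9 significant results in a row is the product of 9 independent 5% events.
p_single = 0.05
p_nine_in_a_row = p_single ** 9
print(p_nine_in_a_row)  # ~1.95e-12, i.e., far less than 1 in a billion
```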

The decline effect is real, but it is wrong to misattribute it to a decline in the strength of a true phenomenon.  The decline effect occurs when researchers use questionable research practices (John et al., 2012) to fabricate statistically significant results.  Questionable research practices inflate "observed effect sizes" [a misnomer because effects cannot be observed]; that is, the observed mean differences between groups in an experiment.  Unfortunately, social psychologists do not distinguish between "observed effect sizes" and true or population effect sizes. As a result, they believe in a mysterious force that can reduce true effect sizes when, in fact, sampling error moves mean differences in small samples around.

In conclusion, the truth does not wear off because there was no truth to begin with. Bem’s (2011) results did not show a real effect that wore off in replication studies. The effect was never there to begin with.

P-Hacking

The third article mentioned by Stangor did change experimental social psychology.  In this article, Simmons, Nelson, and Simonsohn (2011) demonstrate the statistical tricks experimental social psychologists have used to produce statistically significant results.  They call these tricks p-hacking.  All methods of p-hacking have one common feature: researchers conduct multiple statistical analyses and check the results. When they find a statistically significant result, they stop analyzing the data and report the significant result.  There is nothing wrong with this practice so far, but it essentially constitutes research misconduct when the result is reported without fully disclosing how many attempts were made to get it.  The failure to disclose all attempts is deceptive because the reported result (p < .05) is only valid if a researcher collected data and then conducted a single test of a hypothesis (it does not matter whether this hypothesis was made before or after data collection).  The point is that the moment a researcher presses a mouse button or a key on a keyboard to see a p-value, a statistical test has occurred.  If this p-value is not significant and another test is run to look at another p-value, two tests have been conducted and the risk of a type-I error is greater than 5%. It is no longer valid to claim p < .05 if more than one test was conducted.  With extreme abuse of the statistical method (p-hacking), it is possible to get a significant result even with randomly generated data.
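
The inflation of the type-I error risk is easy to demonstrate with a small simulation. The sketch below (Python, simulated data only; the sample size and the five dependent variables per "study" are arbitrary choices for illustration) reports the best of several tests and shows how far the false positive rate rises above the nominal 5%.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_sims, n_per_group, n_dvs = 10_000, 20, 5   # 5 dependent variables per "study"
false_positives = 0

for _ in range(n_sims):
    # Two groups with no true difference on any dependent variable.
    group1 = rng.normal(size=(n_per_group, n_dvs))
    group2 = rng.normal(size=(n_per_group, n_dvs))
    p_values = [ttest_ind(group1[:, j], group2[:, j]).pvalue for j in range(n_dvs)]
    if min(p_values) < .05:                  # report whichever test "worked"
        false_positives += 1

print(false_positives / n_sims)              # roughly .22 instead of the nominal .05
```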

In 2010, the Publication Manual of the American Psychological Association advised researchers that "omitting troublesome observations from reports to present a more convincing story is also prohibited" (APA).  It is telling that Stangor does not mention this section as a game-changer, because it has been widely ignored by experimental psychologists to this day.  Even Bem's (2011) article that was published in an APA journal violated this rule, but it has not been retracted or corrected so far.

The p-hacking article had a strong effect on many social psychologists, including Stangor.

Its fundamental assertions are deep and long-lasting, and they have substantially affected me. 

Apparently, social psychologists were not aware that some of their research practices undermined the credibility of their published results.

Although there are many ways that I take the comments to heart, perhaps most important to me is the realization that some of the basic techniques that I have long used to collect and analyze data – techniques that were taught to me by my mentors and which I have shared with my students – are simply wrong.

I don't know about you, but I've frequently "looked early" at my data, and I think my students do too. And I certainly bury studies that don't work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially "rewriting the research hypothesis." Over the years my students have asked me about these practices ("What do you recommend, Herr Professor?") and I have routinely, but potentially wrongly, reassured them that in the end, truth will win out.

Although it is widely recognized that many social psychologists p-hacked and buried studies that did not work out, Stangor's essay remains one of the few open admissions that these practices were used; they were not considered unethical, at least until 2010. In fact, social psychologists were trained that telling a good story was essential (Bem, 2001).

In short, this important paper will – must – completely change the field. It has shined a light on the elephant in the room, which is that we are publishing too many Type-1 errors, and we all know it.

Whew! What a year 2011 was – let’s hope that we come back with some good answers to these troubling issues in 2012.

In hindsight, Stangor was right about the p-hacking article. It has been cited over 1,000 times so far, and the term p-hacking is widely used for methods that essentially constitute a violation of research ethics.  P-values are only meaningful if all analyses are reported; failing to disclose analyses that produced inconvenient non-significant results in order to tell a more convincing story constitutes research misconduct according to the guidelines of the APA and the HHS.

Charles Stangor’s Z-Curve

Stangor's essay is valuable in many ways.  One important contribution is the open admission to the use of QRPs before the p-hacking article made Stangor realize that doing so was wrong.   I have been working on statistical methods to reveal the use of QRPs.  It is therefore interesting to see the results of this method when it is applied to data from a researcher who used QRPs.

[Figure: z-curve plot for Charles Stangor (stangor.png)]

This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t and F-values converted into z-scores) in Stangor's articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed).  The steep drop on the left shows that Stangor rarely reported marginally significant results (p = .05 to .10).  It also shows the use of questionable research practices because sampling error should produce a larger number of non-significant results than are actually observed. The grey line provides a rough estimate of the expected proportion of non-significant results. The so-called file drawer (non-significant results that are not reported) is very large.  It is unlikely that so many studies were attempted and not reported. As Stangor mentions, he also used p-hacking to get significant results.  P-hacking can produce just significant results without conducting many studies.
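
For readers unfamiliar with these plots, the conversion step is straightforward. The sketch below (Python; the two test statistics are just examples) follows the general logic described in Brunner and Schimmack (2018) as I understand it: each test statistic is converted into a two-sided p-value and then into the absolute z-score that carries the same strength of evidence.

```python
from scipy.stats import t as t_dist, f as f_dist, norm

def t_to_z(t_value, df):
    p = 2 * t_dist.sf(abs(t_value), df)   # two-sided p-value of the t-test
    return norm.isf(p / 2)                # |z| with the same p-value

def f_to_z(f_value, df1, df2):
    p = f_dist.sf(f_value, df1, df2)      # p-value of the F-test
    return norm.isf(p / 2)

print(t_to_z(2.10, df=28))    # ~2.0: just significant
print(f_to_z(10.94, 2, 40))   # ~3.8: much stronger evidence
```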

In short, the graph is consistent with Stangor's account that he used QRPs in his research, which was common practice and even encouraged, and did not violate any research ethics code of the time (Bem, 2001).

The graph also shows that the significant studies have an estimated average power of 71%.  This means any randomly drawn statistically significant result from Stangor’s articles has a 71% chance of producing a significant result again, if the study and the statistical test were replicated exactly (see Brunner & Schimmack, 2018, for details about the method).  This average is not much below the 80% value that is considered good power.

There are two caveats with the 71% estimate. One caveat is that this graph uses all statistical tests that are reported, but not all of these tests are interesting. Other datasets suggest that the average for focal hypothesis tests is about 20 to 30 percentage points lower than the estimate for all tests. Nevertheless, 71% is above the average for social psychology.

The second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies.  The average for studies with just significant results (z = 2 to 2.5) is only 49%.  It is possible to use the information from this graph to reexamine Stangor’s articles and to adjust nominal p-values.  According to this graph p-values in the range between .05 and .01 would not be significant because 50% power corresponds to a p-value of .05. Thus, all of the studies with a z-score of 2.5 or less (~ p > .01) would not be significant after correcting for the use of questionable research practices.

The main conclusion that can be drawn from this analysis is that the statistical analysis of Stangor’s reported results shows convergent validity with the description of his research practices.  If test statistics by other researchers show a similar (or worse) distribution, it is likely that they also used questionable research practices.

Charles Stangor’s Response to the Replication Crisis 

Stangor was no longer an active researcher when the replication crisis started. Thus, it is impossible to see changes in actual research practices.  However, Stangor co-edited a special issue for the Journal of Experimental Social Psychology on the replication crisis.

The Introduction mentions the p-hacking article.

At the same time, the empirical approaches adopted by social psychologists leave room for practices that distort or obscure the truth (Hales, 2016-in this issue; John, Loewenstein, & Prelec, 2012; Simmons, Nelson, & Simonsohn, 2011)

and that

social psychologists need to do some serious housekeeping in order to progress as a scientific enterprise.

It quotes Dovidio to claim that social psychologists are

lucky to have the problem. Because social psychologists are rapidly developing new approaches and techniques, our publications will unavoidably contain conclusions that are uncertain, because the potential limitations of these procedures are not yet known. The trick then is to try to balance "new" with "careful."

It also mentions the problem of fabricating stories by hiding unruly non-significant results.

The availability of cheap data has a downside, however, which is that there is little cost in omitting data that contradict our hypotheses from our manuscripts (John et al., 2012). We may bury unruly data because it is so cheap and plentiful. Social psychologists justify this behavior, in part, because we think conceptually. When a manipulation fails, researchers may simply argue that the conceptual variable was not created by that particular manipulation and continue to seek out others that will work. But when a study is eventually successful, we don't know if it is really better than the others or if it is instead a Type I error. Manipulation checks may help in this regard, but they are not definitive (Sigall & Mills, 1998).

It also mentions file drawers with unsuccessful studies like the one shown in the figure above.

Unpublished studies likely outnumber published studies by an order of magnitude. This is wasteful use of research participants and demoralizing for social psychologists and their students.

It also mentions that governing bodies have failed to crack down on the use of p-hacking and other questionable practices, but the APA guidelines are not mentioned.

There is currently little or no cost to publishing questionable findings

It foreshadows calls for a more stringent criterion of statistical significance, known as the p-value wars (alpha = .05 vs. alpha = .005 vs. justify your alpha vs. abandon alpha).

Researchers base statistical analyses on the standard normal distribution but the actual tails are probably bigger than this approach predicts. It is clear that p < .05 is not enough to establish the credibility of an effect. For example, in the Reproducibility Project (Open Science Collaboration, 2015), only 18% of studies with a p-value greater than .04 replicated whereas 63% of those with a p-value less than .001 replicated. Perhaps we should require, at minimum, p < .01.

It is not clear why we should settle for p < .01 if only 63% of results replicated with p < .001. Moreover, it ignores that a more stringent criterion for significance also increases the risk of type-II errors (Cohen).  It also ignores that only two studies are required to reduce the risk of a type-I error from .05 to .05*.05 = .0025.  As many articles in experimental social psychology are based on multiple cheap studies, the nominal type-I error rate is well below .001.  The real problem is that the reported results are not credible because QRPs are used (Schimmack, 2012).  A simple and effective way to improve experimental social psychology would be to enforce the APA ethics guidelines and hold violators of these rules accountable for their actions.  However, although no new rules would need to be created, experimental social psychologists are unable to police themselves and continue to use QRPs.
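
The arithmetic behind this point is simple: requiring k independent significant results at alpha = .05 multiplies the error probabilities, as the short sketch below (Python) shows.

```python
# Joint type-I error rate when k independent studies all have to reach p < .05.
alpha = 0.05
for k in range(1, 5):
    print(k, alpha ** k)   # 1: .05, 2: .0025, 3: .000125, 4: .00000625
```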

The Introduction ignores this valid criticism of multiple-study articles and continues to give the misleading impression that more studies translate into more replicable results.  However, the Open Science Collaboration's reproducibility project showed no evidence that long, multiple-study articles reported more replicable results than shorter articles in Psychological Science.

In addition, replication concerns have mounted with the editorial practice of publishing short papers involving a single, underpowered study demonstrating counterintuitive results (e.g., Journal of Experimental Social Psychology; Psychological Science; Social Psychological and Personality Science). Publishing newsworthy results quickly has benefits, but also potential costs (Ledgerwood & Sherman, 2012), including increasing Type 1 error rates (Stroebe, 2016-in this issue).

Once more, the problem is dishonest reporting of results.  A risky study can be published with an honestly reported type-I error rate of 20%, which informs readers that there is a high risk of a false positive result. In contrast, 9 studies with a misleadingly reported type-I error rate of 5% violate the implicit assumption that readers can trust a scientific research article to report the results of an objective test of a scientific question.

But things get worse.

We do, of course, understand the value of replication, and publications in the premier social-personality psychology journals often feature multiple replications of the primary findings. This is appropriate, because as the number of successful replications increases, our confidence in the finding also increases dramatically. However, given the possibility of p-hacking (Head, Holman, Lanfear, Kahn, & Jennions, 2015; Simmons et al., 2011) and the selective reporting of data, replication is a helpful but imperfect gauge of whether an effect is real.

Just like Stangor dismissed Bem's multiple-study article in JPSP as a fluke that does not require further attention, he dismisses evidence that QRPs were used to p-hack other multiple-study articles (Schimmack, 2012).  Ignoring this evidence is just another violation of research ethics. The data that are being omitted here are articles that contradict the story that an author wants to present.

And it gets worse.

Conceptual replications have been the field’s bread and butter, and some authors of the special issue argue for the superiority of conceptual over exact replications (e.g. Crandall & Sherman, 2016-in this issue; Fabrigar and Wegener, 2016–in this issue; Stroebe, 2016-in this issue).  The benefits of conceptual replications are many within social psychology, particularly because they assess the robustness of effects across variation in methods, populations, and contexts. Constructive replications are particularly convincing because they directly replicate an effect from a prior study as exactly as possible in some conditions but also add other new conditions to test for generality or limiting conditions (Hüffmeier, 2016-in this issue).

Conceptual replication is a euphemism for storytelling or, as Sternberg calls it, creative HARKing (Sternberg, in press).  Stangor explained earlier how an article with several conceptual replication studies is constructed.

I certainly bury studies that don’t work, let alone fail to report dependent variables that have been uncooperative. And I have always argued that the researcher has the obligation to write the best story possible, even if may mean substantially “rewriting the research hypothesis.”

This is how Bem advised generations of social psychologists to write articles and that is how he wrote his 2011 article that triggered awareness of the replicability crisis in social psychology.

There is nothing wrong with doing multiple studies and examining conditions that make an effect stronger or weaker.  However, such a program of research becomes pseudo-science if it reports only successful results, because reporting only successes renders statistical significance meaningless (Sterling, 1959).

The miraculous conceptual replications of Bem (2011) are even more puzzling in the context of social psychologists' conviction that their effects can decrease over time (Stangor, 2012) or change dramatically from one situation to the next.

Small changes in social context make big differences in experimental settings, and the same experimental manipulations create different psychological states in different times, places, and research labs (Fabrigar and Wegener, 2016–in this issue). Reviewers and editors would do well to keep this in mind when evaluating replications.

How can effects be sensitive to context while the success rate in published articles is 95%?

And it gets worse.

Furthermore, we should remain cognizant of the fact that variability in scientists’ skills can produce variability in findings, particularly for studies with more complex protocols that require careful experimental control (Baumeister, 2016-in this issue). 

Baumeister is one of the few other social psychologists who has openly admitted not disclosing failed studies.  He also pointed out that in 2008 this practice did not violate APA standards.  However, in 2016 a major replication project failed to replicate the ego-depletion effect that he first “demonstrated” in 1998.  In response to this failure, Baumeister claimed that he had produced the effect many times, suggesting that he has some capabilities that researchers who fail to show the effect lack (in his contribution to the special issue in JESP he calls this ability “flair”).  However, he failed to mention that many of his attempts failed to show the effect and that his high success rate in dozens of articles can only be explained by the use of QRPs.

While there is ample evidence for the use of QRPs, there is no empirical evidence for the claim that research expertise matters.  Moreover, most of the research is carried out by undergraduate students supervised by graduate students, and the expertise of professors is limited to designing studies rather than actually carrying them out.

In the end, the Introduction also comments on the process of correcting mistakes in published articles.

Correctors serve an invaluable purpose, but they should avoid taking an adversarial tone. As Fiske (2016–this issue) insightfully notes, corrective articles should also include their own relevant empirical results — themselves subject to correction.

This makes no sense. If somebody writes an article and claims to find an interaction effect based on a significant result in one condition and a non-significant result in another condition, the article makes a statistical mistake (Gelman & Stern, 2005). If a pre-registration contains the statement that an interaction is predicted and a published article claims an interaction is not necessary, the article misrepresents the nature of the preregistration.  Correcting mistakes like this is necessary for science to be a science.  No additional data are needed to correct factual mistakes in original articles (see, e.g., Carlsson, Schimmack, Williams, & Bürkner, 2017).

Moreover, Fiske has been inconsistent in her assessment of psychologists who have been motivated by the events of 2011 to improve psychological science.  On the one hand, she has called these individuals “method terrorists” (2016 review).  On the other hand, she suggests that psychologists should welcome humiliation that may result from the public correction of a mistake in a published article.

Conclusion

In 2012, Stangor asked “How will social and personality psychologists look back on 2011?” Six years later, it is possible to provide at least a temporary answer. There is no unified response.

The main response by older experimental social psychologists has been denial, along the lines of Stangor's initial response to Stapel and Bem.  Despite massive replication failures and criticism, including criticism by Nobel Laureate Daniel Kahneman, no eminent social psychologist has responded to the replication crisis with an admission of mistakes.  In contrast, the list of eminent social psychologists who stand by their original findings despite evidence for the use of QRPs and replication failures is long and is growing every day as replication failures accumulate.

The response by some younger social psychologists has been to nudge social psychologists slowly towards improving their research methods, mainly by handing out badges for preregistrations of new studies.  Although preregistration makes it more difficult to use questionable research practices, it is too early to see how effective preregistration is in making published results more credible.  Another initiative is to conduct replication studies. The problem with this approach is that the outcome of replication studies can be challenged and so far these studies have not resulted in a consensual correction in the scientific literature. Even articles that reported studies that failed to replicate continue to be cited at a high rate.

Finally, some extremists are asking for more radical changes in the way social psychologists conduct research, but these extremists are dismissed by most social psychologists.

It will be interesting to see how social psychologists, funding agencies, and the general public will look back on 2011 in 2021.  In the meantime, social psychologists have to ask themselves how they want to be remembered, and new investigators have to examine carefully where they want to allocate their resources.  The published literature in social psychology is a minefield, and nobody knows which studies can be trusted.

I don't know about you, but I am looking forward to reading the special issues in 2021 in celebration of the 10-year anniversary of Bem's groundbreaking, or should I say earth-shattering, publication of "Feeling the Future."

Estimating Reproducibility of Psychology (No. 111): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

The article examines anchoring effects.  When individuals are uncertain about a quantity (the price of a house, the height of Mount Everest), their estimates can be influenced by some arbitrary prior number.  The anchoring effect is a robust phenomenon that has been replicated in one of the first large multi-lab replication projects (Klein et al., 2014).

This article, titled "Precision of the anchor influences the amount of adjustment," tested in six studies the hypothesis that the anchoring effect is larger (or the adjustment effect is smaller) when the anchor is precise than when it is rounded.

[Figure: Janiszewski.png]

Study 1

43 students participated in this study that manipulated the type of anchor between subjects (rounded, precise over, precise under; n = 14 per cell).

The main effect of the manipulation was significant, F(2,40) = 10.94.

Study 2

85 students participated in this study.  It manipulated anchor (rounding, precise under) and the range of plausible values (narrow vs. broad).

The study replicated a main effect of anchor, F(1,81) = 22.23.

Study 3

45 students participated in this study.

Study 3 added a condition with information that made the rounded anchor more credible.

The results were significant, F(2,42) = 23.07.  A follow up test showed that participants continued to be more influenced by a precise anchor than by a rounded anchor even with additional information that the rounded anchor was credible, F(1, 42) = 20.80.

Study 4a 

This study was picked for the replication attempt.

As the motivation to adjust increases and the number of units of adjustment increases correspondingly, the amount of adjustment on the coarse-resolution scale should increase at a faster rate than the amount of adjustment on the fine-resolution scale (i.e., motivation to adjust and scale resolution should interact).

The high-motivation-to-adjust condition was created by removing information from the scenarios used in Experiment 2 (the scenarios from Experiment 2 were used without alteration in the low-motivation-to-adjust condition). For example, sentences in the plasma-TV scenario that encouraged a slight adjustment ("items are priced very close to their actual cost . . . actual cost would be only slightly less than $5,000") were replaced with a sentence that encouraged more adjustment ("What is your estimate of the TV's actual cost?").

The width of the scale unit was manipulated with the precision of the anchor (i.e., rounded anchor for broad width and precise anchor for narrow width).

Study 4a had 59 participants.

Study 4a was similar to Study 1 with a manipulation of the width of the scale (i.e., rounded anchor for broad width and precise anchor for narrow width).

Study 4a showed an interaction between the motivation to adjust condition and the Scale Width manipulation, F(1, 55) = 6.88.

Study 4b

Study 4b  had 149 participants and also showed a significant result, F(1,145) = 4.01, p = .047.

Study 5 

Study 5 used home-sales data for 12,581 home sales.  The study found a significant effect of list-price precision on the sales price, F(1, 12577) = 23.88, with list price as a covariate.

Summary

In conclusion, all of the results showed strong statistical evidence against the null hypothesis except for the pair of studies 4a and 4b.  It is remarkable that the close replication Study 4b produced a just significant result with a sample size nearly three times as large as Study 4a (149 vs. 59).  This pattern of results suggests that the sample size is not independent of the result and that the evidence for this effect could be exaggerated by the use of optional stopping (collecting more data until p < .05).
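
Optional stopping is another way in which the nominal type-I error rate becomes misleading. The sketch below (Python, simulated data with no true effect; batch size and maximum sample size are arbitrary choices for illustration) peeks at the p-value after every batch of participants and stops as soon as p < .05. It is not a claim about how these particular studies were run, only a demonstration of why the practice inflates error rates.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
n_sims, batch, max_n = 5_000, 20, 200
false_positives = 0

for _ in range(n_sims):
    g1, g2 = [], []
    while len(g1) < max_n:
        g1.extend(rng.normal(size=batch))   # add a new batch to each group
        g2.extend(rng.normal(size=batch))
        if ttest_ind(g1, g2).pvalue < .05:  # peek, and stop at the first "success"
            false_positives += 1
            break

print(false_positives / n_sims)             # roughly .15-.20 instead of the nominal .05
```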

Replication Study

The replication study did not use Study 5, which was not an experiment.  Study 4a was chosen over 4b because it used a more direct manipulation of motivation.

The replication report states the goal of the replication study as replicating two main effects and the interaction.

The results show a main effect of the motivation manipulation, F(1,116) = 71.06, a main effect of anchor precision, F(1,116) = 6.28, but no significant interaction, F < 1.

The data form shows the interaction effect as the main result in the original study, F(1, 55) = 6.88, but the effect is miscoded as a main effect.  The replication result is entered as the main effect for the anchor precision manipulation, F(1, 116) = 6.28 and this significant result is scored as a successful replication of the original study.

However, the key finding in the original article was the interaction effect.  No statistical tests of main effects were reported in the original article.

In Experiment 4a, there was a Motivation to Adjust x Scale Unit Width interaction, F(1, 55) = 6.88, prep = .947, omega2 = .02. The difference in the amount of adjustment between the rounded and precise-anchor conditions increased as the motivation to adjust went from low (Mprecise = -0.76, Mrounded = -0.23, Mdifference = 0.53), F(1, 55) = 15.76, prep = .994, omega2 = .06, to high (Mprecise = -0.04, Mrounded = 0.98, Mdifference = 1.02), F(1, 55) = 60.55, prep = .996, omega2 = .25.

This leads me to the conclusion that the coding of this study as a successful replication is a mistake. The critical interaction effect was not replicated.

Ethical Challenges for Psychological Scientists

Psychological scientists are human and, like all humans, they can be tempted to violate social norms (Fiske, 2015).  To help psychologists conduct ethical research, professional organizations have developed codes of conduct (APA).  These rules are designed to help researchers resist temptations to engage in unethical practices such as fabricating or falsifying data (Pain, Science, 2008).

Psychological science has ignored the problem of research integrity for a long time. The Association for Psychological Science (APS) still does not have formal guidelines about research misconduct (APS, 2016).

Two eminent psychologists recently edited a book with case studies that examine ethical dilemmas for psychological scientists (Sternberg & Fiske, 2015).  Unfortunately, this book lacks moral fiber and fails to discuss recent initiatives to address the lax ethical standards in psychology.

Many of the brief chapters in this book are concerned with unethical behaviors of students, in clinical settings, or with the ethics of conducting research with animals or human participants.  These chapters have no relevance for the current debates about improving psychological science.  Nevertheless, a few chapters do address these issues, and these chapters show how little eminent psychologists are prepared to address an ethical crisis that threatens the foundation of psychological science.

Chapter 29
Desperate Data Analysis by a Desperate Job Candidate (Jonathan Haidt)

Pursuing a career in science is risky and getting an academic job is hard. After a two-year funded post-doc, I didn't have a job for one year, and I worked hard to get more publications.  Jonathan Haidt was in a similar situation.  He didn't get an academic job after his first post-doc and was lucky to get a second post-doc, but he needed more publications.

He was interested in the link between feelings of disgust and moral judgments.  A common way to demonstrate causality in experimental social psychology is to use an incidental manipulation of the cause (disgust) and to show that the manipulation has an effect on a measure of the effect (moral judgments).

“I was looking for carry-over effects of disgust”

In the chapter, JH tells readers about the moral dilemma he faced when he collected data and the analysis showed the predicted pattern, but it was not statistically significant. This means the evidence was not strong enough to be publishable.  He carefully looked at the data and saw several outliers.  He came up with various reasons to exclude some. Many researchers have been in the same situation, but few have told their story in a book.

I knew I was doing this post hoc, and that it was wrong to do so. But I was so confident that the effect was real, and I had defensible justifications! I made a deal with myself: I would go ahead and write up the manuscript now, without the outliers, and while it was under review I would collect more data, which would allow me to get the result cleanly, including all outliers.

This account contradicts various assertions by psychological scientists that they did not know better or that questionable research practices just happen without intent. JH's story is much more plausible. He needed publications to get a job. He had a promising dataset, and all he was doing was eliminating a few outliers to beat an arbitrary criterion of statistical significance.  So what if the p-value was .11 with the three cases included? The difference between p = .04 and p = .11 is not statistically significant.  Plus, he was not going to rely on these results. He would collect more data.  Surely, there was a good reason to bend the rules slightly or, as Sternberg (2015) calls it, going a couple of miles over the speed limit.  Everybody does it.  JH realized that his behavior was unethical, it just was not significantly unethical (Sternberg, 2015).

Decide That the Ethical Dimension Is Significant. If one observes a driver going one mile per hour over the speed limit on a highway, one is unlikely to become perturbed about the unethical behavior of the driver, especially if the driver is oneself.” (Sternberg, 2015). 

So what if JH was speeding a little bit to get an academic job. He wasn't driving 80 miles per hour in front of an elementary school like Diederik Stapel, who just made up data.  But that is not how this chapter ends.  JH tells us that he never published the results of this study.

Fortunately, I ended up recruiting more participants before finishing the manuscript, and the new data showed no trend whatsoever. So I dropped the whole study and felt an enormous sense of relief. I also felt a mix of horror and shame that I had so blatantly massaged my data to make it comply with my hopes.

What vexes me about this story is that Jonathan Haidt is known for his work on morality and disgust and published a highly cited (> 2,000 citations in Web of Science) article that suggested disgust does influence moral judgments.

Wheatley and Haidt (2001) manipulated somatic markers even more directly. Highly hypnotizable participants were given the suggestion, under hypnosis, that they would feel a pang of disgust when they saw either the word take or the word often.  Participants were then asked to read and make moral judgments about six stories that were designed to elicit mild to moderate disgust, each of which contained either the word take or the word often. Participants made higher ratings of both disgust and moral condemnation about the stories containing their hypnotic disgust word. This study was designed to directly manipulate the intuitive judgment link (Link 1), and it demonstrates that artificially increasing the strength of a gut feeling increases the strength of the resulting moral judgment (Haidt, 2001, Psychological Review). 

A more detailed report of these studies was published a few years later (Wheatley & Haidt, 2005).  Study 1 reported a significant difference between the disgust-hypnosis group and the control group, t(44) = 2.41, p = .020.  Study 2 produced a marginally significant result that was significant in a non-parametric test.

For the morality ratings, there were substantially more outliers (in both directions) than in Experiment 1 or for the other ratings in this experiment. As the paired-samples t test loses power in the presence of outliers, we used its non-parametric analogue, the Wilcoxon signed-rank test, as well (Hollander & Wolfe, 1999). Participants judged the actions to be more morally wrong when their hypnotic word was present (M = 73.4) than when it was absent (M = 69.6), t(62) = 1.74, p = .09, Wilcoxon Z = 2.18, p < .05.

Although JH's account of his failed study suggests he acted ethically, the same story also reveals that he had at least one study that failed to provide support for the moral disgust hypothesis and that was not mentioned in his Psychological Review article.  Disregarding an entire study that ultimately did not support a hypothesis is a questionable research practice, just as removing some outliers is (John et al., 2012; see also the next section about Chapter 35).  However, JH seems to believe that he acted morally.

However, in 2015 social psychologists were well aware that hiding failed studies and other questionable practices undermine the credibility of published findings.  It is therefore particularly troubling that JH was a co-author of another article that failed to mention this study. Schnall, Haidt, Clore, and Jordan (2015) responded to a meta-analysis that suggested the effect of incidental disgust on moral judgments is not reliable and that there was evidence for publication bias (e.g., not reporting the failed study JH mentions in his contribution to the book on ethical challenges).  This would have been a good opportunity to admit that some studies failed to show the effect and that these studies were not reported.  However, the response is rather different.

With failed replications on various topics getting published these days, we were pleased that Landy and Goodwin's (2015) meta-analysis supported most of the findings we reported in Schnall, Haidt, Clore, and Jordan (2008). They focused on what Pizarro, Inbar and Helion (2011) had termed the amplification hypothesis of Haidt's (2001) social intuitionist model of moral judgment, namely that "disgust amplifies moral evaluations—it makes wrong things seem even more wrong (Pizarro et al., 2011, p. 267, emphasis in original)." Like us, Landy and Goodwin (2015) found that the overall effect of incidental disgust on moral judgment is usually small or zero when ignoring relevant moderator variables."

Somebody needs to go back in time and correct JH's Psychological Review article and the hypnosis studies that reported main effects with moderate effect sizes and no moderator effects.  Apparently, even JH no longer believed in these effects in 2015, so it was not important to mention failed studies. However, it might have been relevant to point out that the studies that did report main effects were false positives and what theoretical implications this would have.

More troubling is that the moderator effects are also not robust.  The moderator effects were shown in studies by Schnall and may be inflated by the use of questionable research practices.  In support of this interpretation of her results, a large replication study failed to replicate the results of Schnall et al.’s (2008) Study 3.  Neither the main effect of the disgust manipulation nor the interaction with the personality measure were significant (Johnson et al., 2016).

The fact that JH openly admits to hiding disconfirming evidence, while he would have considered selective deletion of outliers a moral violation, and was ashamed of even thinking about it, suggests that he does not consider hiding failed studies a violation of ethics (but see APA Guidelines, 6th edition, 2010).  This confirms Sternberg’s (2015) first observation about moral behavior.  A researcher needs to define an event as having an ethical dimension to act ethically.  As long as social psychologists do not consider hiding failed studies unethical, reported results cannot be trusted to be objective fact. Maybe it is time to teach social psychologists that hiding failed studies is a questionable research practice that violates scientific standards of research integrity.

Chapter 35
“Getting it Right” Can also be Wrong by Ronnie Janoff-Bulman 

This chapter provides the clearest introduction to the ethical dilemma that researchers face when they report the results of their research.  JB starts with a typical example that all empirical psychologists have encountered.  A study showed a promising result, but a second study failed to show the desired and expected result (p > .10).  She then did what many researchers do. She changed the design of the study (a different outcome measure) and collected new data.  There is nothing wrong with trying again because there are many reasons why a study may produce an unexpected result.  However, JB also makes it clear that the article would not include the non-significant results.

“The null-result of the intermediary experiment will not be discussed or mentioned, but will be ignored and forgotten.” 

The suppression of the failed study is called a questionable research practice (John et al., 2012).  The Publication Manual of the APA considers this unethical reporting of research results.

JB makes it clear that hiding failed studies undermines the credibility of published results.

“Running multiple versions of studies and ignoring the ones that “didn’t work” can have far-reaching negative effects by contributing to the false positives that pervade our field and now pass for psychological knowledge. I plead guilty.”

JB also explains why it is wrong to neglect failed studies. Running study after study to get a successful outcome "is likely [to] capitalize on chance, noise, or situational factors and increase the likelihood of finding a significant (but unreliable) effect."

This observation is by no means new. Sterling (1959) pointed out that publication bias (publishing only p-values below .05) essentially increases the risk of a false positive result from the nominal level of 5% to an actual level of 100%.  Even evidently false results will produce only significant results in the published literature if failures are not reported (Bem, 2011).
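
Sterling's point is easy to illustrate with a small simulation. The sketch below (Python, simulated data with no true effect; the sample sizes and number of studies are arbitrary) applies the significance filter to a batch of null studies: the nominal 5% error rate still holds across all studies that were run, but every result that survives the filter is a false positive.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# 2,000 two-group "studies" with no true effect.
p_values = np.array([
    ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
    for _ in range(2_000)
])

published = p_values[p_values < .05]   # the significance filter
print(len(published) / len(p_values))  # ~.05: the nominal rate across all studies run
print(len(published))                  # every one of these "published" results is a false positive
```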

JB asks what can be done about this.  Apparently, JB is not aware of recent developments in psychological science that range from statistical tests that reveal missing studies (like an X-ray for locked file drawers) to the preregistration of studies that will be published without a significance filter.

Although utterly unlikely given current norms, reporting that we didn’t find the effect in a previous study (and describing the measures and manipulations used) would be broadly informative for the field and would benefit individual researchers conducting related studies. Certainly publication of replications by others would serve as a corrective as well.

It is not clear why publishing non-significant results is considered utterly unlikely in 2015, if the 2010 APA Publication Manual mandates publication of these studies.

Despite her pessimism about the future of Psychological Science, JB has a clear vision of how psychologists could improve their science.

A major, needed shift in research and publication norms is likely to be greatly facilitated by an embrace of open access publishing, where immediate feedback, open evaluations and peer reviews, and greater communication among researchers (including replications and null results) hold the promise of opening debate and discussion of findings. Such changes would help preclude false-positive effects from becoming prematurely reified as facts; but such changes, if they are to occur, will clearly take time.

The main message of this chapter is that researchers in psychology have been trained to chase significance because obtaining statistical significance by all means was considered a form of creativity and good research (Sternberg, 2018).  Unfortunately, this is wrong. Statistical significance is only meaningful if it is obtained the right way and in an open and transparent manner.

Chapter 33: Commentary to Part V, by Susan T. Fiske

It was surprising to read Fiske’s (2015) statement that “contrary to human nature, we as scientists should welcome humiliation, because it shows that the science is working.”

In marked contrast to this quote, Fiske has attacked psychologists who are trying to correct some of the errors in published articles as "method terrorists."

I personally find both statements problematic. Nobody should welcome humiliation, and nobody who points out errors in published articles is a terrorist.  Researchers should simply realize that publications in peer-reviewed journals can still contain errors and that it is part of the scientific process to correct these errors.  The biggest problem of the past seven years was not that psychologists made mistakes, but that they resisted efforts to correct mistakes that arose from a flawed understanding of the scientific method.

Chapter 36: Commentary to Part VI, by Susan T. Fiske

Social psychologists have justified not reporting failed studies (cf. the Jonathan Haidt example) by calling these studies pilot studies (Bem, 2011).  Bem pointed out that social psychologists have a lot of these pilot studies.  But a pilot study is not a study that tests the cause-effect relationship. A pilot study tests either whether a manipulation is effective or whether a measure is reliable and valid.  It is simply wrong to treat a study that tests the effect of a manipulation on an outcome as a pilot study just because the study did not work.

"However, few of the current proposals for greater transparency recommend describing each and every failed pilot study."

The next statement makes it clear that Fiske conflates pilot studies with failed studies.

As noted, the reasons for failures to produce a given result are multiple, and supporting the null hypothesis is only one explanation. 

Yes, but it is one plausible explanation and not disclosing the failure renders the whole purpose of empirical hypothesis testing irrelevant (Sterling, 1959).

“Deciding when one has failed to replicate is a matter of persistence and judgment.”

No, it is not. Preregister the study: if you are willing to report a significant result when you obtain one, you have to report the non-significant result when you do not. Everything else is not science, and Susan Fiske seems to lack an understanding of the most basic reason for conducting an experiment.

What is an ethical scientist to do? One resolution is to treat a given result – even if it required fine-tuning to produce – as an existence proof: This result demonstrably can occur, at least under some circumstances. Over time, attempts to replicate will test generalizability.

This statement ignores that the observed pattern of results is heavily influenced by sampling error, especially in the typical between-subject design with small samples that is so popular in experimental social psychology.  A mean difference between two groups does not mean that anything happened in this study. It could just be sampling error.  But maybe the thought that most of the published results in experimental social psychology are just errors is too much to bear for somebody at the end of her career.

I have watched the replication crisis unfold over the past seven years since Bem (2011) published the eye-opening, ridiculous claims about feeling the future location of randomly displayed erotica. I cannot predict random events in the future, but I can notice trends, and I do have a feeling that the future will not look kindly on those who tried to stand in the way of progress in psychological science. A new generation of psychologists is learning every day about replication failures and how to conduct better studies.  For the older generation there are only two choices: step aside, or help the new generation learn from the mistakes of the past.

P.S. I think there is a connection between morality and disgust but it (mainly) goes from immoral behaviors to disgust.  So let me tell you, psychological science, Uranus stinks.

Estimating Reproducibility of Psychology (No. 136): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors.  Each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article.  These predictions will only be accurate if the replication studies were close replications of the original study.  Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Special Introduction

The authors of this article are prominent figures in the replication crisis of social psychology.  Kathleen Vohs was a co-author of a highly criticized article that suggested will-power depends on blood-glucose levels. The evidence supporting this claim has been challenged for methodological and statistical reasons (Kurzban, 2010; Schimmack, 2012).  She also co-authored numerous articles on ego-depletion that are difficult to replicate (Schimmack, 2016; Inzlicht, 2016).  In a response with Roy Baumeister titled "A misguided effort with elusive implications," she dismissed these problems, but her own replication project produced very similar results (SPSP, 2018). Some of her social priming studies also failed to replicate (Vadillo, Hardwicke, & Shanks, 2016).

[Figure: z-curve plot for Kathleen Vohs]

The z-curve plot for Vohs shows clear evidence that her articles contain too many significant results (75% significant results, including marginally significant ones, with only 55% average power).  The average probability of successfully replicating a randomly drawn finding from Vohs's articles is 55%. However, this average is obtained with substantial heterogeneity.  Just significant z-scores (2 to 2.5) have only an average estimated replicability of 33%. Even z-scores in the range from 2.5 to 3 have only an average replicability of 40%.   This suggests that p-values in the range between .05 and .005 are unlikely to replicate in exact replication studies.

In a controversial article with the title “The Truth Wears Off“, Jonathan Schooler even predicted that replication studies might often fail.  The article was controversial because Schooler suggested that all effects diminish over time (I wish this were true for the effect of eating chocolate on weight, but so far it hasn’t happened).  Schooler is also known for an influential article about “verbal overshadowing” in eyewitness identifications.  Francis (2012) demonstrated that the published results were too good to be true, and the first Registered Replication Report failed to replicate one of the five studies and replicated another one only with a much smaller effect size.

[Figure: z-curve plot for Jonathan Schooler]

The z-curve plot for Schooler looks very different. The average estimated power is higher.  However, there is a drop at z = 2.6 that is difficult to explain with a normal sampling distribution.

Based on this context information, predictions about replicability depend on the p-values of the actual studies.  Just-significant p-values are unlikely to replicate, but results with smaller p-values (larger z-scores) might.

Summary of Original Article

The article examines moral behavior.  The main hypothesis is that beliefs about free will vs. determinism influence cheating.  Whereas belief in free will encourages moral behavior,  beliefs that behavior is determined make it easier to cheat.

Study 1

30 students were randomly assigned to one of two conditions (n = 15).

Participants in the anti-free-will condition read a passage written by Francis Crick, a Nobel Laureate, suggesting that free will is an illusion.  In the control condition, they read about consciousness.

Then participants were asked to work on math problems on a computer. They were given a cover story that the computer program had a glitch and would present the correct answers, but they could fix this problem by pressing the space bar as soon as the question appeared.  They were asked to do so and to try to solve the problems on their own.

It is not mentioned whether participants were probed for suspicion or whether data from all participants were included in the analysis.

The main finding was that participants cheated more in the “no-free-will” condition than in the control condition, t(28) = 3.04, p = .005.
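The reported test statistic and p-value are at least internally consistent.  A quick check (a minimal sketch in Python; the variable names are mine):

```python
# Consistency check for Study 1: t(28) = 3.04 with a two-tailed test implies p ~ .005.
from scipy.stats import t

t_value, df = 3.04, 28
p_two_tailed = 2 * t.sf(t_value, df)
print(round(p_two_tailed, 4))  # ~0.0051, consistent with the reported p = .005
```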

Study 2

Study 2 addressed several limitations of Study 1. Although the overall sample size was larger, the design included five conditions (n = 24-25 per condition), so the number of participants per cell remained small.

The main dependent variable was the number of correct answers on 15 reading comprehension, mathematical, and logic problems that were used by Vohs in a previous study (Schmeichel, Vohs, & Baumeister, 2003).  For each correct answer, participants received $1.

Two conditions manipulated free-will beliefs, but participants could not cheat. The comparison of these two conditions shows whether the manipulation influences actual performance; based on the figure, there was no major difference ($7.50 control vs. $7.00 no-free-will).

In the cheating (self-paid) conditions, the experimenter received a fake phone call, told the participants that they had to leave and that the participants should continue, score their own answers, and pay themselves.  Surprisingly, neither the free-will nor the neutral condition showed any signs of cheating ($7.20 and $7.30, respectively).  However, the determinism condition increased the average payout to $10.50.

One problem for the statistical analysis is that the researchers “did not have participants’ answer sheets in the three self-paid conditions; therefore, we divided the number of $1 coins taken by each group by the number of group members to arrive at an average self-payment” (p. 52).

The authors then report a significant ANOVA result, F(4, 114) = 5.68, p = .0003.

However, without individual responses or at least the standard deviation in each cell, it is not possible to compute an analysis of variance.  How this F-value was obtained is not explained in the article.
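To illustrate why group averages alone are not enough, here is a sketch of a one-way ANOVA computed from summary statistics: the within-cell standard deviations enter the error term, so they cannot be omitted.  The cell means below are the approximate payouts described above; the standard deviations and exact cell sizes are assumed purely for illustration, so the resulting F will not match the reported F(4, 114) = 5.68.

```python
# One-way ANOVA from summary statistics: the F-ratio needs the within-cell
# standard deviations (for the error term), not just the cell means.
import numpy as np

def anova_from_summaries(means, sds, ns):
    means, sds, ns = map(np.asarray, (means, sds, ns))
    k, N = len(means), ns.sum()
    grand_mean = np.sum(ns * means) / N
    ss_between = np.sum(ns * (means - grand_mean) ** 2)
    ss_within = np.sum((ns - 1) * sds ** 2)          # requires the cell SDs
    df_between, df_within = k - 1, N - k
    F = (ss_between / df_between) / (ss_within / df_within)
    return F, df_between, df_within

# Cell means are the approximate payouts reported in the article; the SDs and
# exact cell sizes are assumed here for illustration only.
F, df1, df2 = anova_from_summaries(
    means=[7.50, 7.00, 7.20, 7.30, 10.50],
    sds=[2.5, 2.5, 2.5, 2.5, 2.5],   # assumed values
    ns=[24, 24, 24, 24, 23],         # assumed split; matches df = (4, 114)
)
print(F, df1, df2)  # F depends entirely on the assumed SDs
```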

Replication 

The replication team also had some problems with Study 2.

We originally intended to carry out Study 2, following the Reproducibility Project’s system of working from the back of an article. However, on corresponding with the authors we discovered that it had arisen in post-publication correspondence about analytic methods that the actual effect size found was smaller than reported, although the overall conclusion remained the same. 

As a result, they decided to replicate Study 1.

The sample size of the replication study was nearly twice as large as the sample of the original study (N = 58 vs. 30).

The results did not replicate the significant result of the original study, t(56) = 0.77, p = .44.

Conclusion

Study 1 was underpowered.  Even nearly doubling the sample size was not sufficient to obtain significance in the replication study.   Study 2 was superior in some respects, but it was reported so poorly that the replication team decided not to attempt a replication of it.
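To put a number on “underpowered”, a rough sketch: the original t-value implies a standardized mean difference above 1, which is implausibly large for this kind of manipulation.  If the true effect were a more modest d = 0.5 (an assumed value for illustration, not taken from the article), the replication with N = 58 would have had less than 50% power.

```python
# Rough power calculation for the failed replication of Study 1.
import numpy as np
from scipy.stats import t, nct

# Standardized mean difference implied by the original result, t(28) = 3.04, n = 15 per cell.
n1 = n2 = 15
d_original = 3.04 * np.sqrt(1 / n1 + 1 / n2)
print(f"implied d in the original study: {d_original:.2f}")  # ~1.11

# Power of the replication (N = 58, ~29 per cell) if the true effect were d = 0.5
# (assumed value for illustration only).
m1 = m2 = 29
df = m1 + m2 - 2
nc = 0.5 * np.sqrt(m1 * m2 / (m1 + m2))   # noncentrality for d = 0.5
t_crit = t.ppf(0.975, df)
power = nct.sf(t_crit, df, nc) + nct.cdf(-t_crit, df, nc)
print(f"power of the replication for d = 0.5: {power:.2f}")  # ~0.46
```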