
Confused about statistics? Read more Cohen (and less Loken & Gelman)

Background. The Loken and Gelman article “Measurement Error and the Replication Crisis” created a lot of controversy in the Psychological Methods Discussion Group. I believe the article is confusing and potentially misleading. For example, the authors do not clearly distinguish between unstandardized and standardized effect size measures, although random measurement error has different consequences for each. I think a blog post by Gelman makes clear what the true purpose of the article is:

We talked about why traditional statistics are often counterproductive to research in the human sciences.

This explains why the article tries to construct one more fallacy in the use of traditional statistics, but fails to point out a simple solution that avoids this fallacy. Moreover, I argue below that Loken and Gelman committed several fallacies of their own in their attempt to discredit t-values and significance testing.

I asked Gelman to clarify several statements that made no sense to me. 

 “It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high”  (Loken and Gelman)

Ulrich Schimmack
Would you say that there is no meaningful difference between a z-score of 2 and a z-score of 4? These z-scores are significantly different from each other. Why would we not say that a study with a z-score of 4 provides stronger evidence for an effect than a study with a z-score of 2?

  • Andrew says:

    Ulrich:

    Sure, fair enough. The z-score provides some information. I guess I’d just say it provides less information than people think.

 

I believe that the article contains many more statements that are misleading and do not inform readers about how t-values and significance testing work. Maybe the article is not as bad as I think it is, but I am pretty sure that it provides less information than people think.

In contrast, Jacob Cohen provided clear and instructive recommendations for psychologists to improve their science. If psychologists had listened to him, we wouldn’t have a replication crisis.

The main points to realize about random measurement error and replicability are:

1.  Neither population nor sample mean differences (or covariances) are effect sizes. They are statistics that provide some information about effects and the magnitude of effects. The main problem in psychology has been the interpretation of mean differences in small samples as “observed effect sizes.” Effects cannot be observed.

2.  Point estimates of effect sizes vary from sample to sample. It is incorrect to interpret a point estimate as precise information about the size of an effect in a sample or a population. To avoid this problem, researchers should always report a confidence interval of plausible effect sizes. In small samples with just significant results, these intervals are wide and often close to zero. Thus, no researcher should interpret a moderate to large point estimate when effect sizes close to zero are also consistent with the data.

3.  Random measurement error creates more uncertainty about effect sizes. It has no systematic effect on unstandardized effect sizes, but it systematically lowers standardized effect sizes (correlations, Cohen’s d, amount of explained variance); a numeric sketch of this attenuation follows the list.

4.  Selection for significance inflates standardized and unstandardized effect size estimates.  Replication studies may fail if original studies were selected for significance, depending on the amount of bias introduced by selection for significance (this is essentially regression to the mean).

5. As random measurement error attenuates standardized effect sizes,  selection for significance partially corrects for this attenuation.  Applying a correction formula (Spearman) to estimates after selection for significance would produce even more inflated effect size estimates.

6.  The main cause of the replication crisis is undisclosed selection for significance.  Random measurement error has nothing to do with the replication crisis because random measurement error has the same effect on original and replication studies. Thus, it cannot explain why an original study was significant and a replication study failed to be significant.
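To make points 3 and 5 concrete, here is a minimal numeric sketch in Python (the true effect of d = .5 is a hypothetical illustration; the 25% reliability matches the example used later in this post). It shows that random error attenuates a standardized mean difference by the square root of the reliability and that a Spearman-type correction recovers it; as point 5 notes, applying such a correction is only appropriate when the estimate has not already been inflated by selection for significance.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100_000                    # per group; large so sampling error is negligible
true_d = 0.5                   # hypothetical true standardized mean difference
reliability = 0.25             # hypothetical reliability of the noisy measure

group = np.repeat([0, 1], n)
true_score = rng.normal(group * true_d, 1.0)           # true scores, SD = 1
# error variance of (1 - rel)/rel makes true-score variance = rel * observed variance
error_sd = np.sqrt((1 - reliability) / reliability)
observed = true_score + rng.normal(0, error_sd, 2 * n)

def cohens_d(x, g):
    m1, m0 = x[g == 1].mean(), x[g == 0].mean()
    pooled_sd = np.sqrt((x[g == 1].var(ddof=1) + x[g == 0].var(ddof=1)) / 2)
    return (m1 - m0) / pooled_sd

d_true = cohens_d(true_score, group)        # ~ .50
d_obs = cohens_d(observed, group)           # ~ .25 = .50 * sqrt(.25): attenuation
d_corrected = d_obs / np.sqrt(reliability)  # ~ .50: Spearman-type disattenuation
print(round(d_true, 2), round(d_obs, 2), round(d_corrected, 2))
```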

Questionable Claims in Loken and Gelman’s Backpack Article

If you learned that a friend had run a mile in 5 minutes, you would be respectful; if you learned she had done it while carrying a heavy backpack, you would be awed. The obvious inference is that she would have been even faster without the backpack.

This makes sense. We assume that our friend’s running ability is relatively fixed, that everybody is slower with a heavy backpack, that the distance really is a mile, that the clock was working properly, and that no magic potion or tricks were involved. As a result, we expect very little variability in our friend’s performance and an even faster time without the backpack.

But should the same intuition always be applied to research findings? Should we assume that if statistical significance is achieved in the presence of measurement error, the associated effects would have been stronger without noise?

How do we translate this analogy? Let’s say running 1 mile in 5 minutes corresponds to statistical significance. Any time below 5 minutes is significant and any time longer than 5 minutes is not significant. The friend’s ability is the sample size: the larger the sample size, the easier it is to get a significant result. Finally, the backpack is measurement error. Just like a heavy backpack makes it harder to run 1 mile in 5 minutes, more measurement error makes it harder to get significance.

The question is whether it follows that the “associated effects” (the mean differences or regression coefficients that are used to estimate effect sizes) would have been stronger without random measurement error.

The answer is no.  This may not be obvious, but it directly follows from basic introductory statistics, like the formula for the t-statistic.

t-value = (mean difference / SD) * sqrt(N)/2   (for two equal groups with a total sample size of N)

where SD reflects the variability of the construct in the population plus additional variability due to measurement error. So, measurement error increases the SD in the denominator of the t-value, but it has no systematic effect on the mean difference in the numerator.
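A quick simulation makes this concrete (a sketch with hypothetical numbers borrowed from the height example below: a 1 cm true difference and a standard deviation of 10 cm): adding random measurement error leaves the expected mean difference unchanged, doubles the standard deviation, and therefore roughly halves the t-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100_000                       # per group; huge so the pattern is not obscured by chance
true_diff, sd_true = 1.0, 10.0    # hypothetical: 1 cm difference, SD = 10 cm

g0 = rng.normal(0, sd_true, n)
g1 = rng.normal(true_diff, sd_true, n)

# the same construct measured with an unreliable instrument (reliability .25, SD doubles)
error_sd = np.sqrt(3) * sd_true
g0_noisy = g0 + rng.normal(0, error_sd, n)
g1_noisy = g1 + rng.normal(0, error_sd, n)

for label, a, b in [("reliable", g0, g1), ("noisy   ", g0_noisy, g1_noisy)]:
    t, p = stats.ttest_ind(b, a)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    print(label, "mean diff:", round(b.mean() - a.mean(), 2),
          "SD:", round(pooled_sd, 1), "t:", round(t, 1))
# the mean difference is ~1.0 in both rows; the SD doubles and the t-value is roughly halved
```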

We caution against the fallacy of assuming that that which does not kill statistical significance makes it stronger. 

With all due respect for trying to make statistics accessible, there is a trade-off between accessibility and sensibility.   First, statistical significance cannot be made stronger. A finding is either significant or it is not.  Surely a test-statistic like a t-value can be (made) stronger or weaker depending on changes in its components.  If we interpret “that which does not kill” as “obtaining a significant result with a lot of random measurement error” it is correct to expect a larger t-value and stronger evidence against the null-hypothesis in a study with a more reliable measure.  This follows directly from the effect of random error on the standard deviation in the denominator of the formula. So how can it be a fallacy to assume something that can be deduced from a mathematical formula? Maybe the authors are not talking about t-values.

It is understandable, then, that many researchers have the intuition that if they manage to achieve statistical significance under noisy conditions, the observed effect would have been even larger in the absence of noise.  As with the runner, they assume that without the burden—that is, uncontrolled variation—their effects would have been even larger.

Although this statement makes it clear that the authors are not talking about t-values, it is not clear why researchers should have the intuition that a study with a more reliable measure should produce larger effect sizes.  As shown above, random measurement error adds to the variability of observations, but it has no systematic effect on the mean difference or regression coefficient.

Now the authors introduce a second source of bias. Unlike random measurement error, this error is systematic and can lead to inflated estimates of effect sizes.

The reasoning about the runner with the backpack fails in noisy research for two reasons. First, researchers typically have so many “researcher degrees of freedom”—unacknowledged choices in how they prepare, analyze, and report their data—that statistical significance is easily found even in the absence of underlying effects and even without multiple hypothesis testing by researchers. In settings with uncontrolled researcher degrees of freedom, the attainment of statistical significance in the presence of noise is not an impressive feat.

The main reason for inferential statistics is to generalize results from a sample to a wider population. The problem with these inductive inferences is that results vary from sample to sample. This variation is called sampling error. Sampling error is separate from measurement error: even studies with perfect measures have sampling error, and sampling error is inversely related to sample size (roughly 2/sqrt(N) in standard deviation units for a two-group comparison). Sampling error alone is again unbiased; it can produce larger or smaller mean differences. However, if studies are split into significant and non-significant studies, the mean differences of significant results are inflated estimates and the mean differences of non-significant results are deflated estimates of the population mean difference. So, effect size estimates in studies that are selected for significance are inflated. This is true even in studies with reliable measures.
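A small simulation of this significance filter (a sketch using the hypothetical height example introduced below: a true difference of 1 cm, SD = 10 cm, 200 participants per group): across all simulated studies the estimated mean difference is unbiased, but across only the significant studies it is badly inflated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_diff, sd, n = 1.0, 10.0, 200     # hypothetical: 1 cm difference, SD 10 cm, n = 200 per group

estimates, significant = [], []
for _ in range(5000):
    g0 = rng.normal(0, sd, n)
    g1 = rng.normal(true_diff, sd, n)
    t, p = stats.ttest_ind(g1, g0)
    estimates.append(g1.mean() - g0.mean())
    significant.append(p < .05 and t > 0)

estimates, significant = np.array(estimates), np.array(significant)
print("all studies:        ", estimates.mean().round(2))               # ~1.0 (unbiased)
print("significant studies:", estimates[significant].mean().round(2))  # ~2.5 (inflated)
```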

In a study with noisy measurements and small or moderate sample size, standard errors will be high and statistically significant estimates will therefore be large, even if the underlying effects are small.

To give an example, assume there is a height difference of 1 cm between brown-eyed and blue-eyed individuals and the standard deviation of height is 10 cm. A study with 400 participants (200 per group) has a sampling error of 2 * 10 cm / sqrt(400) = 1 cm. To achieve significance, the mean difference has to be about twice as large as the sampling error (t = 2 ~ p = .05). Thus, a significant result requires a mean difference of about 2 cm, which is 100% larger than the population mean difference in height.

Another researcher uses an unreliable measure of height (25% reliability) that quadruples the variance (100 cm^2 vs. 400 cm^2) and doubles the standard deviation (10 cm vs. 20 cm). The sampling error also doubles to 2 cm, and now a mean difference of 4 cm is needed to achieve significance with the same t-value of 2 as in the study with the perfect measure.

The mean difference is two times larger than before and four times larger than the mean difference in the population.

The fallacy would be to look at this difference of 4 cm and to believe that an even larger difference could have been obtained with a more reliable measure. This is a fallacy, but not for the reasons the authors suggest. The fallacy is to assume that random measurement error in the measure of height reduced the estimate of 4 cm and that an even bigger difference would be obtained with a more reliable measure. This is a fallacy because random measurement error does not influence the mean difference of 4 cm. Instead, it increased the standard deviation and thereby the sampling error; with a more reliable measure the sampling error would be smaller (1 cm instead of 2 cm) and the same mean difference of 4 cm would have a t-value of 4 rather than 2, which is significantly stronger evidence for an effect.
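For completeness, here is the arithmetic behind these thresholds as a short sketch (same hypothetical numbers; the just-significant difference is the critical t-value times the standard error):

```python
from scipy import stats

n_per_group = 200
for label, sd in [("reliable measure (SD 10 cm)", 10.0), ("noisy measure (SD 20 cm)   ", 20.0)]:
    se = sd * (2 / n_per_group) ** 0.5                 # standard error of the mean difference
    t_crit = stats.t.ppf(0.975, df=2 * n_per_group - 2)
    print(label, "| SE =", round(se, 2), "cm | just-significant difference ≈",
          round(t_crit * se, 2), "cm")
# reliable: SE ≈ 1 cm, threshold ≈ 2 cm;  noisy: SE ≈ 2 cm, threshold ≈ 4 cm
```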

How can the authors overlook that random measurement error has no influence on mean differences?  The reason is that they do not clearly distinguish between standardized and unstandardized estimates of effect sizes.

Spearman famously derived a formula for the attenuation of observed correlations due to unreliable measurement. 

Spearman’s formula applies to correlation coefficients and correlation coefficients are standardized measures of effect sizes because the covariance is divided by the standard deviations of both variables.  Similarly Cohen’s d is a standardized coefficient because the mean difference is divided by the pooled standard deviation of the two groups.

Random measurement error clearly does influence standardized effect size estimates, because the standard deviation is used to standardize them.

The true population mean difference of 1 cm divided by the population standard deviation  of 10 cm yields a Cohen’s d = .10; that is one-tenth of a standard deviation difference.

In the example, the mean difference for a just significant result with a perfect measure was 2 cm, which yields a Cohen’s d of 2 cm divided by 10 cm = .20, two-tenths of a standard deviation.

The mean difference for a just significant result with a noisy measure was 4 cm, which yields a standardized effect size of 4 cm divided by 20 cm = .20, also two-tenths of a standard deviation.

Thus, the inflation of the mean difference is proportional to the increase in the standard deviation.  As a result, the standardized effect size is the same for the perfect measure and the unreliable measure.

Compared to the true mean difference of one-tenth of a standard deviation, the standardized effect sizes are both inflated by the same amount (d = .20 vs. d = .10, 100% inflation).

This example shows the main point the authors are trying to make. Standardized effect size estimates are attenuated by random measurement error. At the same time, random measurement error increases sampling error, and the mean difference has to be larger to reach significance. This inflation already compensates for the attenuation of standardized effect sizes, and any additional correction for unreliability with the Spearman formula would inflate effect size estimates further rather than correct for attenuation.
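The following sketch simulates this compensation effect (same hypothetical numbers as above): after selecting only significant results, the average Cohen's d is roughly the same for the perfect measure and the unreliable measure, and both are inflated relative to the true standardized effects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
true_diff, n = 1.0, 200                     # 1 cm true difference, n = 200 per group

def mean_significant_d(sd, n_sim=5000):
    """Average Cohen's d among simulated studies that reach p < .05 in the right direction."""
    ds = []
    for _ in range(n_sim):
        g0 = rng.normal(0, sd, n)
        g1 = rng.normal(true_diff, sd, n)
        t, p = stats.ttest_ind(g1, g0)
        if p < .05 and t > 0:
            pooled_sd = np.sqrt((g0.var(ddof=1) + g1.var(ddof=1)) / 2)
            ds.append((g1.mean() - g0.mean()) / pooled_sd)
    return np.mean(ds)

print("perfect measure (SD 10 cm):    d ≈", round(mean_significant_d(10.0), 2))  # ~.25
print("unreliable measure (SD 20 cm): d ≈", round(mean_significant_d(20.0), 2))  # ~.24
# both are far above the true standardized effects of d = .10 and d = .05, respectively
```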

This would have been a noteworthy observation, but the authors suggest that random measurement error can have even more paradoxical effects on effect size estimates.

But in the small-N setting, this will not hold; the observed correlation can easily be larger in the presence of measurement error (see the figure, middle panel).

This statement is confusing because the most direct effect of measurement error on standardized effect sizes is attenuation. In the height example, any observed mean difference is divided by 20 rather than 10, reducing the standardized effect size by 50%. The variability of these standardized effect sizes is simply a function of sample size and therefore equal. Thus, it is not clear how a study with more measurement error can produce larger standardized effect sizes. As demonstrated above, the inflation produced by the significance filter at most compensates for the attenuation due to random measurement error. There is simply no paradox by which researchers can obtain stronger evidence (larger t-values or larger standardized effect sizes) with noisier measures, even if results are selected for significance.

Our concern is that researchers are sometimes tempted to use the “iron law” reasoning to defend or justify surprisingly large statistically significant effects from small studies. If it really were true that effect sizes were always attenuated by measurement error, then it would be all the more impressive to have achieved significance.

This makes no sense. If random measurement error attenuates effect sizes, it cannot be used to justify surprisingly large mean differences. Either we are talking about unstandardized effect sizes, which are not influenced by measurement error, or we are talking about standardized effect sizes, which are attenuated by measurement error, so obtaining large estimates is surprising either way. If the true mean difference is 1 cm and an effect of 4 cm is needed to get significance with SD = 20 cm, it is surprising to get significance because the power to do so is only about 7%. Of course, it is only surprising if we knew that the population mean difference is only 1 cm, but the main point is that we cannot use random measurement error to justify large effect sizes because random measurement error always attenuates standardized effect size estimates.

More confusing claims follow.

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

As explained above, random measurement error makes t-values weaker, not stronger. It therefore makes no sense to attribute strong t-values to random measurement error as an additional source of variance. The most likely explanation for large effect size estimates in studies with large sampling error is selection for significance, not random measurement error.

After all of these confusing claims the authors end with a key point.

A key point for practitioners is that surprising results from small studies should not be defended by saying that they would have been even better with improved measurement.

This is true: such a defense is not a logical argument, and it is also not an argument researchers actually make. The bigger problem is that researchers do not realize that the significance filter makes it necessary to find moderate to large effect size estimates and that sampling error in small samples alone can produce these estimates, especially when questionable research practices are being used. No claims about hypothetically larger effect sizes are necessary or regularly made.

Next the authors simply make random statements about significance testing that reveal their ideological bias rather than adding to the understanding of t-values.

It is a common mistake to take a t-ratio as a measure of strength of evidence and conclude that just because an estimate is statistically significant, the signal-to-noise level is high.

Of course, the t-value is a measure of the strength of evidence against the null-hypothesis, typically the hypothesis that the data were obtained without a mean difference in the population.  The larger the t-value, the less likely it is that the observed t-value could have been obtained without a population mean difference in the direction of the mean difference in the sample.  And with t-values of 4 or higher, published results also have a high probability of replicating a significant result in a replication study (Open Science Collaboration, 2015).  It can be debated whether a t-value of 2 is weak, moderate or strong evidence, but it is not debatable whether t-values provide information that can be used for inductive inferences.  Even Bayes-Factors rely on t-values.  So, the authors’ criticism of t-values makes little sense from any statistical perspective.
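Under the simplifying assumption that the observed z-score equals the true strength of the signal (i.e., no selection bias), the probability of obtaining significance again in an exact replication can be approximated as follows; this is only a back-of-the-envelope sketch of why z = 4 carries much more evidential weight than z = 2.

```python
from scipy.stats import norm

def approx_replication_prob(z_obs, alpha=0.05):
    """P(an exact replication is significant in the same direction), taking z_obs as the true signal."""
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - z_obs)

for z in (2, 3, 4):
    print(f"z = {z}: approximate replication probability = {approx_replication_prob(z):.2f}")
# z = 2 -> ~.50, z = 3 -> ~.85, z = 4 -> ~.98
```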

It is also a mistake to assume that the observed effect size would have been even larger if not for the burden of measurement error. Intuitions that are appropriate when measurements are precise are sometimes misapplied in noisy and more probabilistic settings.

Once more, these broad claims are false and misleading. Everything else equal, estimates of standardized effect sizes are attenuated by random measurement error and would be larger if a more reliable measure had been used. Once selection for significance is present, it inflates standardized effect size estimates obtained with perfect measures, while for unreliable measures it merely starts to undo the attenuation.

In the end, the authors try to link their discussion of random measurement error to the replication crisis.

The consequences for scientific replication are obvious. Many published effects are overstated and future studies, powered by the expectation that the effects can be replicated, might be destined to fail before they even begin. We would all run faster without a backpack on our backs. But when it comes to surprising research findings from small studies, measurement error (or other uncontrolled variation) should not be invoked automatically to suggest that effects are even larger.

This is confusing. Replicability is a function of power, and power is a function of the population mean difference and the sampling error of the design of a study. Random measurement error increases sampling error, which reduces standardized effect sizes, power, and replicability. As a result, studies with unreliable measures are less likely to produce significant results in original studies and in replication studies.

The only reason for surprising replication failures (e.g., 100% significant original studies and 25% significant replication studies for social psychology; OSC, 2015) is questionable practices that inflate the percentage of significant results in original studies. It is irrelevant whether the original result was produced with a small population mean difference and a reliable measure or with a moderate population mean difference and an unreliable measure. All that matters is the standardized effect size for the measure that was actually used. That is, replicability is the same for a height difference of 1 cm with a perfect measure and a standard deviation of 10 cm and for a height difference of 2 cm with a noisy measure and a standard deviation of 20 cm. However, the chance of obtaining a significant result when the mean difference is 1 cm and the SD is 20 cm is lower, because the noisy measure reduces the standardized effect size to Cohen’s d = 1 cm / 20 cm = 0.05.
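Here is a sketch of the power calculation behind this paragraph (normal approximation, same hypothetical height numbers):

```python
from scipy.stats import norm

def power_two_group(mean_diff, sd, n_total, alpha=0.05):
    """Approximate power of a two-sample comparison with equal group sizes (normal approximation)."""
    se = 2 * sd / n_total ** 0.5          # standard error of the mean difference
    z = mean_diff / se                    # expected z-score (noncentrality)
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - z)

N = 400
print(round(power_two_group(1, 10, N), 2))  # 1 cm, SD 10 cm (d = .10): ~.17
print(round(power_two_group(2, 20, N), 2))  # 2 cm, SD 20 cm (d = .10): ~.17, same power
print(round(power_two_group(1, 20, N), 2))  # 1 cm, SD 20 cm (d = .05): ~.07, noisy measure
```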

Conclusion

Loken and Gelman wrote a very confusing article about measurement error. Although confusion about statistics is the norm among social scientists, it is surprising that a statistician has problems explaining basic statistical concepts and how they relate to the outcome of original and replication studies.

The most probable explanation for the confusion is that the authors seem to believe that the combination of random measurement error and large sampling error creates a novel problem that has been overlooked.

Measurement error and selection bias thus can combine to exacerbate the replication crisis.

In the large-N scenario, adding measurement error will almost always reduce the observed correlation.  Take these scenarios and now add selection on statistical significance… for smaller N, a fraction of the observed effects exceeds the original. 

If researchers focus on getting statistically significant estimates of small effects, using noisy measurements and small samples, then it is likely that the additional sources of variance are already making the t test look strong.

“Of statistically significant effects observed after error, a majority could be greater than in the “ideal” setting when N is small”

The quotes suggest that the authors believe something extraordinary is happening in studies with large random measurement error and small samples.  However, this is not the case. Random measurement error attenuates t-values and selection for significance inflates them and these two effects are independent.  There is no evidence to suggest that random measurement error suddenly inflates effect size estimates in small samples with or without selection for significance.

Recommendations for Researchers 

It is also disconcerting that the authors fail to give recommendations for how researchers can avoid these fallacies, even though such recommendations have been made before and would easily fix the problems associated with the interpretation of effect sizes in studies with noisy measures and small samples.

The main problem in noisy studies is that point estimates of effect sizes are not a meaningful statistic. This is not necessarily a problem. Many exploratory studies in psychology aim to examine whether there is an effect at all and whether this effect is positive or negative. A statistically significant result only allows researchers to infer that a positive or negative effect contributed to the outcome of the study (because the extreme t-value falls into a range of values that are unlikely without an effect). So, conclusions should be limited to a discussion of the sign of the effect.

Unfortunately, psychologists have misinterpreted Jacob Cohen’s work and started to interpret the standardized coefficients, like correlation coefficients or Cohen’s d, that they observed in their samples. To make matters worse, these coefficients are sometimes called observed effect sizes, as in the article by Loken and Gelman.

This might have been a reasonable term for trained statisticians, but for poorly trained psychologists it suggested that this number tells them something about the magnitude of the effect they were studying.  After all, this seems a reasonable interpretation of the term “observed effect size.”  They then used Cohen’s book to interpret these values as evidence that they obtained a small, moderate, or large effect.  In small studies, the effects have to be moderate (2 groups, n = 20, p = .05 => d = .64) to reach significance.

However, Cohen explicitly warned against this use of effect sizes. He developed standardized effect size measures to help researchers plan studies that can provide meaningful tests of hypotheses: a small effect size requires a large sample. If researchers think an effect is small, they shouldn’t run a study with 40 participants, because the study is so noisy that it is likely to fail. Standardized effect sizes were intended to be assumptions about unobservable population parameters, used for planning, not descriptions of observed results.
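Here is a sketch of the kind of planning calculation Cohen recommended; the helper function below is an illustrative normal approximation, not Cohen's own tables or software.

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate sample size per group for a two-sample comparison of a standardized effect d."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.8))  # "large" effect:  ~25 per group
print(n_per_group(0.5))  # "medium" effect: ~63 per group
print(n_per_group(0.2))  # "small" effect:  ~393 per group, far more than n = 20
```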

However, psychologists ignored Cohen’s guidelines for the planning of studies. Instead they used his standardized effect sizes to examine how strong the “observed effects” in their studies were. This misinterpretation of Cohen is partially responsible for the replication crisis because researchers ignored the significance filter and were happy to report that they consistently observed moderate to large effect sizes.

However, they also consistently observed replication failures in their own labs. This was puzzling because moderate to large effects should be easy to replicate. Without training in statistics, social psychologists found an explanation for this variability of observed effect sizes as well: surely, the variability in observed effect sizes (!) from study to study meant that their results were highly dependent on context. I still remember joking with some other social psychologists that effects even depended on the color of research assistants’ shirts. Only after reading Cohen did I understand what was really happening. In studies with large sampling error, the “observed effect sizes” move around a lot because they are not observations of effects. Most of the variation in mean differences from study to study is purely random sampling error.

At the end of his career, Cohen seemed to have lost faith in psychology as a science. He wrote a dark and sarcastic article titled “The Earth Is Round (p < .05).” In this article, he proposes a simple solution for the misinterpretation of “observed effect sizes” in small samples. The abstract of this article is more informative and valuable than Loken and Gelman’s entire article.

Exploratory data analysis and the use of graphic methods, a steady improvement in and a movement toward standardization in measurement, an emphasis on estimating effect sizes using confidence intervals, and the informed use of available statistical methods is suggested. For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication.

The key point is that any sample statistic like an “effect size estimate” (not an observed effect size) has to be considered in the context of the precision of the estimate. Nobody would take a public opinion poll seriously if it was conducted with only 40 respondents and reported 55% support for a candidate, once they were also told that the 95% confidence interval ranges from 40% to 70%.
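The poll numbers can be checked with a simple Wald interval (a rough sketch; the exact percentages in the example are approximate):

```python
from math import sqrt

p_hat, n = 0.55, 40
se = sqrt(p_hat * (1 - p_hat) / n)     # standard error of a sample proportion
print(f"95% CI: {p_hat - 1.96 * se:.0%} to {p_hat + 1.96 * se:.0%}")  # roughly 40% to 70%
```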

The same is true for the tons of articles that reported effect size estimates without confidence intervals. For studies with just significant results this is not a big problem, because significance implies that the confidence interval does not contain the value specified by the null hypothesis, typically zero, and for a just significant result the boundary of the CI is therefore close to zero. So, researchers are justified in interpreting the result as evidence about the sign of an effect, but the effect size remains uncertain. Nobody would rush to buy stock in a drug company if it reported that its new drug extends life expectancy by anywhere from 1 day to 3 years. But if we are misled into focusing on an observed effect size of 1.5 years, we might be foolish enough to invest in the company and lose some money.

In short, noisy studies with unreliable measures and wide confidence intervals cannot be used to make claims about effect sizes.   The reporting of standardized effect size measures can be useful for meta-analysis or to help future research in the planning of their studies, but researchers should never interpret their point estimates as observed effect sizes.

Final Conclusion

Although mathematics and statistics are fundamental to all quantitative, empirical sciences, each scientific discipline has its own history, terminology, and unique challenges. Political science differs from psychology in many ways. On the one hand, political science has access to large representative samples because there is a lot of interest in these kinds of data and a lot of money is spent on collecting them. These data make it possible to obtain relatively precise estimates. The downside is that many data are unique to a historic context. The 2016 election in the United States cannot be replicated.

Psychology is different.  Research budgets and ethics often limit sample sizes.  However, within-subject designs with many repeated measures can increase power, something political scientists cannot do.  In addition, studies in psychology can be replicated because the results are less sensitive to a particular historic context (and yes, there are many replicable findings in psychology that generalize across time and culture).

Gelman knows about as much about psychology as I know about political science. Maybe his article is more useful for political scientists, but psychologists would be better off if they finally recognized the important contributions of one of their own methodologists.

To paraphrase Cohen: Sometimes reading less is more, except for Cohen.


Fritz Strack asks “Have I done something wrong?”

Since 2011, experimental social psychology has been in crisis mode. It has become evident that social psychologists violated some implicit or explicit norms about science. Readers of scientific journals expect that the methods and results sections of a scientific article provide an objective description of the study, data analysis, and results. However, it is now clear that this is rarely the case. Experimental social psychologists have used various questionable research practices to report mostly results that supported their theories. As a result, it is now unclear which published results are replicable and which results are not.

In response to this crisis of confidence, a new generation of social psychologists has started to conduct replication studies. The most informative replication studies are published in a new type of article called a registered replication report (RRR).

What makes RRRs so powerful is that they are not simple replication studies.  An RRR is a collective effort to replicate an original study in multiple labs.  This makes it possible to examine generalizability of results across different populations and it makes it possible to combine the data in a meta-analysis.  The pooling of data across multiple replication studies reduces sampling error and it becomes possible to obtain fairly precise effect size estimates that can be used to provide positive evidence for the absence of an effect.  If the effect size estimate is within a reasonably small interval around zero, the results suggest that the population effect size is so close to zero that it is theoretically irrelevant. In this way, an RRR can have three possible results: (a) it replicates an original result in most of the individual studies (e.g., with 80% power, it would replicate the result in 80% of the replication attempts); (b) it fails to replicate the result in most of the replication attempts (e.g., it replicates the result in only 20% of replication studies), but the effect size in the meta-analysis is significant, or (c) it fails to replicate the original result in most studies and the meta-analytic effect size estimates suggests the effect does not exist.

Another feature of RRRs is that original authors get an opportunity to publish a response. This blog post is about Fritz Strack’s response to the RRR of Strack et al.’s facial feedback study. Strack et al. (1988) reported two studies that suggested that incidental movement of facial muscles influences how amusing participants find cartoons. The article is the second most cited article by Strack and his most cited empirical article. It is therefore likely that Strack cared about the outcome of the replication study.


So, it may have elicited some negative feelings when the results showed that none of the replication studies produced a significant result and the meta-analysis suggested that the original result was a false positive; that is, the population effect size is close to zero and the results of the original studies were merely statistical flukes.

[Figure: meta-analysis of the facial feedback RRR results]

Strack’s Response to the RRR

Strack’s first response to the RRR results was surprise, because numerous replication studies of this fairly famous study had been conducted before, and the published studies typically, but not always, reported successful replications. Any naive reader of the literature, review articles, or textbooks is likely to have the same response. If an article has over 600 citations, this suggests that it made a solid contribution to the literature.

However, social psychology is not a normal psychological science. Even more famous effects like the ego-depletion effect or elderly priming have failed to replicate. A major replication project published in 2015 showed that only 25% of social psychological studies could be replicated (OSC, 2015), and Strack had commented on this result. Thus, I was a bit surprised by Strack’s surprise, because the failure to replicate his results was in line with many other replication failures since 2011.

Despite concerns about the replicability of social psychology, Strack expected a positive result because he had conducted a meta-analysis of 20 studies that had been published in the past five years.

If 20 previous studies successfully replicated the effect and the 17 studies of the RRR all failed to replicate the effect, it suggests the presence of a moderator; that is some variation between these two sets of studies that explains why the nonRRR studies found the effect and the RRR studies did not.

Moderator 1

First, the authors have pointed out that the original study is “commonly discussed in introductory psychology courses and textbooks” (p. 918). Thus, a majority of psychology students was assumed to be familiar with the pen study and its findings.

As the study used deception, it makes sense that the study does not work if students know about the study.  However, this hypothesis assumes that all 17 samples in the RRR were recruited from universities in which the facial feedback hypothesis was taught before they participated in the study.  Moreover, it assumes that none of the samples in the successful nonRRR studies had the same problem.

However, Strack does not compare non-RRR studies to RRR studies. Instead, he focuses on three RRR samples that did not use student samples (Holmes, Lynott, and Wagenmakers). None of the three samples individually shows a significant result. Thus, none of these studies replicated the original findings. Strack conducts a meta-analysis of these three studies and finds an average mean difference of d = .16.

Table 1

Dataset       N    M-teeth  M-lips  SD-teeth  SD-lips  Cohen's d
Holmes        130  4.94     4.79    1.14      1.30     0.12
Lynott         99  4.91     4.71    1.49      1.31     0.14
Wagenmakers   126  4.54     4.18    1.42      1.73     0.23
Pooled        355  4.79     4.55    1.81      2.16     0.12

The standardized effect size of d = .16 is the unweighted average of the three d-values in Table 1. The weighted average is d = .17. However, if the three studies are first combined into a single dataset and the standardized mean difference is computed from the combined dataset, the standardized mean difference is only d = .12.

More importantly, the standard error for the pooled data is 2 / sqrt(355) = 0.106, which means the 95% confidence interval around any of these point estimates extends 0.106 * 1.96 = .21 standard deviations on each side of the point estimate. Even with d = .17, the 95%CI (-.04 to .38) includes 0. At the same time, the effect size in the original study was approximately d ~ .5, suggesting that the original results were extreme outliers or that additional moderating factors account for the discrepancy.
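These numbers can be verified from Table 1 with a few lines of code (a rough sketch; as in the text, the standard error of Cohen's d is approximated as 2/sqrt(N)):

```python
from math import sqrt

# Dataset: N, M-teeth, M-lips, SD-teeth, SD-lips (from Table 1)
studies = {
    "Holmes":      (130, 4.94, 4.79, 1.14, 1.30),
    "Lynott":      ( 99, 4.91, 4.71, 1.49, 1.31),
    "Wagenmakers": (126, 4.54, 4.18, 1.42, 1.73),
}

ds, ns = [], []
for n, m_teeth, m_lips, sd_teeth, sd_lips in studies.values():
    pooled_sd = sqrt((sd_teeth ** 2 + sd_lips ** 2) / 2)
    ds.append((m_teeth - m_lips) / pooled_sd)
    ns.append(n)

unweighted = sum(ds) / len(ds)
weighted = sum(d * n for d, n in zip(ds, ns)) / sum(ns)
se = 2 / sqrt(sum(ns))                      # rough standard error of Cohen's d
print(round(unweighted, 2), round(weighted, 2))                                  # ~.16 and ~.17
print(f"95% CI around d = .17: {.17 - 1.96 * se:.2f} to {.17 + 1.96 * se:.2f}")  # ~ -.04 to .38
```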

Strack does something different. He tests the three proper studies against the average effect size of the “improper” replication studies with student samples. These studies have an average effect size of d = -.03. This analysis shows a significant difference (proper d = .16 vs. improper d = -.03), t(15) = 2.35, p = .033.

This is an interesting pattern of results. The significant moderation effect suggests that facial feedback effects were stronger in the 3 studies identified by Strack than in the remaining 14 studies. At the same time, the average effect size of the three proper replication studies is still not significant, despite a pooled sample size that is three times larger than the sample size in the original study.

One problem for this moderator hypothesis is that the research protocol made it clear that the study had to be conducted before the original study was covered in a course (Simons response to Strack).  However, maybe student samples fail to show the effect for another reason.

The best conclusion that can be drawn from these results is that the effect may be greater than zero, but that the effect size in the original studies was probably inflated.

Cartoons were not funny

Strack’s second concern is weaker and he is violating some basic social psychological rules about argument strength. Adding weak arguments increases persuasiveness if the audience is not very attentive, but they backfire with an attentive audience.

Second, and despite the obtained ratings of funniness, it must be asked if Gary Larson’s The Far Side cartoons that were iconic for the zeitgeist of the 1980s instantiated similar psychological conditions 30 years later. It is indicative that one of the four exclusion criteria was participants’ failure to understand the cartoons.  

One of the exclusion criteria was failure to understand the cartoons, but how many participants were excluded because of this criterion? Without this information, the argument is not relevant; and if only participants who understood the cartoons were included, it is not clear how dated cartoons could explain the replication failure. Moreover, Strack just tried to claim that the proper studies did show the effect, which they could not have shown if the cartoons were not funny. Finally, the means clearly show that participants reported being amused by the cartoons.

Weak arguments like this undermine more reasonable arguments like the previous one.

Using A Camera 

Third, it should be noted that to record their way of holding the pen, the RRR labs deviated from the original study by directing a camera on the participants. Based on results from research on objective self-awareness, a camera induces a subjective self-focus that may interfere with internal experiences and suppress emotional responses.

This is a serious problem and in hindsight the authors of the RRR are probably regretting the decision to use cameras.  A recent article actually manipulated the presence or absence of a camera and found stronger effects without a camera, although the predicted interaction effect was not significant. Nevertheless, the study suggested that creating self-awareness with a camera could be a moderator and might explain why the RRR studies failed to replicate the original effect.

Reverse p-hacking 

Strack also noticed a statistical anomaly in the data. When he correlated the standardized mean differences (d-values) with the sample sizes, a non-significant positive correlation emerged, r(17) = .45, p = .069.   This correlation shows a statistical trend for larger samples to produce larger effect sizes.  Even if this correlation were significant, it is not clear what conclusions should be drawn from this observation.  Moreover, earlier Strack distinguished student samples and non-student samples as a potential moderator.  It makes sense to include this moderator in the analysis because it could be confounded with sample size.

A regression analysis shows that the effect of proper sample is no longer significant, t(14) = 1.75, and the effect of sampling error (1/sqrt(N)) is also not significant, t(14) = 0.59.

This new analysis suggests that sample size is not a predictor of effect sizes, which makes sense because there is no reasonable explanation for such a positive correlation.

However, now Strack makes a mistake by implying that the weaker effect sizes in smaller samples could be a sign of  “reverse p-hacking.”

Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.

The rhetorical vehicle of “Without insinuating” can be used to backtrack from the insinuation that was never made, but Strack is well aware of priming research and the ironic effect of instructions “not to think about a white bear”  (you probably didn’t think about one until now and nobody was thinking about reverse p-hacking until Strack mentioned it).

Everybody understood what he was implying (reverse p-hacking = intentional manipulations of the data to make significance disappear and discredit original researchers in order to become famous with failures to replicate famous studies) (Crystal Prison Zone blog; No Hesitations blog).

[Image: Twitter reactions to the reverse p-hacking remark]

The main mistake that Strack makes is this: a negative correlation between sample size and effect size can suggest that researchers were p-hacking (inflating effect sizes in small samples to get significance), but a positive correlation does not imply reverse p-hacking (making significant results disappear). Reverse p-hacking also implies a negative trend line, in which researchers with larger samples like Wagenmakers (the 2nd largest sample in the set), who love to find non-significant results that Bayesian statistics treat as evidence for the absence of an effect, would have to report smaller effect sizes to avoid significance or Bayes factors in favor of an effect.

So, here is Strack’s fundamental error.   He correctly assumes that p-hacking results in a negative correlation, but he falsely assumes that reverse p-hacking would produce a positive correlation and then treats a positive correlation as evidence for reverse p-hacking. This is absurd and the only way to backtrack from this faulty argument is to use the “without insinuating” hedge (“I never said that this correlation implies reverse p-hacking, in fact I made clear that I am not insinuating any of this.”)

Questionable Research Practices as Moderator 

Although Strack mentions two plausible moderators (student samples, camera), there are other possible moderators that could explain the discrepancies between his original results and the RRR results.  One plausible moderator that Strack does not mention is that the original results were obtained with questionable research practices.

Questionable research practices (QRPs) is a broad term for a variety of practices that undermine the credibility of published results (John et al., 2012), including fraud.

To be clear, I am not insinuating that Strack fabricated data, and I have said so in public before. Let me be absolutely clear, because the phrasing I used in the previous sentence is a stab at Strack’s reverse p-hacking quote, which may be understood as implying exactly what is not being insinuated. I positively do not believe that Fritz Strack faked data for the 1988 article or any other article.

One reason for my belief is that I don’t think anybody would fake data that produce only marginally significant results that some editors might reject as insufficient evidence. If you fake your data, why fake p = .06, if you can fake p = .04?

If you take a look at the Figure above, you see that the 95%CI of the original study includes zero.  That shows that the original study did not show a statistically significant result.  However, Strack et al. (1988) used a more liberal criterion of .10 (two-tailed) or .05 (one-tailed) to test significance and with this criterion the results were significant.

The problem is that it is unlikely for two independent studies to produce two marginally significant results in a row.  This is either an unlikely fluke or some other questionable research practices were used to get these results. So, yes, I am insinuating that questionable research practices may have inflated the effect sizes in Strack et al.’s studies and that this explains at least partially why the replication study failed.

It is important to point out that in 1988, questionable research practices were not considered questionable. In fact, experimental social psychologists were trained and trained their students to use these practices to get significance (Stangor, 2012). Questionable research practices also explain why the reproducibility project could only replicate 25% of published results in social and personality psychology (OSC, 2015). Thus, it is plausible that QRPs also contributed to the discrepancy between Strack et al.’s original studies and the RRR results.

The OSC reproducibility project estimated that QRPs inflate effect sizes by 100%. Thus, an inflated effect size of d = .5 in Strack et al. (1988) might correspond to a true effect size of d = .25 (.25 real + .25 inflation = .50 observed). Moreover, the inflation increases as p-values get closer to .05. Thus, for a marginally significant result, inflation is likely to be greater than 100% and the true effect size is likely to be even smaller than .25. This suggests that the true effect size could be around d = .2, which is close to the effect size estimate of the “proper” RRR studies identified by Strack.

A meta-analysis of facial feedback studies also produced an average effect size estimate of d = .2, but this estimate is not corrected for publication bias, even though the meta-analysis showed evidence of publication bias. Thus, the true average effect size is likely to be lower than .2 standard deviations. Given the heterogeneity of studies in this meta-analysis, it is difficult to know which specific paradigms are responsible for this average effect size and could be successfully replicated (Coles & Larsen, 2017). The reason is that existing studies are severely underpowered to reliably detect effects of this magnitude (Replicability Report of Facial Feedback studies).

The existing evidence suggests that effect sizes of facial feedback studies are somewhere between 0 and .5, but it is impossible to say whether it is just slightly above zero with no practical significance or whether it is of sufficient magnitude so that a proper set of studies can reliably replicate the effect.  In short, 30 years and over 600 citations later, it is still unclear whether facial feedback effects exist and under which conditions this effect can be observed in the laboratory.

Did Fritz Strack Use Questionable Research Practices?

It is difficult to demonstrate conclusively that a researcher used QRPs based on a couple of statistical results, but it is possible to examine this question with larger sets of data. Brunner & Schimmack (2018)  developed a method, z-curve, that can reveal the use of QRPs for large sets of statistical tests.  To obtain a large set of studies, I used automatic text extraction of test statistics from articles co-authored by Fritz Strack.  I then applied z-curve to the data.  I limited the analysis to studies before 2010 when social psychologists did not consider QRPs problematic, but rather a form of creativity and exploration.   Of course, I view QRPs differently, but the question whether QRPs are wrong or not is independent of the question whether QRPs were used.

[Figure: z-curve of test statistics extracted from Strack's articles]

This figure (see detailed explanation here) shows the strength of evidence (based on test statistics like t- and F-values converted into z-scores) in Strack’s articles. The histogram shows a mode at 2, which is just significant (z = 1.96 ~ p = .05, two-tailed). The high bar of z-scores between 1.8 and 2 shows marginally significant results, as does the next bar with z-scores between 1.6 and 1.8 (1.65 ~ p = .05, one-tailed). The drop from 2 to 1.6 is too steep to be compatible with sampling error.
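Z-curve itself is Brunner and Schimmack's method and is not reimplemented here; what can be sketched is the standard conversion of reported test statistics into the z-scores that serve as its input (the example values below are illustrative, not data extracted from Strack's articles):

```python
from scipy import stats

def z_from_t(t, df):
    """Convert a reported t-value into a two-sided z-score via its p-value."""
    p = 2 * stats.t.sf(abs(t), df)
    return stats.norm.isf(p / 2)

def z_from_F(F, df1, df2):
    """Convert a reported F-value into a two-sided z-score via its p-value."""
    p = stats.f.sf(F, df1, df2)
    return stats.norm.isf(p / 2)

print(round(z_from_t(2.02, 38), 2))     # ≈ 1.96: just significant (p ≈ .05, two-tailed)
print(round(z_from_t(1.69, 38), 2))     # ≈ 1.65: marginal (p ≈ .10 two-tailed, .05 one-tailed)
print(round(z_from_F(4.00, 1, 80), 2))  # ≈ 1.97: an F-test just past the .05 threshold
```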

The grey line provides a rough estimate of the expected proportion of non-significant results. The so-called file drawer (non-significant results that are not reported) is very large, and it is unlikely that so many studies were attempted and failed. For example, it is unlikely that Strack conducted 10 more studies with N = 100 participants and did not report the results of these studies because they produced non-significant results. Thus, it is likely that other QRPs were used that helped to produce significance. It is impossible to say how the significant results were produced, but the distribution of z-scores strongly suggests that QRPs were used.

The z-curve method provides an estimate of the average power of significant results. The estimate is 63%. This value means that a randomly drawn significant result from one of Strack’s articles has a 63% probability of producing a significant result again in an exact replication study with the same sample size.

This value can be compared with estimates for social psychology using the same method. For example, the average for the Journal of Experimental Social Psychology from 2010-2015 is 59%.  Thus, the estimate for Strack is consistent with general research practices in his field.

One caveat with the z-curve estimate of 63% is that the dataset includes theoretically important and less important tests.  Often the theoretically important tests have lower z-scores and the average power estimates for so-called focal tests are about 20-30 percentage points lower than the estimate for all tests.

A second caveat is that there is heterogeneity in power across studies. Studies with high power are more likely to produce really small p-values and larger z-scores. This is reflected in the estimates below the x-axis for different segments of studies.  The average for studies with just significant results (z = 2 to 2.5) is only 45%.  The estimate for marginally significant results is even lower.

As 50% power corresponds to the criterion for significance, it is possible to use z-curve as a way to adjust p-values for the inflation that is introduced by QRPs. Accordingly, only z-scores greater than 2.5 (~ p = .01) would be significant at the 5% level after correcting for QRPs. However, once QRPs are being used, bias-corrected values are just estimates, and validation with actual replication studies is needed. Using this correction, Strack’s original results would not even meet the weak standard of marginal significance.

Final Words

In conclusion, 30 years of research have failed to provide conclusive evidence for or against the facial feedback hypothesis. The lack of progress is largely due to a flawed use of the scientific method.  As journals published only successful outcomes, empirical studies failed to provide empirical evidence for or against a hypothesis derived from a theory.  Social psychologists only recognized this problem in 2011, when Bem was able to provide evidence for an incredible phenomenon of erotic time travel.

There is no simple answer to Fritz Strack’s question “Have I done something wrong?” because there is no objective standard to answer this question.

Did Fritz Strack fail to produce empirical evidence for the facial feedback hypothesis because he used the scientific method wrong?  I believe the answer is yes.

Did Fritz Strack do something morally wrong in doing so? I think the answer is no. He was trained and trained students in the use of a faulty method.

A more difficult question is whether Fritz Strack did something wrong in his response to the replication crisis and the results of the RRR.

We can all make honest mistakes, and it is possible that I made some honest mistakes when I wrote this blog post. Science is hard and it is unavoidable to make some mistakes. As the German saying goes (Wo gehobelt wird, fallen Späne), which is equivalent to saying “you can’t make an omelette without breaking some eggs.”

Germans also have a saying that translates to “those who work a lot make a lot of mistakes; those who do nothing make no mistakes.” Clearly, Fritz Strack did a lot for social psychology, so it is only natural that he also made mistakes. The question is how scientists respond to criticism and the discovery of mistakes by other scientists.

The very famous social psychologist Susan Fiske (2015) encouraged her colleagues to welcome humiliation. This seems a bit much, especially for Germans, who love perfection and hate making mistakes. However, the danger of a perfectionistic ideal is that criticism can be interpreted as a personal attack, with unhealthy consequences. Nobody is perfect, and the best way to deal with mistakes is to admit them.

Unfortunately, many eminent social psychologists seem to be unable to admit that they used QRPs and that replication failures of some of their famous findings are to be expected.  It doesn’t require rocket science to realize that p-hacked results do not replicate without p-hacking. So, why is it so hard to admit the truth that everybody knows anyways?

It seems to be human nature to cover up mistakes. Maybe this is an embodied reaction to shame, like trying to cover up when a stranger sees us naked. However, this natural response typically makes things worse, and it is better to override it to avoid even more severe consequences. A good example is Donald Trump. Surely, having sex with a pornstar is questionable behavior for a married man, but this is no longer the problem for Donald Trump. His presidency may end early not because he had sex with Stormy Daniels, but because he lied about it. As the saying goes, the cover-up is often worse than the crime. Maybe there is even a social psychological experiment to prove it, p = .04.

P.S. There is also a difference between not doing something wrong and doing something right. Fritz, you can still do the right thing and retract your questionable statement about reverse p-hacking and help new generations avoid some of the mistakes of the past.


Estimating Reproducibility of Psychology (No. 111): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology. However, there have been few detailed examinations of individual studies to examine why a particular result could be replicated or not. The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors. Each study may have a different explanation. This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original article. These predictions will only be accurate if the replication studies were close replications of the original study. Otherwise, differences between the original study and the replication study may explain why replication studies failed.

Summary of Original Article

The article examines anchoring effects.  When individuals are uncertain about a quantity (the price of a house, the height of Mount Everest), their estimates can be influenced by some arbitrary prior number.  The anchoring effect is a robust phenomenon that has been replicated in one of the first large multi-lab replication projects (Klein et al., 2014).

The article, titled "Precision of the Anchor Influences the Amount of Adjustment," tested in six studies the hypothesis that the anchoring effect is larger (i.e., adjustment away from the anchor is smaller) when the anchor is precise than when it is rounded.

[Figure: Janiszewski.png]

Study 1

43 students participated in this study, which manipulated the type of anchor between subjects (rounded, precise over, precise under; n = 14 per cell).

The main effect of the manipulation was significant, F(2,40) = 10.94.

Study 2

85 students participated in this study.  It manipulated the type of anchor (rounded vs. precise under) and the range of plausible values (narrow vs. broad).

The study replicated a main effect of anchor, F(1,81) = 22.23.

Study 3

45 students participated in this study.

Study 3 added a condition with information that made the rounded anchor more credible.

The results were significant, F(2,42) = 23.07.  A follow up test showed that participants continued to be more influenced by a precise anchor than by a rounded anchor even with additional information that the rounded anchor was credible, F(1, 42) = 20.80.

Study 4a 

This study was picked for the replication attempt.

As the motivation to adjust increases and the number of units of adjustment increases correspondingly, the amount of adjustment on the coarse-resolution scale should increase at a faster rate than the amount of adjustment on the fine-resolution scale (i.e., motivation to adjust and scale resolution should interact).

The high-motivation-to-adjust condition was created by removing information from the scenarios used in Experiment 2 (the scenarios from Experiment 2 were used without alteration in the low-motivation-to-adjust condition). For example, sentences in the plasma-TV scenario that encouraged a slight adjustment ("items are priced very close to their actual cost . . . actual cost would be only slightly less than $5,000") were replaced with a sentence that encouraged more adjustment ("What is your estimate of the TV's actual cost?").

The width of the scale unit was manipulated with the precision of the anchor (i.e., rounded anchor for broad width and precise anchor for narrow width).

Study 4a had 59 participants.

Study 4a was similar to Study 1, with the added manipulation of scale-unit width described above (i.e., rounded anchor for broad width and precise anchor for narrow width).

Study 4a showed an interaction between the motivation to adjust condition and the Scale Width manipulation, F(1, 55) = 6.88.

Study 4b

Study 4b  had 149 participants and also showed a significant result, F(1,145) = 4.01, p = .047.

Study 5 

Study 5 used home-sales data for 12,581 home sales.  The study found a significant effect of list-price precision on the sales price, F(1, 12577) = 23.88, with list price as a covariate.

Summary

In conclusion, all of the results showed strong statistical evidence against the null hypothesis except for the pair of Studies 4a and 4b.  It is remarkable that the close replication Study 4b produced only a just significant result (p = .047) despite a much larger sample than Study 4a (149 vs. 59 participants).  This pattern of results suggests that the sample size is not independent of the result and that the evidence for this effect could be exaggerated by the use of optional stopping (collecting more data until p < .05).
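
To make the comparison of evidential strength concrete, the reported F-tests can be converted into p-values and equivalent two-sided z-scores. The snippet below is a minimal sketch (my own illustration, not part of the original article or the replication report); it only uses the F-values and degrees of freedom reported above, and it shows that Studies 4a and 4b provide much weaker evidence than the other studies.

```python
# Minimal sketch (my own illustration, not from the original article or the
# replication report): convert the reported F-tests into p-values and
# equivalent two-sided z-scores to compare the strength of evidence.
from scipy import stats

tests = {                       # F, df1, df2 as reported above
    "Study 1":  (10.94, 2, 40),
    "Study 2":  (22.23, 1, 81),
    "Study 3":  (23.07, 2, 42),
    "Study 4a": (6.88, 1, 55),
    "Study 4b": (4.01, 1, 145),
    "Study 5":  (23.88, 1, 12577),
}

for study, (F, df1, df2) in tests.items():
    p = stats.f.sf(F, df1, df2)   # right-tail p-value of the F-test
    z = stats.norm.isf(p / 2)     # z-score for a two-sided p-value of the same size
    print(f"{study}: F({df1}, {df2}) = {F:.2f}, p = {p:.4g}, z = {z:.2f}")
```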

Replication Study

Study 5 was not considered for the replication because it was not an experiment.  Study 4a was chosen over Study 4b because it used a more direct manipulation of motivation.

The replication report states the goal of the replication study as replicating two main effects and the interaction.

The results show a main effect of the motivation manipulation, F(1,116) = 71.06, a main effect of anchor precision, F(1,116) = 6.28, but no significant interaction, F < 1.

The data form shows the interaction effect as the main result in the original study, F(1, 55) = 6.88, but the effect is miscoded as a main effect.  The replication result is entered as the main effect for the anchor precision manipulation, F(1, 116) = 6.28 and this significant result is scored as a successful replication of the original study.

However, the key finding in the original article was the interaction effect.  The original article does not report statistical tests of the main effects.

In Experiment 4a, there was a Motivation to Adjust x Scale Unit Width interaction, F(1, 55) = 6.88, prep = .947, omega2 = .02. The difference in the amount of adjustment between the rounded and precise-anchor conditions increased as the motivation to adjust went from low (Mprecise = -0.76, Mrounded = -0.23, Mdifference = 0.53), F(1, 55) = 15.76, prep = .994, omega2 = .06, to high (Mprecise = -0.04, Mrounded = 0.98, Mdifference = 1.02), F(1, 55) = 60.55, prep = .996, omega2 = .25.

This leads me to the conclusion that scoring this study as a successful replication was a coding mistake. The critical interaction effect was not replicated.


Estimating Reproducibility of Psychology (No. 136): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to understand why a particular result did or did not replicate.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles.  These predictions will only be accurate if the replication studies were close replications of the original studies.  Otherwise, differences between the original study and the replication study may explain why a replication study failed.

Special Introduction

The authors of this article are prominent figures in the replication crisis of social psychology.  Kathleen Vohs was co-author of a highly criticized article that suggested will-power depends on blood-glucose levels. The evidence supporting this claim has been challenged for methodological and statistical reasons (Kurzban, 2010; Schimmack, 2012).  She also co-authored numerous articles on ego-depletion that are difficult to replicate (Schimmack, 2016; Inzlicht, 2016).  In a response with Roy Baumeister titled "A misguided effort with elusive implications," she dismissed these problems, but her own replication project produced very similar results (SPSP, 2018). Some of her social priming studies also failed to replicate (Vadillo, Hardwicke, & Shanks, 2016).

[Figure: z-curve plot for Kathleen Vohs]

The z-curve plot for Vohs shows clear evidence that her articles contain too many significant results (75% significant results, including marginally significant ones, with only 55% average power).  The average probability of successfully replicating a randomly drawn finding from Vohs's articles is 55%. However, this average is obtained with substantial heterogeneity.  Just significant z-scores (2 to 2.5) have an average estimated replicability of only 33%. Even z-scores in the range from 2.5 to 3 have an average replicability of only 40%.   This suggests that p-values in the range between .05 and .005 are unlikely to replicate in exact replication studies.
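
To see why just significant z-scores imply low replicability, consider a naive calculation that ignores selection for significance: if an observed z-score equaled the true signal-to-noise ratio, the probability of obtaining p < .05 again in an exact replication would simply be the power of that replication. The sketch below is my own illustration (it is not the z-curve method, and the function name is made up). Because selection for significance inflates observed z-scores, actual replicability is lower than these naive values, which is why z-curve's estimates for z-scores between 2 and 3 are only 33% to 40%.

```python
# Minimal sketch (my own illustration, not the z-curve method): if the observed
# z-score equaled the true signal-to-noise ratio, the probability of obtaining
# p < .05 again in an exact replication would be the power of that replication.
from scipy.stats import norm

def naive_replication_probability(z_observed, alpha=0.05):
    z_crit = norm.isf(alpha / 2)           # 1.96 for alpha = .05
    return norm.sf(z_crit - z_observed)    # chance of another significant result (ignoring sign flips)

for z in (2.0, 2.5, 3.0, 4.0):
    print(f"z = {z:.1f} -> naive replication probability = {naive_replication_probability(z):.2f}")
# about .52 for z = 2, .71 for z = 2.5, .85 for z = 3, .98 for z = 4
```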

In a controversial article with the title "The Truth is Wearing Off," Jonathan Schooler even predicted that replication studies might often fail.  The article was controversial because Schooler suggested that all effects diminish over time (I wish this were true for the effect of eating chocolate on weight, but so far it hasn't happened).  Schooler is also known for an influential article about "verbal overshadowing" in eyewitness identifications.  Francis (2012) demonstrated that the published results were too good to be true, and the first Registered Replication Report failed to replicate one of the five studies and replicated another one only with a much smaller effect size.

[Figure: z-curve plot for Jonathan Schooler]

The z-curve plot for Schooler looks very different. The average estimated power is higher.  However, there is a drop at z = 2.6 that is difficult to explain with a normal sampling distribution.

Based on this context information, predictions about replicability depend on the p-values of the actual studies.  Just significant p-values are unlikely to replicate, but results with smaller p-values (larger z-scores) might replicate.

Summary of Original Article

The article examines moral behavior.  The main hypothesis is that beliefs about free will vs. determinism influence cheating.  Whereas belief in free will encourages moral behavior,  beliefs that behavior is determined make it easier to cheat.

Study 1

30 students were randomly assigned to one of two conditions (n = 15).

Participants in the anti-free-will condition read a passage written by Francis Crick, a Nobel Laureate, suggesting that free will is an illusion.  In the control condition, they read about consciousness.

Then participants were asked to work on math problems on a computer. They were given a cover story that the computer program had a glitch and would present the correct answers, but they could fix this problem by pressing the space bar as soon as the question appeared.  They were asked to do so and to try to solve the problems on their own.

The article does not mention whether participants were probed for suspicion or whether data from all participants were included in the analysis.

The main finding was that participants cheated more in the “no-free-will” condition than in the control condition, t(28) = 3.04, p = .005.

Study 2

Study 2 addressed several limitations of Study 1. Although the sample size was larger, the design included 5 conditions (n = 24/25 per condition).

The main dependent variable was the number of correct answers on 15 reading comprehension, mathematical, and logic problems that were used by Vohs in a previous study (Schmeichel, Vohs, & Baumeister, 2003).  For each correct answer, participants received $1.

Two conditions manipulated free-will beliefs, but participants could not cheat. The comparison of these two conditions shows whether the manipulation influences actual performance, but there was no major difference (based on the figure: $7.50 in the control condition vs. $7.00 in the no-free-will condition).

In the cheating conditions, the experimenter received a fake phone call, announced that he or she had to leave, and instructed participants to continue on their own, score their own answers, and pay themselves.  Surprisingly, neither the free-will nor the neutral condition showed any signs of cheating ($7.20 and $7.30, respectively).  However, the determinism condition increased the average pay-out to $10.50.

One problem for the statistical analysis is that the researchers “did not have participants’ answer sheets in the three self-paid conditions; therefore, we divided the number of $1 coins taken by each group by the number of group members to arrive at an average self-payment” (p. 52).

The authors then report a significant ANOVA result, F(4, 114) = 5.68, p = .0003.

However, without information about the variability within each cell, it is not possible to compute an analysis of variance.  This part of the analysis is not explained in the article.
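
To illustrate why within-cell variability is indispensable, the sketch below computes a one-way ANOVA F-statistic from cell means, standard deviations, and cell sizes. The function is my own illustration; the cell means are the reported average self-payments and the cell sizes match the reported n = 24/25, but the standard deviations are hypothetical, because the article does not report them, which is exactly the problem.

```python
# Minimal sketch (my own illustration): a one-way ANOVA F-statistic cannot be
# computed from cell means and cell sizes alone; it also needs the within-cell
# variability that the authors did not have for the self-paid conditions.
import numpy as np

def one_way_anova_from_summary(means, sds, ns):
    """F-statistic from cell means, within-cell standard deviations, and cell sizes."""
    means, sds, ns = (np.asarray(x, dtype=float) for x in (means, sds, ns))
    grand_mean = np.sum(ns * means) / np.sum(ns)
    ss_between = np.sum(ns * (means - grand_mean) ** 2)
    ss_within = np.sum((ns - 1) * sds ** 2)            # requires the cell SDs
    df_between = len(means) - 1
    df_within = np.sum(ns) - len(means)
    return (ss_between / df_between) / (ss_within / df_within)

# Means are the reported average self-payments; the SDs are hypothetical,
# because the article does not report them for these conditions.
print(one_way_anova_from_summary(means=[7.2, 7.3, 10.5], sds=[2.0, 2.0, 2.0], ns=[24, 24, 25]))
```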

Replication 

The replication team also had some problems with Study 2.

We originally intended to carry out Study 2, following the Reproducibility Project’s system of working from the back of an article. However, on corresponding with the authors we discovered that it had arisen in post-publication correspondence about analytic methods that the actual effect size found was smaller than reported, although the overall conclusion remained the same. 

As a result, they decided to replicate Study 1.

The sample size of the replication study was nearly twice as large as the sample of the original study (N = 58 vs. 30).

The results did not replicate the significant result of the original study, t(56) = 0.77, p = .44.

Conclusion

Study 1 was underpowered.  Even nearly doubling the sample size was not sufficient to obtain significance in the replication study.   Study 2 had a stronger design, but problems with its reporting and analysis led the replication team to replicate Study 1 instead.
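
A rough power calculation illustrates what "underpowered" means here. The sketch below is my own illustration under stated assumptions: it converts the original result, t(28) = 3.04 with n = 15 per cell, into a standardized effect size (d of about 1.1) and asks what power the replication (about 29 participants per cell) would have had if the true effect were that large, or only half as large, as inflation by selection for significance would suggest.

```python
# Minimal sketch (my own assumptions, not from the replication report): power of
# the replication of Study 1 under different assumptions about the true effect.
import numpy as np
from scipy import stats

def two_sample_power(d, n_per_group, alpha=0.05):
    """Power of a two-sided, two-sample t-test for a standardized effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)      # noncentrality parameter
    t_crit = stats.t.isf(alpha / 2, df)
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

d_observed = 3.04 * np.sqrt(2 / 15)         # d implied by t(28) = 3.04 with n = 15 per cell
print(f"observed d in Study 1: {d_observed:.2f}")
print(f"replication power at the observed d: {two_sample_power(d_observed, 29):.2f}")
print(f"replication power at half the observed d: {two_sample_power(d_observed / 2, 29):.2f}")
```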


Estimating Reproducibility of Psychology (No. 124): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to understand why a particular result did or did not replicate.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles.  These predictions will only be accurate if the replication studies were close replications of the original studies.  Otherwise, differences between the original study and the replication study may explain why a replication study failed.

Summary of Original Article

The article “Loving Those Who Justify Inequality: The Effects of System Threat on Attraction to Women Who Embody Benevolent Sexist Ideals”  is a Short Report in the journal Psychological Science.  The single study article is based on Study 3 of a doctoral dissertation supervised by the senior author Steven J. Spencer.

[Figure: Spencer.png]

The article has been cited 32 times and has not been cited in 2017 (but has one citation in 2018 so far).

Study 

The authors aim to provide further evidence for system-justification theory (Jost, Banaji, & Nosek, 2004).  A standard experimental paradigm is to experimentally manipulate beliefs in the fairness of the existing political system.  According to the theory, individuals are motivated to maintain positive views of the current system and will respond to threats to this belief in a defensive manner.

In this specific study, the authors predicted that male participants whose faith in the political system was threatened would show greater romantic interest in women who embody benevolent sexist ideals than in women who do not embody these ideals.

The design of the study is a classic 2 x 2 design with system threat as between-subject factor and type of women (embody benevolent sexist ideals or not) as within-subject factor.

Stimuli were fake dating profiles.  Dating profiles of women who embody benevolent sexist ideals were based on the three dimensions of benevolent sexism: vulnerable, pure, and ideal for making a man feel complete (Glick & Fiske, 1996). The other women were described as career oriented, party seeking, active in social causes, or athletic.

A total of 36 male students participated in the study.

The article reports a significant interaction effect, F(1, 34) = 5.89.  This interaction effect was due to a significant difference between the two groups in ratings of women who embody benevolent sexist ideals, F(1, 34) = 4.53.

Replication Study 

The replication study was conducted in Germany.

It failed to replicate the significant interaction effect, F(1,68) = 0.08, p = .79.

Conclusion

The sample size of the original study was very small and the result was just significant.  It is not surprising that a replication study failed to replicate this just significant result, despite a larger sample size (error df = 68 vs. 34).


Estimating Reproducibility of Psychology (No. 61): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to understand why a particular result did or did not replicate.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles.  These predictions will only be accurate if the replication studies were close replications of the original studies.  Otherwise, differences between the original study and the replication study may explain why a replication study failed.

Summary of Original Article 

The article “Poignancy: Mixed emotional experience in the face of meaningful endings” was published in the Journal of Personality and Social Psychology.  The senior author is Laura L. Carstensen, who is best known for her socioemotional selectivity theory (Carstensen, 1999, American Psychologist).  This article has been cited (only) 83 times and is only #43 in the top cited articles of Laura Carstensen, although it contributes to her current H-Index of 49.

[Figure: Carstensen.png]

The main hypothesis is derived from Carstensen’s socioemotional selectivity theory.  The prediction is that endings (e.g., of student life, of life in general) elicit more mixed emotions.  This hypothesis was tested in two experiments.

Study 1

60 young (~ 20 years) and 60 older (~ 80 years) adults participated in Study 1.  The experimental procedure was a guided imagery task to evoke emotions.   In one condition, participants were asked to imagine being in their favorite location in four months' time.  In the other condition, they were given the same instruction but were also told to imagine that this would be the last time they could visit this location.  The dependent variables were intensity ratings on an emotion questionnaire, on a scale from 0 = not at all to 7 = extremely.

The intensity of mixed feelings was assessed by taking the minimum value of a positive and a negative emotion (Schimmack, 2001).
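
As a minimal sketch of this scoring rule (my own illustration; the function name and the example ratings are made up):

```python
# Minimal sketch of the minimum (MIN) index used to score mixed feelings:
# the intensity of a mixed feeling is the smaller of the two component intensities.
def min_index(positive_intensity, negative_intensity):
    """Minimum (MIN) index: intensity of mixed feelings for one rating occasion."""
    return min(positive_intensity, negative_intensity)

# Hypothetical ratings on the study's 0 (not at all) to 7 (extremely) scale:
print(min_index(positive_intensity=6, negative_intensity=3))  # -> 3
```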

The analysis showed no age main effect or interactions and no differences in two control conditions.  For the critical imagery condition,  intensity of mixed feelings was higher in the last-time condition (M ~ 3.6, SD ~ 2.3) than in the next-visit condition (M ~ 2, SD ~ 2.3), d ~ .7,  t(118) ~ 3.77.
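
As a quick consistency check (my own illustration, using the approximate means and standard deviations given above and assuming 60 participants per condition, which is consistent with df = 118), the reported effect size and t-value follow directly from the summary statistics:

```python
# Minimal sketch (my own consistency check, using the approximate summary
# statistics given above and assuming 60 participants per condition).
import numpy as np

m_last, m_next, sd, n = 3.6, 2.0, 2.3, 60        # approximate means, common SD, n per condition
d = (m_last - m_next) / sd                       # standardized mean difference (Cohen's d)
t = d / np.sqrt(2 / n)                           # two-sample t with equal n and equal SDs
print(f"d = {d:.2f}, t({2 * n - 2}) = {t:.2f}")  # -> d = 0.70, t(118) = 3.81
```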

Study 2

Study 2 examined mixed feelings in the context of a naturalistic event.  It extended a previous study by Larsen, McGraw, and Cacioppo (2001) that demonstrated mixed feelings on graduation day.  Study 2 aimed to replicate and extend this finding.  To extend the finding, the authors added an experimental manipulation that either emphasized the ending of participants' time at university or did not.

110 students participated in the study.

In the control condition (N = 59), participants were given the following instructions: “Keeping in mind your current experiences, please rate the degree to which you feel each of the following emotions,” and were then presented with the list of 19 emotions. In the limited-time condition (n = 51), in which emphasis was placed on the ending that they were experiencing, participants were given the following instructions: “As a graduating senior, today is the last day that you will be a student at Stanford. Keeping that in mind, please rate the degree to which you feel each of the following emotions,”

The key finding was significantly higher mixed emotions in the limited-time condition than in the control condition, t(108) = 2.34, p = .021.

Replication Study

Recruiting participants on graduation day is not easy.  The replication study recruited participants over a 3-year period to achieve a sample size of N = 222 participants, more than double the sample size of the original study (2012 N = 95; 2013 N = 78; 2014 N = 49).

Despite the larger sample size, the study failed to replicate the effect of the experimental manipulation, t(220) = 0.07, p = .94.

Conclusion

While reports of mixed feelings in conflicting situations are a robust phenomenon (Study 1), experimental manipulations of the intensity of mixed feelings are relatively rare. The key novel contribution of Study 2 was the demonstration that focusing on the ending of an event increases sadness and mixed feelings. However, the evidence for this effect was weak and could not be replicated in a larger sample. In combination, the evidence does not suggest that this is an effective way to manipulate the intensity of mixed feelings.


The original article summarized the findings as follows: "In Study 1, participants repeatedly imagined being in a meaningful location. Participants in the experimental condition imagined being in the meaningful location for the final time. Only participants who imagined 'last times' at meaningful locations experienced more mixed emotions. In Study 2, college seniors reported their emotions on graduation day. Mixed emotions were higher when participants were reminded of the ending that they were experiencing. Findings suggest that poignancy is an emotional experience associated with meaningful endings."

Estimating Reproducibility of Psychology (No. 165): An Open Post-Publication Peer-Review

Introduction

In 2015, Science published the results of the first empirical attempt to estimate the reproducibility of psychology.   One key finding was that out of 97 attempts to reproduce a significant result, only 36% of attempts succeeded.

This finding fueled debates about a replication crisis in psychology.  However, there have been few detailed examinations of individual studies to understand why a particular result did or did not replicate.  The main reason is probably that it is a daunting task to conduct detailed examinations of all studies. Another reason is that replication failures can be caused by several factors, and each study may have a different explanation.  This means it is important to take an idiographic (study-centered) perspective.

The conclusions of these idiographic reviews will be used for a nomothetic research project that aims to predict actual replication outcomes on the basis of the statistical results reported in the original articles.  These predictions will only be accurate if the replication studies were close replications of the original studies.  Otherwise, differences between the original study and the replication study may explain why a replication study failed.

Summary of Original Article 

The article “The Value Heuristic in Judgments of Relative Frequency” was published as a Short Report in Psychological Science.   The article has been cited only 25 times overall and was not cited at all in 2017.

[Figure: Dai.WoS (Web of Science citation record)]

The authors suggest that they have identified a new process by which people judge the relative frequency of objects.

Estimating the relative frequency of a class of objects or events is fundamental in subjective probability assessments and decision making (Estes, 1976), and research has long shown that people rely on heuristics for making these judgments (Gilovich, Griffin, & Kahneman, 2002). In this report, we identify a novel heuristic for making these judgments, the value heuristic: People judge the frequency of a class of objects on the basis of the subjective value of the objects.

As my dissertation was about frequency judgments of emotions, I am familiar with the frequency estimation literature, especially the estimation of valued objects like positive and negative emotions.  My reading of the literature suggests that this hypothesis is inconsistent with prior research because frequency judgments are often made on the basis of a fast, automatic, parallel search of episodic memory (e.g., Hintzman, 1988). Thus, value might only indirectly influence frequency estimates if it influences the accessibility of exemplars.

The authors present a single experiment to support their hypothesis.

Experiment

68 students participated in this study.  Five were excluded, for a final N of 63 students.

During the learning phase of the study, participants were exposed to 57 pictures of birds and 57 pictures of flowers.

Participants were then told that they would receive 2 cents for each picture from one of the two categories. The experimental manipulation was whether participants would be rewarded for bird or flower pictures.

The dependent variables were frequency estimates of the number of pictures in each category. Specifically, the analysis examined whether participants gave a higher, equal, or lower estimate for the rewarded category than for the non-rewarded category.

When flowers were rewarded, 12 participants had higher estimates for flowers and 15 had higher estimates for birds.

When birds were rewarded, 21 participants had higher estimates for birds and 8 had higher estimates for flowers.

A chi-square test showed a just significant effect that was driven by the condition that rewarded birds, presumably because there was also a main effect of birds vs. flowers (birds are more memorable and accessible).

Chi2(1, N = 56) = 4.51,  p = .037.

Replication 

81 students participated in the replication study.  After exclusion of 4 participants, the final sample size was N = 77.

When flowers were rewarded, 16 participants had higher estimates for flowers and 11 had higher estimates for birds.

When birds were rewarded, 10 participants had higher estimates for birds and 14 had higher estimates for flowers.

The remaining participants were tied.

The chi-square test was not significant.

Chi2(1, N = 51) = 1.57,  p = .21.
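
The reported chi-square values can be checked against these counts. The snippet below is a minimal sketch (my own check, not code from either report); it assumes the test compared, across the two reward conditions, whether the rewarded or the non-rewarded category received the higher estimate, with no continuity correction, which approximately reproduces the reported values.

```python
# Minimal sketch (my own check, not code from either report): chi-square tests
# on the counts of untied participants. Rows: flowers rewarded, birds rewarded.
# Columns: higher estimate for the rewarded category, higher for the other one.
from scipy.stats import chi2_contingency

original    = [[12, 15],    # flowers rewarded
               [21,  8]]    # birds rewarded
replication = [[16, 11],    # flowers rewarded
               [10, 14]]    # birds rewarded

for label, table in (("original", original), ("replication", replication)):
    chi2, p, dof, _ = chi2_contingency(table, correction=False)
    n = sum(map(sum, table))
    print(f"{label}: Chi2({dof}, N = {n}) = {chi2:.2f}, p = {p:.3f}")
# -> roughly Chi2(1, N = 56) = 4.52 and Chi2(1, N = 51) = 1.57,
#    close to the values reported above.
```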

Conclusion

The original article tested a novel and controversial hypothesis that was not grounded in the large cognitive literature on frequency estimation.  The article relied on a just significant result in a single study as evidence.  It is not surprising that a replication study failed to replicate the finding.  The article had very little impact. In hindsight, this study does not meet the high bar for acceptance into a high impact journal like Psychological Science.  However, hindsight is 20/20 and it is well known that the foresight of traditional peer-review is an imperfect predictor of replicability and relevance.