Rindex Logo

Dr. R’s Blog about Replicability


DEFINITION OF REPLICABILITY:  In empirical studies with random error variance replicability refers to the probability of a study with a significant result to produce a significant result again in an exact replication study of the first study using the same sample size and significance criterion.


New (September 17, 2016):  Estimating Replicability: abstract and link to manuscript about statistical methods that can be used to estimate replicability on the basis of published test statistics. 

REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?




1. 2015 Replicability Rankings of over 100 Psychology Journals
Based on reported test statistics in all articles from 2015, the rankings show the typical strength of evidence for a statistically significant result in particular journals.  The method also estimates the file-drawer of unpublished non-significant results. Links to powergraphs provide further information (e.g., whether a journal has too many just significant results (p < .05 & p > .025).


2. A (preliminary) Introduction to the Estimation of Replicability for Sets of Studies with Heterogeneity in Power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3.  Replicability-Rankings of Psychology Departments
This blog presents rankings of psychology departments on the basis of the replicability of significant results published in 105 psychology journals (see the journal rankings for a list of journals).   Reported success rates in psychology journals are over 90%, but this percentage is inflated by selective reporting of significant results.  After correcting for selection bias, replicability is 60%, but there is reliable variation across departments.


4. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.

Featured Image -- 203

5.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

YM figure9

6.  Validation of Meta-Analysis of Observed (post-hoc) Power
This post examines the ability of various estimation methods to estimate power of a set of studies based on the reported test statistics in these studies.  The results show that most estimation methods work well when all studies have the same effect size (homogeneous) or if effect sizes are heterogeneous and symmetrically distributed (heterogeneous). However, most methods fail when effect sizes are heterogeneous and have a skewed distribution.  The post does not yet include the more recent method that uses the distribution of z-scores (powergraphs) to estimate observe power because this method was developed after this blog was posted.


7. Roy Baumeister’s R-Index
Roy Baumeister was a reviewer of my 2012 article that introduced the Incredibiliy Index to detect publication bias and dishonest reporting practices.  In his review and in a subsequent email exchange, Roy Baumeister admitted that his published article excluded studies that failed to produce results in support of his theory that blood-glucose is important for self-regulation (a theory that is now generally considered to be false), although he disagrees that excluding these studies was dishonest.  The R-Index builds on the incredibility index and provides an index of the strength of evidence that corrects for the influence of dishonest reporting practices.  This post reports the R-Index for Roy Baumeister’s most cited articles. The R-Index is low and does not justify the nearly perfect support for empirical predictions in these articles. At the same time, the R-Index is similar to R-Indices for other sets of studies in social psychology.  This suggests that dishonest reporting practices are the norm in social psychology and that published articles exaggerate the strength of evidence in support of social psychological theories.

http://schoolsnapshots.org/blog/2014/09/30/math-prize-for-girls-at-m-i-t/8. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.


9.  The R-Index for 18 Multiple-Study Psychology Articles in the Journal SCIENCE.
Francis (2014) demonstrated that nearly all multiple-study articles by psychology researchers that were published in the prestigious journal SCIENCE showed evidence of dishonest reporting practices (disconfirmatory evidence was missing).  Francis (2014) used a method similar to the incredibility index.  One problem of this method is that the result is a probability that is influenced by the amount of bias and the number of results that were available for analysis. As a result, an article with 9 studies and moderate bias is treated the same as an article with 4 studies and a lot of bias.  The R-Index avoids this problem by focusing on the amount of bias (inflation) and the strength of evidence.  This blog post shows the R-Index of the 18 studies and reveals that many articles have a low R-Index.


10.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.














Murderer back on the crime scene

How did Diedrik Stapel Create Fake Results? A forensic analysis of “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations”

Diederik Stapel represents everything that has gone wrong in social psychology.  Until 2011, he was seen as a successful scientists who made important contributions to the literature on social priming.  In the article “From Seeing to Being: Subliminal Social Comparisons Affect Implicit and Explicit Self-Evaluations” he presented 8 studies that showed that social comparisons can occur in response to stimuli that were presented witout awareness (subliminally).  The results were published in the top journal of social psychology published by the American Psychological Association (APA) and APA published a press-release for the general public about this work.  
In 2011, an investigation into Diedrik Stapel’s reserach practices revealed scientific fraud, which resulted in over 50 retractions (Retraction Watch), including the article on unconscious social comparisons (Retraction Notice).  In a book, Diederik Stapel told his story about his motives and practices, but the book is not detailed enough to explain how particular datasets were fabricated.  All we know, is that he used a number of different methods that range from making up datasets to the use of questionable research practices that increase the chance of producing a significant result.  These practices are widely used and are not considered scientific fraud, although the end result is the same. Published results no longer provide credible empirical evidence for the claims made in a published article.
I had two hypothesis. First, the data could be entirely made up. When researchers make up fake data they are likely to overestimate the real effect sizes and produce data that show the predicted pattern much more clearly than real data would. In this case, bias tests would not show a problem with the data.  The only evidence that the data are fake would be that the evidence is stronger than in other studies that relied on real data. 
In contrast, a researcher who starts with real data and then uses questionable practices is likely to use as little dishonest practices as possible because this makes it easier to justify the questionable decisions.  For example, removing 10% of data may seem justified, especially if some rational for exclusion can be found.  However, removing 60% of data cannot be justified.  The researcher will need to use these practices to produce the desired outcome, namely a p-value below .05 (or at least very close to .05).  As more use of questionable practices is not needed and harder to justify, the researcher will stop producing stronger evidence.  As a result, we would expect a large number of just significant results.
There are two bias tests that detect the latter form of fabricating significant results by means of questionable statistical methods; the Replicability-Index (R-Index) and the Test of Insufficient Variance (TIVA).   If Stapel used questionable statistical practices to produce just significant results, R-Index and TIVA would show evidence of bias.
The article reported 8 studies. The table shows the key finding of each study.
Study Statistic p z OP
1 F(1,28)=4.47 0.044 2.02 0.52
2A F(1,38)=4.51 0.040 2.05 0.54
2B F(1,32)=4.20 0.049 1.97 0.50
2C F(1,38)=4.13 0.049 1.97 0.50
3 F(1,42)=4.46 0.041 2.05 0.53
4 F(2,49)=3.61 0.034 2.11 0.56
5 F(1,29)=7.04 0.013 2.49 0.70
6 F(1,55)=3.90 0.053 1.93 0.49
All results were interpreted as evidence for an effect and the p-value for Study 6 was reported as p = .05.
All p-values are below .053 but greater than .01.  This is an unlikely outcome because sampling error should produce more variability in p-values.  TIVA examines whether there is insufficient variability.  First, p-values are converted into z-scores.  The variance of z-scores due to sampling error alone is expected to be approximately 1.  However, the observed variance is only Var(z) = 0.032.  A chi-square test shows that this observed variance is unlikely to occur by chance alone,  p = .00035. We would expect such an extremely small variabilty or even less variability in only 1 out of 28,458 sets of studies by chance alone.
The last column transforms z-scores into a measure of observed power. Observed power is an estimate of the probability of obtaining a significant result under the assumption that the observed effect size matches the population effect size.  These estimates are influenced by sampling error.  To get a more reliable estimate of the probability of a successful outcome, the R-Index uses the median. The median is 53%.  It is unlikely that a set of 8 studies with a 53% chance of obtaining a significant result produced significant results in all studies.  This finding shows that the reported success rate is not credible. To make matters worse, the probability of obtaining a significant result is inflated when a set of studies contains too many significant results.  To correct for this bias, the R-Index computes the inflation rate.  With 53% probability of success and 100% success rate, the inflation rate is 47%.  To correct for inflation, the inflation rate is subtracted from median observed probability, which yields an R-Index of 53% – 47% = 6%.  Based on this value, it is extremely unlikely that a researcher would obtain a significant result, if they would actually replicate the original studies exactly.  The published results show that Stapel could not have produced these results without the help of questionable methods, which also means nobody else can reproduce these results.
In conclusion, bias tests suggest that Stapel actually collected data and failed to find supporting evidence for his hypotheses.  He then used questionable practices until the results were statistically significant.  It seems unlikely that he outright faked these data and intentionally produced a p-value of .053 and reported it as p = .05.  However, statistical analysis can only provide suggestive evidence and only Stapel knows what he did to get these results.

A sarcastic comment on “Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology” by Harry Reis

“Promise, peril, and perspective: Addressing concerns about reproducibility in social–personality psychology”
Journal of Experimental Social Psychology 66 (2016) 148–152
DOI: http://dx.doi.org/10.1016/j.jesp.2016.01.005

a.k.a The Swan Song of Social Psychology During the Golden Age

Disclaimer: i wrote this piece because Jamie Pennebeker recommended writing as therapy to deal with trauma.  However, in his defense, he didn’t propose publishing the therapeutic writings.


You might think an article with reproducibiltiy in the title would have something to say about the replicability crisis in social psychology.  However, this article has very little to say about the causes of the replication crisis in social psychology and possible solutions to improve replicability. Instead, it appears to be a perfect example of repressive coping to avoid the traumatic realization that decades of work were fun, yet futile.

1. Introduction

The authors start with a very sensible suggestion. “We propose that the goal of achieving sound scientific insights and useful applications will be better facilitated over the long run by promoting good scientific practice rather than by stressing the need to prevent any and all mistakes.”  (p. 149).  The only question is how many mistakes we consider tolerable and that we do not know what the error rates are. Rosenthal pointed out it could be 100%, which even the authors might consider to be a little bit too high.

2. Improving research practice”

In this chapter, the authors suggest that “if there is anything on which all researchers might agree, it is the call for improving our research practices and techniques.” (p. 149).  If this were the case, we wouldn’t see articles in 2016 that make statistical mistakes that have been known for decades like pooling data from a heterogeneous set of studies or computing difference scores and using one of the variables as a predictor of the difference score.

It is also puzzling to read “the contemporary literature indicates just how central methodological innovation has been to advancing the field” (p. 149), when the key problem of low power has been known since 1962 and there is still no sign of improvement.

The authors also are not exactly in favor of adapting better methods, when these methods might reveal major problems in older studies.  For example, a meta-analysis in 2010 might not have examined publication bias and produced an effect size of more than half a standard deviation, when a new method that controls for publication bias finds that it is impossible to reject the null-hypothesis. No, these new methods are not welcome. “In our  view, they will stifle progress and innovation if they are seen primarily through the lens of maladaptive perfectionism; namely as ways of rectifying flaws and shortcomings in prior work.”  (p. 149).  So, what is the solution. Let’s pretend that subliminal priming made people walk slower in 1996, but stopped working in 2011?

This ends the chapter of improving research practice.  Yes, that is the way to deal with a crisis.  When the city is bankrupt, cut back on the Christmas decorations. Problem solved.

3. How to think about replications

Let’s start with a trivial statement that is as meaningless as saying, we would welcome more funding.  “Replications are valuable.” (p. 149).  Let’s also not mention that social psychologists have been the leader of requesting replication studies. No single study article shall be published in a social psychology journal. A minimum of three studies with conceptual replications of the key finding are needed to show that the results are robust and always produce significant results with p < .05 (or at least p < .10).  Yes, no other science has cherished replications as much as social psychology.

And eminent social psychologists Crandall and Sherman explain why. “to be a cumulative
and self-correcting enterprise, replications, be their results supportive, qualifying, or contradictory, must occur.”  Indeed, but what explains the 95% success rate of published replications in social psychology.  No need for self-correction, if the predictions are always confirmed.

Surprisingly, however, since 2011 a number of replication studies have been published in obscure journals that fail to replicate results.  This has never happened before and raises some concerns. What is going on here?  Why can these researchers not replicate the original results?  The answer is clear. They are doing it wrong.  “We concur with several authors (Crandall and Sherman, Stroebe) that conceptual replications offer the greatest potential to our field…  Much of the current debate, however, is focused narrowly on direct
or exact replications.” (p. 149). As philosopher know, you cannot step into the same river twice and so you cannot replicate the same study again.  To get a significant result, you need to do a similar, but not an identical replication study.

Another problem with failed replication studies is that these researchers assume that they are doing an exact replication study, but do not test this assumption. “In this light, Fabrigar’s insistence that researchers take more care to demonstrate psychometric invariance is well-placed” (p. 149).  Once more, the superiority of conceptual replication studies is self-evident. When you do a conceptual replication study, psychometric invariance is guaranteed and does not have to be demonstrated. Just one more reason, why conceptual replication studies in social psychology journals produce 95% success rate, whereas misguided exact replication attempts have failure rates of over 50%.

It is also important to consider the expertise of researchers.  Social psychologists often have demonstrated their expertise by publishing dozens of successful, conceptual replications.  In contrast, failed replications are often produced by novices with no track-record of ever producing a successful study.  These vast differences in previous success rate need to be taken into account in the evaluation of replication studies.  “Errors caused by low expertise or inadvertent changes are often catastrophic, in the sense of causing a study to fail completely, as Stroebe nicely illustrates.”

It would be a shame if psychology would start rewarding these replication studies.  Already limited research funds would be diverted to conducting studies that are easy to do, yet to difficult to do correctly for inexperienced researchers away from senior researchers who do difficult novel studies that always work and produced groundbreaking new insights into social phenomena during the “golden age” (p. 150) of social psychology.

The authors also point that failed studies are rarely failed studies. When these studies are properly combined with successful studies in a meta-analysis, the results nearly always show the predicted effect and that it was wrong to doubt original studies simply because replication studies failed to show the effect. “Deeper consideration of the terms “failed” and “underpowered” may reveal just how limited the field is by dichotomous thinking. “Failed” implies that a result at p = .06 is somehow inferior to one at p = .05, a conclusion
that scarcely merits disputation.” (p. 150).

In conclusion, we learn nothing from replication studies. They are a waste of time and resources and can only impede further development of social psychology by means of conceptual replication studies that build on the foundations laid during the “golden age” of social psychology.

4. Differential demands of different research topics

Some studies are easier to replicate than others, and replication failures might be “limited to studies that presented methodological challenges (i.e., that had protocols that were considered difficult to carry out) and that provided opportunities for experimenter bias” (p. 150).  It is therefore better, not to replicate difficult studies or to let original authors with a track-record of success conduct conceptual replication studies.

Moreover, some people have argued that the high succeess rate of original studies is inflated by publication bias (not writing up failed studies) and the use of questionable research practices (run more participants until p < .05).  To ensure that reported successes are real successes, some initiatives call for data sharing, pre-registration of data analysis plans, and a priori power analysis.  Although these may appear to be reasonable suggestions, the authors disagree.  “We worry that reifying any of the various proposals as a “best practice” for research integrity may marginalize researchers and research areas that study phenomena or use methods that have a harder time meeting these requirements.” (p. 150).

They appear to be concerns that researchers who do not preregister data analysis plans or do not share data may be stigmatized. “If not, such principles, no matter how well-intentioned, invite the possibility of discrimination, not only within the field but also by decision-makers who are not privy to these realities.”  (p. 150).

5. Considering broader implications

These are confusing times.  In the old days, the goal of research was clearly defined. Conduct at least three, loosely related , successful studies and write them up with a good story.  During these times, it was not acceptable to publish failed studies to maintain the 95% success rate. This made it hard for researchers who did not understand the rules of publishing only significant results. “Recently, a colleague of ours relayed his frustrating experience of submitting a manuscript that included one null-result study among several studies with statistically significant findings. He was met with rejection after rejection, all the while being told that the null finding weakened the results or confused the manuscript” (p. 151).

It is not clear what researchers should be doing now. Should they now report all of their studies, the good, the bad, and the ugly, or should they continue to present only the successful studies?   What if some researchers continue to publish the good old fashioned way that evolved during the golden age of social psychology and others try to publish results more in accordance with what actually happened in their lab?  “There is currently, a disconnect between what is good for scientists and what is good for science” and nobody is going to change while researchers who report only significant results get rewarded with publications in top journals.






There may also be little need to make major changes. “We agree with Crandall and Sherman, and also Stroebe, that social psychology is, like all sciences, a self-correcting enterprise” (p. 151).   And if social psychology is already self-correcting, it do not need new guidelines how to do research and new replication studies. Rather than instituting new policies, it might be better to make social psychology great again. Rather than publishing means and standard deviations or test statistics that allow data detectives to check results, it might be better to report only whether a result was significant, p < .05, and because 95% of studies are significant and the others are failed studies, we might simply not report any numbers.  False results will be corrected eventually because they will no longer be reported in journals and the old results might have been true even if they fail to replicate today.   The best approach is to fund researchers with a good track record of success and let them publish in the top journals.


Most likely, the replication crisis only exists in the imagination of overly self-critical psychologists. “Social psychologists are often reputed to be among the most severe critics of work within their own discipline” (p. 151).  A healthier attitude is to realize that “we already know a lot; with these practices, we can learn even more” (p. 151).

So, let’s get back to doing research and forget this whole thing that was briefly mentioned in the title called “concerns about reproducibility.”  Who cares that only 25% of social psychology studies from 2008 could be replicated in 2014.  In the meantime, thousands of new discoveries were made and it is time to make more new discoveries. “We should not get so caught up in perfectionistic concerns that they impede the rapid accumulation and dissemination of research findings” (p. 151).

There you have it folks.  Don’t worry about recent failed replications. This is just a normal part of science, especially a science that studies fragile, contextually sensitive phenomena. The results from 2008 do not necessarily replicate in 2014 and the results from 2014 may not replicate in 2018.  What we need is fewer replications. We need permanent research because many effects may disappear the moment they were discovered. This is what makes social psychology so exciting.  If you want to study stable phenomena that replicate decade after decade you might as well become a personality psychologist.






A replicability analysis of”I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning”

Dijksterhuis, A. (2004). I like myself but I don’t know why: Enhancing implicit self-esteem by subliminal evaluative conditioning. JOURNAL OF PERSONALITY AND SOCIAL PSYCHOLOGY,   Volume: 86,   Issue: 2,   Pages: 345-355. 

DOI: 10.1037/0022-3514.86.2.345

There are a lot of articles with questionable statistical results and it seems pointless to single out particular articles.  However, once in a while, an article catches my attention and I will comment on the statistical results in it.  This is one of these articles….

The format of this review highlights why articles like this passed peer-review and are cited at high frequency as if they provided empirical facts.  The reason is a phenomenon called “verbal overshadowing.”   In work on eye-witness testimony, participants first see the picture of a perpetrator. Before the actual line-up task, they are asked to give a verbal description of the tasks.  The verbal description can distort the memory of the actual face and lead to a higher rate of misidentifications.  Something similar happens when researchers read articles. Sometimes they only read abstracts, but even when they read the article, the words can overshadow the actual empirical results. As a result, memory is more strongly influenced by verbal descriptions than by the cold and hard statistical facts.

In the first part, I will present the results of the article verbally without numbers. In the second part, I will present only the numbers.

Part 1:

In the article “I Like Myself but I Don’t Know Why: Enhancing Implicit Self-Esteem by
Subliminal Evaluative Conditioning” Ap Dijksterhuis reports the results of six studies (1-4, 5a, 5b).  All studies used a partially or fully subliminal evaluative conditioning task to influence implicit measures of self-esteem. The abstract states: “Participants were repeatedly presented with trials in which the word I was paired with positive trait terms. Relative to control conditions, this procedure enhanced implicit self-esteem.”  Study 1 used preferences for initials to measure implicit self-esteem. and “results confirmed the hypothesis that evaluative conditioning enhanced implicit self-esteem.” (p. 348). Study 2 modified the control condition and showed that “participants in the conditioned self-esteem condition showed higher implicit self-esteem after the treatment than before the treatment, relative to control participants” (p. 348).  Experiment 3 changed the evaluative conditioning procedure. Now, both the CS and the US (positive trait terms) were
presented subliminally for 17 ms.  It also used the Implicit Association Test to measure implicit self-esteem.  The results showed that “difference in response latency between blocks was much more pronounced in the conditioned self-esteem condition, indicating higher self-esteem” (p. 349).  Study 4 also showed that “participants in the conditioned self-esteem condition exhibited higher implicit self-esteem than participants
in the control condition” (p. 350).  Study 5a and 5b showed that “individuals whose
self-esteem was enhanced seemed to be insensitive to personality feedback, whereas control participants whose self-esteem was not enhanced did show effects of the intelligence feedback.” (p. 352).  The General Discussion section summarizes the results. “In our experiments, implicit self-esteem was enhanced through subliminal evaluative conditioning. Pairing the self-depicting word I with positive trait terms consistently improved implicit self-esteem.” (p. 352).  A final conclusion section points out the potential of this work for enhancing self-esteem. “It is worthwhile to explicitly mention an intriguing aspect of the present work. Implicit self-esteem can be enhanced, at least temporarily, subliminally in about 25 seconds.” (p. 353).


Part 2:

Study Statistic p z OP
1 F(1,76)=5.15 0.026 2.22 0.60
2 F(1,33)=4.32 0.046 2.00 0.52
3 F(1,14)=8.84 0.010 2.57 0.73
4 F(1,79)=7.45 0.008 2.66 0.76
5a F(1,89)=4.91 0.029 2.18 0.59
5b F(1,51)=4.74 0.034 2.12 0.56

All six studies produced statistically significant results. To achieve this outcome two conditions have to be met: (a) the effect exists and (b) sampling error is small to avoid a failed study  (i.e., a non-significant result even though the effect is real).   The probability of obtaining a significant result is called power. The last column shows observed power. Observed power can be used to estimate the actual power of the six studies. Median observed power is 60%.  With 60% power, we would expect that only 60% of the 6 studies (3.6 studies) produce a significant result, but all six studies show a significant result.  The excess of significant result shows that the results in this article present an overly positive picture of the robustness of the effect.  If these six studies were replicated exactly, we would not expect to obtain six significant results again.  Moreover, the inflation of significant results also leads to an inflation of the power estimate. The R-Index corrects for this inflation by subtracting the inflation rate (100% observed success rate – 60% median observed power) from the power estimate.  The R-Index is .60 – .40 = .20.  Results with such a low R-Index often do not replicate in independent replication attempts.

Another method to examine the replicability of these results is to examine the variability of the z-scores (second last column).  Each z-score reflects the strength of evidence against the null-hypothesis. Even if the same study is replicated, this measure will vary as a function of random sampling.  The expected variance is approximately 1 (the standard deviation of a standard normal distribution).  Low variance suggests that future studies will produce more variable results and with p-values close to .05, this means that future studies are expected to produce non-significant results.  This bias test is called the Test of Insufficient Variance (TIVA).  The variance of the z-scores is Var(z) = 0.07.  The probability of this restricted variance to occur by chance is p = .003 (1/300).

Based on these results, the statistical evidence presented in this article is questionable and does not provide support for the conclusion that subliminal evaluative conditioning can enhance implicit self-esteem.  Another problem with this conclusion is that implicit self-esteem measures have low reliability and low convergent validity.  As a result, we would not expect strong and consistent effects of any experimental manipulation on these measures.  Finally, even if a small and reliable effect could be obtained, it remains an open question whether this effect shows an effect on implicit self-esteem or whether the manipulation produces a systematic bias in the measurement of implicit self-esteem.  “It is not yet known how long the effects of this manipulation last. In addition, it is not yet
known whether people who could really benefit from enhanced self-esteem (i.e., people with problematically low levels of self-esteem) can benefit from subliminal conditioning techniques.” (p. 353).  12 years later, we may wonder whether these results have been replicated in other laboratories and whether these effects last more than a few minutes after the conditioning experiment.

If you like Part I better, feel free to boost your self-esteem here.



Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes-Factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size.  We present an alternative Bayesian multi-level approach that avoids the pitfalls of Bayesian Evidence Synthesis, and show that the original set of studies produced at best weak evidence for an effect of social norms on reusing of towels.


Peer-Reviews from Psychological Methods

Times are changing. Media are flooded with fake news and journals are filled with fake novel discoveries. The only way to fight bias and fake information is full transparency and openness.
Jerry Brunner and I wrote a paper that examined the validity of z-curve, the method underlying powergraphs, to Psychological Methods.

As soon as we submitted it, we made the manuscript and the code available. Nobody used the opportunity to comment on the manuscript. Now we got the official reviews.

We would like to thank the editor and reviewers for spending time and effort on reading (or at least skimming) our manuscript and writing comments.  Normally, this effort would be largely wasted because like many other authors we are going to ignore most of their well-meaning comments and suggestions and try to publish the manuscript mostly unchanged somewhere else. As the editor pointed out, we are hopeful that our manuscript will eventually be published because 95% of written manuscripts get eventually published. So, why change anything.  However, we think the work of the editor and reviewers deserves some recognition and some readers of our manuscript may find them valuable. Therefore, we are happy to share their comments for readers interested in replicabilty and our method of estimating replicability from test statistics in original articles. 


Dear Dr. Brunner,

I have now received the reviewers’ comments on your manuscript. Based on their analysis and my own evaluation, I can no longer consider this manuscript for publication in Psychological Methods. There are two main reasons that I decided not to accept your submission. The first deals with the value of your statistical estimate of replicability. My first concern is that you define replicability specifically within the context of NHST by focusing on power and p-values. I personally have fewer problems with NHST than many methodologists, but given the fact that the literature is slowly moving away from this paradigm, I don’t think it is wise to promote a method to handle replicability that is unusable for studies that are conducted outside of it. Instead of talking about replicability as estimating the probability of getting a significant result, I think it would be better to define it in more continuous terms, focusing on how similar we can expect future estimates (in terms of effect sizes) to be to those that have been demonstrated in the prior literature. I’m not sure that I see the value of statistics that specifically incorporate the prior sample sizes into their estimates, since, as you say, these have typically been inappropriately low.

Sure, it may tell you the likelihood of getting significant results if you conducted a replication of the average study that has been done in the past. But why would you do that instead of conducting a replication that was more appropriately powered?

Reviewer 2 argues against the focus on original study/replication study distinction, which would be consistent with the idea of estimating the underlying distribution of effects, and from there selecting sample sizes that would produce studies of acceptable power. Reviewer 3 indicates that three of the statistics you discussed are specifically designed for single studies, and are no longer valid when applied to sets of studies, although this reviewer does provide information about how these can be corrected.

The second main reason, discussed by Reviewer 1, is that although your statistics may allow you to account for selection biases introduced by journals not accepting null results, they do not allow you to account for selection effects prior to submission. Although methodologists will often bring up the file drawer problem, it is much less of an issue than people believe. I read about a survey in a meta-analysis text (I unfortunately can’t remember the exact citation) that indicated that over 95% of the studies that get written up eventually get published somewhere. The journal publication bias against non-significant results is really more an issue of where articles get published, rather than if they get published. The real issue is that researchers will typically choose not to write up results that are non-significant, or will suppress non-significant findings when writing up a study with other significant findings. The latter case is even more complicated, because it is often not just a case of including or excluding significant results, but is instead a case where researchers examine the significant findings they have and then choose a narrative that makes best use of them, including non-significant findings when they are part of the story but excluding them when they are irrelevant. The presence of these author-side effects means that your statistic will almost always be overestimating the actual replicability of a literature.

The reviewers bring up a number of additional points that you should consider. Reviewer 1 notes that your discussion of the power of psychological studies is 25 years old, and therefore likely doesn’t apply. Reviewer 2 felt that your choice to represent your formulas and equations using programming code was a mistake, and suggests that you stick to standard mathematical notation when discussing equations. Reviewer 2 also felt that you characterized researcher behaviors in ways that were more negative than is appropriate or realistic, and that you should tone down your criticisms of these behaviors. As a grant-funded researcher, I can personally promise you that a great many researchers are concerned about power,since you cannot receive government funding without presenting detailed power analyses. Reviewer 2 noted a concern with the use of web links in your code, in that this could be used to identify individuals using your syntax. Although I have no suspicions that you are using this to keep track of who is reviewing your paper, you should remove those links to ensure privacy. Reviewer 1 felt that a number of your tables were not necessary, and both reviewers 2 and 3 felt that there were parts of your writing that could be notably condensed. You might consider going through the document to see if you can shorten it while maintaining your general points. Finally, reviewer 3 provides a great many specific comments that I feel would greatly enhance the validity and interpretability of your results. I would suggest that you attend closely to those suggestions before submitting to another journal.

For your guidance, I append the reviewers’ comments below and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.

Sincerely, Jamie DeCoster, PhD
Associate Editor
Psychological Methods


Reviewers’ comments:

Reviewer #1:

The goals of this paper are admirable and are stated clearly here: “it is desirable to have an alternative method of estimating replicability that does not require literal replication. We see this method as complementary to actual replication studies.”

However, I am bothered by an assumption of this paper, which is that each study has a power (for example, see the first two paragraphs on page 20). This bothers me for several reasons. First, any given study in psychology will often report many different p-values. Second, there is the issue of p-hacking or forking paths. The p-value, and thus the power, will depend on the researcher’s flexibility in analysis. With enough researcher degrees of freedom, power approaches 100% no matter how small the effect size is. Power in a preregistered replication is a different story. The authors write, “Selection for significance (publication bias) does not change the power values of individual studies.” But to the extent that there is selection done _within_ a study–and this is definitely happening–I don’t think that quoted sentence is correct.

So I can’t really understand the paper as it is currently written, as it’s not clear to me what they are estimating, and I am concerned that they are not accounting for the p-hacking that is standard practice in published studies.

Other comments:

The authors write, “Replication studies ensure that false positives will be promptly discovered when replication studies fail to confirm the original results.” I don’t think “ensure” is quite right, since any replication is itself random. Even if the null is true, there is a 5% chance that a replication will confirm just by chance. Also many studies have multiple outcomes, and if any appears to be confirmed, this can be taken as a success. Also, replications will not just catch false positives, they will also catch cases where the null hypothesis is false but where power is low. Replication may have the _goal_ of catching false positives, but it is not so discriminating.

The Fisher quote, “A properly designed experiment rarely fails to give …significance,” seems very strange to me. What if an experiment is perfectly designed, but the null hypothesis happens to be true? Then it should have a 95% chance of _not_ giving significance.

The authors write, “Actual replication studies are needed because they provide more information than just finding a significant result again. For example, they show that the results can be replicated over time and are not limited to a specific historic, cultural context. They also show that the description of the original study was sufficiently precise to reproduce the study in a way that it successfully replicated the original result.” These statements seem too strong to me. Successful replication is rejection of the null, and this can happen even if the original study was not described precisely, etc.

The authors write, “A common estimate of power is that average power is about 50% (Cohen 1962, Sedlmeier and Gigerenzer 1989). This means that about half of the studies in psychology have less than 50% power.” I think they are confusing the mean with the median here. Also I would guess that 50% power is an overestimate. For one thing, psychology has changed a lot since 1962 or even 1989 so I see no reason to take this 50% guess seriously.

The authors write, “We define replicability as the probability of obtaining the same result in an exact replication study with the same procedure and sample sizes.” I think that by “exact” they mean “pre-registered” but this is not clear. For example, suppose the original study was p-hacked. Then, strictly speaking, an exact replication would also be p-hacked. But I don’t think that’s what the authors mean. Also, it might be necessary to restrict the definition to pre-registered studies with a single test. Otherwise there is the problem that a paper has several tests, and any rejection will be taken as a successful replication.

I recommend that the authors get rid of tables 2-15 and instead think more carefully about what information they would like to convey to the reader here.

Reviewer #2:

This paper is largely unclear, and in the areas where it is clear enough to decipher, it is unwise and unprofessional.

This study’s main claim seems to be: “Thus, statistical estimates of replicability and the outcome of replication studies can be seen as two independent methods that are expected to produce convergent evidence of replicability.” This is incorrect. The approaches are unrelated. Replication of a scientific study is part of the scientific process, trying to find out the truth. The new study is not the judge of the original article, its replicability, or scientific contribution. It is merely another contribution to the scientific literature. The replicator and the original article are equals; one does not have status above the other. And certainly a statistical method applied to the original article has no special status unless the method, data, or theory can be shown to be an improvement on the original article.

They write, “Rather than using traditional notation from Statistics that might make it difficult for non-statisticians to understand our method, we use computer syntax as notation.” This is a disqualifying stance for publication in a serious scholarly journal, and it would an embarrassment to any journal or author to publish these results. The point of statistical notation is clarity, generality, and cross-discipline understanding. Computer syntax is specific to the language adopted, is not general, and is completely opaque to anyone who uses a different computer language. Yet everyone who understands their methods will have at least seen, and needs to understand, statistical notation. Statistical (i.e., mathematical) notation is the one general language we have that spans the field and different fields. No computer syntax does this. Proofs and other evidence are expressed in statistical notation, not computer syntax in the (now largely unused) S statistical language. Computer syntax, as used in this paper, is also ill-defined in that any quantity defined by a primitive function of the language can change any time, even after publication, if someone changes the function. In fact, the S language, used in this paper, is not equivalent to R, and so the authors are incorrect that R will be more understandable. Not including statistical notation, when the language of the paper is so unclear and self-contradictory, is an especially unfortunate decision. (As it happens I know S and R, but I find the manuscript very difficult to understand without imputing my own views about what the authors are doing. This is unacceptable. It is not even replicable.) If the authors have claims to make, they need to state them in unambiguous mathematical or statistical language and then prove their claims. They do not do any of these things.

It is untrue that “researchers ignore power”. If they do, they will rarely find anything of interest. And they certainly write about it extensively. In my experience, they obsess over power, balancing whether they will find something with the cost of doing the experiment. In fact, this paper misunderstands and misrepresents the concept: Power is not “the long-run probability of obtaining a statistically significant result.” It is the probability that a statistical test will reject a false null hypothesis, as the authors even say explicitly at times. These are very different quantities.

This paper accuses “researchers” of many other misunderstandings. Most of these are theoretically incorrect or empirically incorrect.One point of the paper seems to be “In short, our goal is to estimate average power of a set of studies with unknown population effect sizes that can assume any value, including zero.” But I don’t see why we need to know this quantity or how the authors’ methods contribute to us knowing it. The authors make many statistical claims without statistical proofs, without any clear definition of what their claims are, and without empirical evidence. They use simulation that inquires about a vanishingly small portion of the sample space to substitute for an infinite domain of continuous parameter values; they need mathematical proofs but do not even state their claims in clear ways that are amenable to proof.

No coherent definition is given of the quantity of interest. “Effect size” is not generic and hypothesis tests are not invariant to the definition, even if it is true that they are monotone transformations of each other. One effect size can be “significant” and a transformation of the effect size can be “not significant” even if calculated from the same data. This alone invalidates the authors’ central claims.

The first 11.5 pages of this paper should be summarized in one paragraph. The rest does not seem to contribute anything novel. Much of it is incorrect as well. Better to delete throat clearing and get on with the point of the paper.

I’d also like to point out that the authors have hard-coded URL links to their own web site in the replication code. The code cannot be run without making a call to the author’s web site, and recording the reviewer’s IP address in the authors’ web logs. Because this enables the authors to track who is reviewing the manuscript, it is highly inappropriate. It also makes it impossible to replicate the authors results. Many journals (and all federal grants) have prohibitions on this behavior.

I haven’t checked whether Psychological Methods has this rule, but the authors should know better regardless.

Reviewer 3

Review of “How replicable is psychology? A comparison of four methods of estimating replicability on the bias of test statistics in original studies”

It was my pleasure to review this manuscript. The authors compare four methods of estimating replicability. One undeniable strength of the general approach is that these measures of replicability can be computed before or without actually replicating the study/studies. As such, one can see the replicability measure of a set of statistically significant findings as an index of trust in these findings, in the sense that the measure provides an estimate of the percentage of these studies that is expected to be statistically significant when replicating them under the same conditions and same sample size (assuming the replication study and the original study assess the same true effect). As such, I see value in this approach. However, I have many comments, major and minor, which will enable the authors to improve their manuscript.

Major comments

1. Properties of index.

What I miss, and what would certainly be appreciated by the reader, is a description of properties of the replicability index. This would include that it has a minimum value equal to 0.05 (or more generally, alpha), when the set of statistically significant studies has no evidential value. Its maximum value equals 1, when the power of studies included in the set was very large. A value of .8 corresponds to the situation where statistical power of the original situation was .8, as often recommended. Finally, I would add that both sample size and true effect size affect the replicability index; a high value of say .8 can be obtained when true effect size is small in combination with a large sample size (you can consider giving a value of N, here), or with a large true effect size in combination with a small sample size (again, consider giving values).

Consider giving a story like this early, e.g. bottom of page 6.

2. Too long explanations/text

Perhaps it is a matter of taste, but sometimes I consider explanations much too long. Readers of Psychological Methods may be expected to know some basics. To give you an example, the text on page 7 in “Introduction of Statistical Methods for Power estimation” is very long. I believe its four paragraphs can be summarized into just one; particularly the first one can be summarized in one or two sentences. Similarly, the section on “Statistical Power” can be shortened considerably, imo. Other specific suggestions for shortening the text, I mention below in the “minor comments” section. Later on I’ll provide one major comment on the tables, and how to remove a few of them and how to combine several of them.

3. Wrong application of ML, p-curve, p-uniform

This is THE main comment, imo. The problem is that ML (Hedges, 1984), p-curve, p-uniform, enable the estimation of effect size based on just ONE study. Moreover,  Simonsohn (p-curve) as well as the authors of p-uniform would argue against estimating the average effect size of unrelated studies. These methods are meant to meta-analyze studies on ONE topic.

4. P-uniform and p-curve section, and ML section

This section needs a major revision. First, I would start the section with describing the logic of the method. Only statistically significant results are selected. Conditional on statistical significance, the methods are based on conditional p-values (not just p-values), and then I would provide the formula on top of page 18. Most importantly, these techniques are not constructed for estimating effect size of a bunch of unrelated studies. The methods should be applied to related studies. In your case, to each study individually. See my comments earlier.

Ln(p), which you use in your paper, is not a good idea here for two reasons: (1) It is most sensitive to heterogeneity (which is also put forward by Van Assen et al (2014), and (2) applied to single studies it estimates effect size such that the conditional p-value equals 1/e, rather than .5  (resulting in less nice properties).

The ML method, as it was described, focuses on estimating effect size using one single study (see Hedges, 1984). So I was very surprised to see it applied differently by the authors. Applying ML in the context of this paper should be the same as p-uniform and p-curve, using exactly the same conditional probability principle. So, the only difference between the three methods is the method of optimization. That is the only difference.

You develop a set-based ML approach, which needs to assume a distribution of true effect size. As said before, I leave it up to you whether you still want to include this method. For now, I have a slight preference to include the set-based approach because it (i) provides a nice reference to your set-based approach, called z-curve, and (ii) using this comparison you can “test” how robust the set-based ML approach is against a violation of the assumption of the distribution of true effect size.

Moreover, I strongly recommend showing how their estimates differ for certain studies, and include this in a table. This allows you to explain the logic of the methods very well. Here a suggestion. I would provide the estimates of four methods (…) for p-values .04, .025, .01, .001, and perhaps .0001). This will be extremely insightful. For small p-values, the three methods&rsquo; estimates will be similar to the traditional estimate. For p-values > .025, the estimate will be negative, for p = .025 the estimate will be (close to) 0. Then, you can also use these same studies and p-values to calculate the power of a replication study (R-index).

I would exclude Figure 1, and the corresponding text. Is not (no longer) necessary.

For the set-based ML approach, if you still include it, please explain how you get to the true value distribution (g(theta)).

5a. The MA set, and test statistics

Many different effect sizes and test statistics exist. Many of them can be transformed to ONE underlying parameter, with a sensible interpretation and certain statistical properties. For instance, the chi2, t, and F(1,df) can all be transformed to d or r, and their SE can be derived. In the RPP project and by John et al (2016) this is called the MA set. Other test statistics, such as F(>1, df) cannot be converted to the same metric, and no SE is defined on that metric. Therefore, the statistics F(>1,df) were excluded from the meta-analyses in the RPP  (see the supplementary materials of the RPP) and by Johnson et al (2016) and also Morey and Lakens (2016), who also re-analyzed the data of the RPP.

Fortunately, in your application you do not estimate effect size but only estimate power of a test, which only requires estimating the ncp and not effect size. So, in principle you can include the F(>1,df) statistics in your analyses, which is a definite advantage. Although I can see you can incorporate it for the ML, p-curve, p-uniform approach, I do not directly see how these F(>1,df) statistics can be used for the two set-based methods (ML and z-curve); in the set-based methods, you put all statistics on one dimension (z) using the p-values. How do you defend this?

5b. Z-curve

Some details are not clear to me, yet. How many components (called r in your text) are selected, and why? Your text states: “First, select a ncp parameter m ; . Then generate Z from a normal distribution with mean m ; I do not understand, since the normal distribution does not have an ncp. Is it that you nonparametrically model the distribution of observed Z, with different components?

Why do you use kernel density estimation? What is it’s added value? Why making it more imprecise by having this step in between? Please explain.

Except for these details, procedure and logic of z-curve are clear

6. Simulations (I): test statistics

I have no reasons, theoretical or empirical, why the analyses would provide different results for Z, t, F(1,df), F(>1,df), chi2. Therefore, I would omit all simulation results of all statistics except 1, and not talk about results of these other statistics. For instance, in the simulations section I would state that results are provided on each of these statistics but present here only the results of t, and of others in supplementary info. When applying the methods to RPP, you apply them to all statistics simultaneously, which you could mention in the text (see also comment 4 above).

7. mean or median power (important)

One of my most important points is the assessment of replicability itself. Consider a set of studies for which replicability is calculated, for each study. So, in case of M studies, there are M replicability indices. Which statistics would be most interesting to report, i.e., are most informative? Note that the distribution of power is far some symmetrical, and actually may be bimodal with modes at 0.05 and 1.  For that reason alone, I would include in any report of replicability in a field the proportion of R-indices equal to 0.05 (which amounts to the proportion of results with .025 < p < .05) and the proportion of R-indices equal to 1.00 (e.g., using two decimals, i.e. > .995). Moreover, because power values are recommend of .8 or more, I also could include the proportion of studies with power > .8.

We also would need a measure of central tendency. Because the distribution is not symmetric, and may be skewed, I recommend using the median rather than the mean. Another reason to use the median rather than the mean is because the mean does not provide useable information on whether methods are biased or not, in the simulations. For instance, if true effect size = 0, because of sampling error the observed power will exceed .05 in exactly 50% of the cases (this is the case for p-uniform; since with probability .5 the p-value will exceed .025) and larger than .05 in the other 50% of the cases. Hence, the median will be exactly equal to .05, whereas the mean will exceed .05. Similarly, if true effect size is large the mean power will be too small (distribution skewed to the left). To conclude, I strongly recommend including the median in the results of the simulation.

In a report, such as for the RPP later on in the paper, I recommend including (i)

p(R=.05), (ii) p(R >= .8), (iii) p(R>= .995), (iv) median(R), (v) sd(R), (vi)

distribution R, (vii) mean R. You could also distinguish this for soc psy and cog psy.

8. simulations (II): selection of conditions

I believe it is unnatural to select conditions based on “mean true power” because we are most familiar with effect size and their distribution, and sample sizes and their distribution. I recommend describing these distributions, and then the implied power distribution (surely the median value as well, not or not only the mean).

9.  Omitted because it could reveal identity of reviewer

10. Presentation of results

I have comments on what you present, on how you present the results. First, what you present. For the ML and p-methods, I recommend presenting the distribution of R in each of the conditions (at least for fixed true effect size and fixed N, where results can be derived exactly relatively easy). For the set-based methods, if you focus on average R (which I do not recommend, I recommend median R), then present the RMSE. The absolute median error is minimized when you use the median. So, average-RMSE is a couple, and median-absolute error is a couple.

Now the presentation of results. Results of p-curve/p-uniform/ML are independent of the number of tests, but set-based methods (your ML variant) and z-curve are not.

Here the results I recommend presenting:

Fixed effect size, heterogeneity sample size

**For single-study methods, the probability distribution of R (figure), including mean(R), median(R), p(R=.05), p(R>= .995), sd(R). You could use simulation for approximating this distribution. Figures look like those in Figure 3, to the right.

**Median power, mean/sd as a function of K

**Bias for ML/p-curve/p-uniform amounts to the difference between median of distribution and the actual median, or the difference between the average of the distribution and the actual average. Note that this is different from set-based methods.

**For set-based methods, a table is needed (because of its dependence on k).

Results can be combined in one table (i.e., 2-3, 5-6, etc)

Significance tests comparing methods

I would exclude Table 4, Table 7, Table 10, Table 13. These significance tests do not make much sense. One method is better than another, or not – significance should not be relevant (for a very large number of iterations, a true difference will show up). You could simply describe in the text which method works best.

Heterogeneity in both sample size and effect size

You could provide similar results as for fixed effect size (but not for chi2, or other statistics). I would also use the same values of k as for the fixed effect case. For the fixed effect case you used 15, 25, 50, 100, 250. I can imagine using as values of k for both conditions k = 10, 30, 100, 400, 2,000 (or something).

Including the k = 10 case is important, because set-based methods will have more problems there, and because one paper or a meta-analysis or one author may have published just one or few statistically significant effect sizes. Note, however, that k=2,000 is only realistic when evaluating a large field.

Simulation of complex heterogeneity

Same results as for fixed effect size and heterogeneity in both sample size and effect size. Good to include a condition where the assumption of set-based ML is violated. I do not yet see why a correlation between N and ES may affect the results. Could you explain? For instance, for the ML/p-curve/p-uniform methods, all true effect sizes in combination with N result in a distribution of R for different studies; how this distribution is arrived at, is not relevant, so I do not yet see the importance of this correlation. That is, this correlation should only affect the results through the distribution of R. More reasoning should be provided, here.

Simulation of full heterogeneity

I am ambivalent about this section. If test statistic should not matter, then what is the added value of this section? Other distributions of sample size may be incorporated in previous section “complex heterogeneity”;. Other distributions of true effect may also be incorporated in the previous section. Note that Johnson et al (2016) use the RPP data to estimate that 90% of effects in psychology estimate a true zero effect. You assume only 10%.

Conservative bootstrap

Why only presenting the results of z-curve? By changing the limits of the interval, the interpretation becomes a bit awkward; what kind of interval is it now? Most importantly, coverages of .9973 or .9958 are horrible (in my opinion, these coverages are just as bad as coverages of .20). I prefer results of 95% confidence intervals, and then show their coverages in the table. Your &lsquo;conservative&rsquo; CIs are hard to interpret. Note also that this is paper on statistical properties of the methods, and one property is how well the methods perform w.r.t. 95% CI.

By the way, examining 95% CI of the methods is very valuable.

11. RPP

In my opinion, this section should be expanded substantially. This is where you can finally test your methodology, using real data! What I would add is the following: **Provide the distribution of R (including all statistics mentioned previously, i.e. p(R=0.05), p(R >= .8), p(R >= .995), median(R), mean(R), sd(R), using single-study methods **Provide previously mentioned results for soc psy and cog psy separately **Provide results of z-curve, and show your kernel density curve (strange that you never show this curve, if it is important in your algorithm).  What would be really great, is if you predict the probability of replication success (power) using the effect size estimate based on the original effect size estimated (derived from a single study) and the N of the replication sample. You could make a graph with on the X-axis this power, and on the Y-axis the result of the replication. Strong evidence in favor of your method would be if your result better predicts future replicability than any other index (see RPP for what they tried). Logistic regression seems to be the most appropriate technique for this.

Using multiple logistic regression, you can also assess if other indices have an added value above your predictions.

To conclude, for now you provide too limited results to convince readers that your approach is very useful.

Minor comments

P4 top: “heated debates” A few more sentence on this debate, including references to those debates would be fair. I would like to mention/recommend the studies of Maxwell et al (2015) in American Psychologist, the comment on the OSF piece in Science, and its response, and the very recent piece of Valen E Johnson et al (2016).

P4, middle: consider starting a new paragraph at “Actual replication”; In the sentence after this one, you may add “or not”;.

Another advantage of replication is that it may reveal heterogeneity (context dependence). Here, you may refer to the ManyLabs studies, which indeed reveal heterogeneity in about half of the replicated effects. Then, the next paragraph may start with “At the same time” To conclude, this piece starting with “Actual replication”; can be expanded a bit

P4, bottom,  “In contrast”; This and the preceding sentence is formulated as if sampling error does not exist. It is much too strong! Moreover, if the replication study had low power, sampling error is likely the reason of a statistically insignificant result. Here you can be more careful/precise. The last sentence of this paragraph is perfect.

P5, middle: consider adding more refs on estimates of power in psychology, e.g. Bakker and Wicherts 35% and that study on neuroscience with power estimates close to 20%. Last sentence of the same paragraph; assuming same true effect and same sample size.

P6, first paragraph around Rosenthal. Consider referring to the study of Johson et al (2016), who used a Bayesian analysis to estimate how many non-significant studies remain unpublished.

P7, top: &ldquo;studies have the same power (homogenous case) “(heterogenous case). This is awkward. Homogeneity and heterogeneity is generally reserved for variation in true effect size. Stick to that. Another problem here is that “heterogeneous”; power can be created by “heterogeneity”; in sample size and/or heterogeneity in effect size. These should be distinguished, because some methods can deal with heterogeneous power caused by heterogeneous N, but not heterogeneous true effect size. So, here, I would simple delete the texts between brackets.

P7, last sentence of first paragraph; I do not understand the sentence.

P10, “average power”. I did not understand this sentence.

P10, bottom: Why do you believe these methods to be most promising?

P11, 2nd par: Rephrase this sentence. Heterogeneity of effect size is not because of sampling variation. Later in this paragraph you also mix up heterogeneity with variation in power again. Of course, you could re-define heterogeneity, but I strongly recommend not doing so (in order not to confuse others); reserve heterogeneity to heterogeneity in true effect size.

P11, 3rd par, 1st sentence: I do not understand this sentence. But then again, this sentence may not be relevant (see major comments), because for applying p-uniform and p-curve heterogeneity of effect size is not relevant.

P11 bottom: maximum likelihood method. This sentence is not specific enough. But then again, this sentence may not be relevant (see major comments).

P12: Statistics without capital.

P12: “random sampling distribution”; delete “random”;. By the way, I liked this section on Notation and statistical background.

Section “Two populations of power”;. I believe this section is unnecessarily long, with a lot of text. Consider shortening. The spinning wheel analogy is ok.

P16, “close to the first” You mean second?

P16, last paragraph, 1st sentence: English?

Principle 2: The effect on what? Delete last sentence in the principle.

P17, bottom: include the average power after selection in your example.

p-curve/p-uniform: modify, as explained in one of the major comments.

P20, last sentence: Modify the sentence – the ML approach has excellent properties asymptotically, but not sample size is small. Now it states that it generally yields more precise estimates.

P25, last sentence of 4. Consider deleting this sentence (does not add anything useful).

P32: “We believe that a negative correlation between” some part of sentence is missing.

P38, penultimate sentence: explain what you mean by “decreasing the lower limit by .02”; and “increasing the upper limit by .02”;.


How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Manuscript under review, copyright belongs to Jerry Brunner and Ulrich Schimmack

How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Jerry Brunner and Ulrich Schimmack
University of Toronto @ Mississauga

In the past five years, the replicability of original findings published in psychology journals has been questioned. We show that replicability can be estimated by computing the average power of studies. We then present four methods that can be used to estimate average power for a set of studies that were selected for significance: p-curve, p-uniform, maximum likelihood, and z-curve. We present the results of large-scale simulation studies with both homogeneous and heterogeneous effect sizes. All methods work well with homogeneous effect sizes, but only maximum likelihood and z-curve produce accurate estimates with heterogeneous effect sizes. All methods overestimate replicability using the Open Science Collaborative reproducibility project and we discuss possible reasons for this. Based on the simulation studies, we recommend z-curve as a valid method to estimate replicability. We also validated a conservative bootstrap confidence interval that makes it possible to use z-curve with small sets of studies.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, P-curve, P-uniform, Z-curve, Effect size, Replicability, Simulation.

Link to manuscript:  http://www.utstat.utoronto.ca/~brunner/zcurve2016/HowReplicable.pdf

Link to website with technical supplement:




A Critical Review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”

In this review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”, I present verbatim quotes form their chapter and explain why these statements are misleading or false, and how the authors distort the actual evidence by selectively citing research that supports their claims, while hiding evidence that contradicts their claims. I show that the empirical evidence for the claims made by Schwarz and Strack is weak and biased.

Unfortunately, this chapter has had a strong influence on Daniel Kahneman’s attitude towards life-satisfaction judgments and his fame as a Noble laureate has led many people to believe that life-satisfaction judgments are highly sensitive to the context in which these questions are asked and practically useless for the measurement of well-being.  This has led to claims that wealth is not a predictor of well-being, but only a predictor of invalid life-satisfaction judgments (Kahneman et al., 2006) or that the effects of wealth on well-being are limited to low incomes.  None of these claims are valid because they rely on the unsupported assumption that life-satisfaction judgments are invalid measures of well-being.

The original quotes are highlighted in bold followed by my comments.

Much of what we know about individuals’ subjective well-being (SWB) is based on self-reports of happiness and life satisfaction.

True. The reason is that sociologists developed brief, single-item measures of well-being that could be included easily in large surveys such as the World Value Survey, the German Socio-Economic Panel, or the US General Social Survey.  As a result, there is a wealth of information about life-satisfaction judgments that transcends scientific disciplines. The main contribution of social psychologists to this research program that examines how social factors influence human well-being has been to dismiss the results based on claims that the measure of well-being is invalid.

As Angus Campbell (1981) noted, the “use of these measures is based on the assumption that all the countless experiences people go through from day to day add to . . . global feelings of well-being, that these feelings remain relatively constant over extended periods, and that people can describe them with candor and accuracy”

Half true.  Like all self-report measures, the validity of life-satisfaction judgments depends on respondents’ ability and willingness to provide accurate information.  However, it is not correct to suggest that life-satisfaction judgments assume that feelings remain constant over extended periods of time or that respondents have to rely on feelings to answer questions about their satisfaction with life.  There is a long tradition in the well-being literature to distinguish cognitive measures of well-being like Cantrill’s ladder and affective measures that focus on affective experiences in the recent past like Bradburn’s affect balance scale.  The key assumption underlying life-satisfaction judgments is that respondents have chronically accessible information about their lives or can accurately estimate the frequency of positive and negative feelings. It is not necessary that the feelings are stable.

These assumptions have increasingly been drawn into question, however, as the empirical work has progressed.

It is not clear which assumptions have been drawn into question.  Are people unwilling to report their well-being, are they unable to do so, or are feelings not as stable as they are assumed to be? Moreover, the statement ignores a large literature that has demonstrated validity of well-being measures going back to the 1930s (see Diener et al., 2009; Scheider & Schimmack, 2009, for a meta-analysis).

First, the relationship between individuals’ experiences and objective conditions of life and their subjective sense of well-being is often weak and sometimes counter-intuitive.  Most objective life circumstances account for less than 5 percent of the variance in measures of SWB, and the combination of the circumstances in a dozen domains of life does not account for more than 10 percent (Andrews and Whithey 1976; Kammann; 1982; for a review, see Argyle, this volume).


First, it is not clear what weak means. How strong should the correlation between objective conditions of life and subjective well-being be?  For example, should marital status be a strong predictor of happiness? Maye it matters more whether people are happily married or unhappily married than whether they are married or single?  Second, there is no explanation for the claim that these relationships are counter-intuitive.  Employment, wealth, and marriage are positively related to well-being as most people would expect. The only finding in the literature that may be considered counter-intuitive is that having children does not notably increase well-being and sometimes decreases well-being. However, this does not mean well-being measures are false, it may mean that people’s intuitions about the effects of life-events on well-being are wrong. If intuitions would always be correct, we would not need scientific studies of determinants of well-being.


Second, measures of SWB have low test-retest reliabilities, usually hovering around .40, and not exceeding .60 when the same question is asked twice during the same one-hour interview (Andrews and Whithey 1976; Glatzer 1984). 


This argument ignores that responses to a single self-report item often have a large amount of random measurement error, unless participants can recall their previous answer.  The typical reliability of a single-item self-report measure is about r  =.6 +/- .2.  There is nothing unique about the results reported here for well-being measures. Moreover, the authors blatantly ignore evidence that scales with multiple items like Diener’s Satisfaction with Life Scale have retest correlations over r = .8 over a one-month period (see Schimmack & Oishi, 2005, for a meta-analysis).  Thus, this statement is misleading and factually incorrect.


Moreover, these measures are extremely sensitive to contextual influences.


This claim is inconsistent with the high retest correlation over periods of one month. Moreover, survey researchers have conducted numerous studies in which they examined the influence of the survey context on well-being measures and a meta-analysis of these studies shows only a small effect of previous items on these judgments and the pattern of results is not consistent across studies (see Schimmack & Oishi, 2005 for a meta-analysis).


Thus, minor events, such as finding a dime (Schwarz 1987) or the outcome of soccer games (Schwarz et al. 1987), may profoundly affect reported satisfaction with one’s life as a whole.


As I will show, the chapter makes many statements about what may happen.  For example, finding a dime may profoundly affect well-being report or it may not have any effect on these judgments.  These statements are correct because well-being reports can be made in many different ways. The real question is how these judgments are made when well-being measures are used to measure well-being. Experimental studies that manipulate the situation cannot answer this question because they purposefully create the situation to demonstrate that respondents may use mood (when mood is manipulated) or may use information that is temporarily accessible, when relevant information is made salient and temporarily accessible. The processes underlying judgments in these experiments may reveal influences on life-satisfaction judgment in a real survey context or they may reveal processes that do not occur under normal circumstances.


Most important, however, the reports are a function of the research instrument and are strongly influenced by the content of preceding questions, the nature of the response alternatives, and other “technical” aspects of questionnaire design (Schwarz and Strack 1991a, 1991b).


We can get different answers to different questions.  The item “So far, I have gotten everything I wanted in life” may be answered differently than the item “I feel good about my life, these days.”  If so, it is important to examine which of these items is a better measure of well-being.  It does not imply that all well-being items are flawed.  The same logic applies to the response format.  If some response formats produce different results than others, it is important to determine which response formats are better for the measurement of well-being.  Last, but not least, the claim that well-being reports are “strongly influenced by the content of preceding questions” is blatantly false.  A meta-analysis shows that strong effects were only observed in two studies by Strack, but that other studies find much weaker or no effects (see Schimmack & Oishi, 2005, for a meta-analysis).


Such findings are difficult to reconcile with the assumption that subjective social indicators directly reflect stable inner states of well-being (Campbell 1981) or that the reports are based on careful assessments of one’s objective conditions in light of one’s aspirations (Glatzer and Zapf 1984). Instead, the findings suggest that reports of SWB are better conceptualized as the result of a judgment process that is highly context-dependent.


Indeed. A selective and bias list of evidence is inconsistent with the hypothesis that well-being reports are valid measures of well-being, but this only shows that the authors misrepresent the evidence, not that well-being reports lack validity, which was carefully examined in Andrew & Withey’s book (1976), which the authors cite without mentioning the evidence presented in the book for the usefulness of well-being reports.




Not surprisingly, individuals may draw on a wide variety of information when asked to assess the subjective quality of their lives.


Indeed. This means that it is impossible to generalize from an artificial context created in an experiment to the normal conditions of a well-being survey because respondents may use different information in the experiment than in the naturalistic context. The experiment may led respondents to use information that they normally would not use.




Comparison-based evaluative judgments require a mental representation of the object of judgment, commonly called a target, as well as a mental representation of a relevant standard to which the target can be compared.


True. In fact, Cantril’s ladder explicitly asks respondents to compare their actual life to the best possible life they could have and the worst possible life they could have.  We can think about these possible lives as imaginary intrapersonal comparisons.


When asked, “Taking all things together, how would you say things are these days?” respondents are ideally assumed to review the myriad of relevant aspects of their lives and to integrate them into a mental representation of their life as a whole.”


True, this is the assumption underlying the use of well-being reports as measures of well-being.


In reality, however, individuals rarely retrieve all information that may be relevant to a judgment


This is also true. It is impossible to retrieve ALL of the relevant information. But it is possible that respondents retrieve most of the relevant information or enough relevant information to make these judgments valid. We do not require 100% validity for measures to be useful.


Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).


This is also plausible. The question is what would be the criterion for sufficient certainty for well-being judgments and whether this level of certainty is reached without retrieval of relevant information. For example, if I have to report how satisfied I am with my life overall and I am thinking first about my marriage would I stop there or would I think that my overall life is more than my marriage and also think about my work?  Depending on the answer to this question, well-being judgments may be more or less valid.


Hence, the judgment is based on the information that is most accessible at that point in time. In general, the accessibility of information depends on the recency and frequency of its use (for a review, see Higgins 1996).


This also makes sense.  A sick person may think about their health. A person in a happy marriage may think about their loving wife, and a person with financial problems may think about their problems paying bills.  Any life domain that is particularly salient in a person’s life is also likely to be a salient when they are confronted with a life-satisfaction question. However, we still do not know which information people will use and how much information they will use before they consider their judgment sufficiently accurate to provide an answer. Would they use just one salient temporarily accessible piece of information or would be continue to look for more information?


Information that has just been used-for example, to answer a preceding question in the questionnaire-is particularly likely to come to mind later on, although only for a limited time.


Wait a second.  Higgins emphasized that accessibility is driven by recency and frequency (!) of use. Individual who are going through a divorce or cancer treatment have probably thought frequently about this aspect of their lives.  A single question about their satisfaction with their recreational activities may not make them judge their lives based on their hobbies. Thus, it does not follow from Higgins’s work on accessibility that preceding items have a strong influence on well-being judgments.


This temporarily accessible information is the basis of most context effects in survey measurement and results in variability in the judgment when the same question is asked at different times (see Schwarz and Strack 1991b; Strack 1994a; Sudman, Bradburn, and Schwarz 1996, chs. 3 to 5; Tourangeau and Rasinski 1988)


Once more, the evidence for these temporary accessibility effects is weak and it is not clear why well-being judgments would be highly stable over time, if they were driven by making irrelevant information temporarily accessible.  In fact, the evidence is more consistent with Higgins’ suggests that frequent of use influences well-being judgments.  Life domains that are salient to individuals are likely to influence life-satisfaction judgments because they are chronically accessible even if other information is temporarily accessible or primed by preceding questions.


Other information, however, may come to mind because it is used frequently-for example, because it relates to the respondent’s current concerns (Klinger 1977) or life tasks (Cantor and Sanderson, this volume). Such chronically accessible information reflects important aspects of respondents’ lives and provides for some stability in judgments over time.


Indeed, but look at the wording. “This temporarily accessible information IS the basis of most context effects in survey measurement” vs. “Other information, however, MAY come to mind.”  The wording is not balanced and it does not match the evidence that most of the variation in well-being reports across individuals is stable over time and only a small proportion of the variance changes systematically over time. The wording is an example of how some scientists create the illusion of a balanced literature review while pushing their biased opinions.


As an example, consider experiments on question order. Strack, Martin, and Schvvarz (1988) observed that dating frequency was unrelated to students’ life satisfaction when a general satisfaction question preceded a question about the respondent’s dating frequency, r = – 12.  Yet reversing the question order increased the correlation to r = .66.  Similarly, marital satisfaction correlated with general life satisfaction r = .32 when the general question preceded the marital one in another study (Schwarz, Strack, and Mai 1991). Yet reversing the question order again increased this correlation to r = .67.


The studies that are cited here are not representative. They show the strongest item-order effects and the effects are much stronger than the meta-analytic average (Schimmack & Oishi, 2005). Both studies were conducted by Strack. Thus, these examples are at best considered examples what might happen under very specific conditions that differ from other specific conditions where the effect was much smaller. Moreover, it is not clear why dating frequency should be a strong positive predictor of life-satisfaction. Why is my life better when I have a lot of dates as opposed to somebody who is in a steady relationship, and we would not expect a married respondent with lots of dates to be happy with their marriage. The difference between r = .32 and r = .66 is strong, but it was obtained with small samples and it is common that small samples overestimate effect sizes. In fact, large survey studies show much weaker effects. In short, by focusing on these two examples, the authors create the illusion that strong effects of preceding items are common and that these studies are just an example of these effects. In reality, these are the only two studies with extremely and unusually strong effects that are not representative of the literature. The selective use of evidence is another example of unscientific practices that undermine a cumulative science.


Findings of this type indicate that preceding questions may bring information to mind that respondents would otherwise not consider.


Yes, it may happen, but we do not know under what specific circumstances it happens.  At present, the only predictor of these strong effects is that the studies were conducted by Fritz Strack. Nobody else has reported such strong effects.


If this information is included in the representation that the respondent forms of his or her life, the result is an assimilation effect, as reflected in increased correlations. Thus, we would draw very different inferences about the impact of dating frequency or marital satisfaction on overall SWB, depending on the order in which the questions are asked.


Now the authors extrapolate from extreme examples and discuss possible theoretical implications if this were a consistent and replicable finding.  “We would draw different inferences.”  True. If this were a replicable finding and we would ask about specific life domains first, we would end up with false inferences about the importance of dating and marriage for life-satisfaction. However, it is irrelevant what follows logically from a false assumption (if Daniel Kahneman had not won the Noble price, it would be widely accepted that money buys some happiness). Second, it is possible to ask global life-satisfaction question first without making information about specific aspects of life temporarily salient.  This simple procedure would ensure that well-being reports are more strongly influenced on chronically accessible information that reflects people’s life concerns.  After all, participants may draw on chronically accessible information or temporarily accessible information and if no relevant information was made temporarily accessible, respondents will use chronically accessible information.


Theoretically, the impact of a given piece of accessible information increases with its extremity and decreases with the amount and extremity of other information that is temporarily or chronically accessible at the time of judgment (see Schwarz and Bless 1992a). To test this assumption, Schwarz, Strack, and Mai ( 1991) asked respondents about their job satisfaction, leisure time satisfaction, and marital satisfaction prior to assessing their general life satisfaction, thus rendering a more varied set of information accessible. In this case, the correlation between marital satisfaction and life satisfaction increased from r = .32 (in the general-marital satisfaction order) to r = .46, yet this increase was less pronounced than the r = .67 observed when marital satisfaction was the only specific domain addressed.


This finding also suggests that strong effects of temporarily accessible information are highly context dependent. Just asking for satisfaction with several life-domains reduces the item order effect and with the small samples in Schwarz et al. (1991), the difference between r = .32 and r = .46 is not statistically significant, meaning it could be a chance finding.  So, their own research suggests that temporarily accessible information may typically have a small effect on life-satisfaction and this conclusion would be consistent with the evidence in the literature.


In light of these findings, it is important to highlight some limits for the emergence of question-order effects. First, question-order effects of the type discussed here are to be expected only when answering a preceding question increases the temporary accessibility of information that is not chronically accessible anyway…  Hence, chronically accessible current concerns would limit the size of any emerging effect, and the more they do so, the more extreme the implications of these concerns are.


Here the authors acknowledge that there are theoretical reasons why item-order effects should typically not have a strong influence on well-being reports.  One reason is that some information such as marital satisfaction is likely to be used even if marriage is not made salient by a preceding question.  It is therefore, not clear why marital satisfaction would produce a big increase from r = .32 to r = .67, as this would imply that numerous respondents do not consider their marriage when they made the judgment and it would explain why other studies found much weaker effects for item-order effects with marital satisfaction and higher correlations between marital satisfaction and life-satisfaction than r  =.32.  However, it is interesting that this important theoretical point is offered only as a qualification after presenting evidence from two studies that did show strong item-order effects. If the argument had been presented first, the question would arise why these studies did produce strong item-order effects and it would be evident that it is impossible to generalize from these specific studies to well-being reports in general.




“Complicating things further, information rendered accessible by a preceding question may not always be used.”


How is this complicating things further?  If there are ways to communicate to respondents that they should not be influenced by previous items (e.g., “Now on to another topic” or “take a moment to think about the most important aspects of your life”) and this makes context effects disappear, why don’t we just use the proper conversational norms to avoid these undesirable effects? And some surveys actually do this and we would therefore expect that they elicit valid reports of well-being that are not based on responses to previous questions in the survey.


In the above studies (Strack et al. 1988; Schwarz ct al. 1991), the conversational norm of nonrcdundancy was evoked by a joint lead-in that informed respondents that they would now be asked two questions pertaining to their well-being. Following this lead-in, they first answered the specific question (about dating frequency or marital satisfaction) and subsequently reported their general life satisfaction. In this case, the previously observed correlations of r = .66 between dating frequency and life satisfaction, or of r = .67 between marital satisfaction and life satisfaction, dropped to r = -15 and .18, respectively. Thus, the same question order resulted in dramatically different correlations, depending on the elicitation of the conversational norm of nonredundancy. 


The only evidence for these effects comes from a couple of studies by the authors.  Even if these results hold, they suggest that it should be possible to use conversational norms to get the same results for both item-orders if conversational norms suggest that participants should use all relevant chronically accessible information.  However, the authors did not conduct such as study. One reason may be that the prediction would be that there is no effect and researchers are only interested in using manipulations that show effects so that they can reject the null-hypothesis. Another explanation could be that Schwarz and Strack’s program of research on well-being reports was built on the heuristics and bias program in social psychology that is only interested in showing biases and ignores evidence for accuracy (Funder, 1987). The only result that is deemed relevant and worthy of publishing are experiments that successfully created a bias in judgments. The problem with this approach is that it cannot reveal that these judgments are also accurate and can be used as valid measures of well-being.




Judgments arc based on the subset of potentially applicable information that is chronically or temporarily accessible at the time.


Yes, it is not clear what else the judgments could be based on.


Accessible information, however, may not be used when its repeated use would violate conversational norms of nonredundancy.


Interestingly this statement would imply that participants are not influenced by subtle information (priming). The information has to be consciously accessible to determine whether it is relevant and only accessible information that is considered relevant is assumed to influence judgments.  This also implies that making information accessible that is not considered relevant will not have an influence on well-being reports. For example, asking people about their satisfaction with the weather or the performance of a local sports team does not lead to a strong influence of this information on life-satisfaction judgments because most people do not consider this information relevant (Schimmack et al., 2002). Once more, it is not clear how well-being reports can be highly context dependent, if information is carefully screened for relevance and responses are only made when sufficient relevant information was retrieved.




Suppose that an extremely positive (or negative) life event comes to mind. If this event is included in the temporary representation of the target “my life now,” it results in a more positive (negative) assessment of SWB, reflecting an assimilation effect, as observed in an increased correlation in the studies discussed earlier. However, the same event may also be used in constructing a standard of comparison, resulting in a contrast efficient: compared to an extremely positive (negative) event, one’s life in general may seem relatively bland (or pretty benign). These opposite influences of the same event are sometimes referred to as endowment (assimilation) and contrast effects (Tversky. and Griffin 1991).


This is certainly a possibility, but it not necessarily limited to temporarily accessible information.  A period in an individuals’ life may be evaluated relative to other periods in a person’s life.  In this way, subjective well-being is subjective. Objectively identical lives can be evaluated differently because past experiences created different ideals or comparison standards (see Cantril’s early work on human concerns).  This may happen for chronically accessible information just as much as for temporarily accessible information and it does not imply that well-being reports are invalid; it just shows that they are subjective.


Strack, Schwarz, and Gschneidingcr (1985, Experiment 1) asked respondents to report either three positive or three negative recent life events, thus rendering these events temporarily accessible.  As shown in the top panel of Table 1, these respondents reported higher current life satisfaction after they recalled three positive rather than negative recent events. Other respondents, however, had to recall events that happened at least five years before. These respondents reported higher current life satisfaction after recalling negative rather than positive past events. 


This finding shows that contrast effects can occur.  However, it is important to note that these context effects were created by the experimental manipulation.  Participants were asked to recall events from 5 years ago.  In the naturalistic scenario, where participants are simply asked to report “how is your life these days” participants are unlikely to suddenly recall events from 5 years ago.   Similarly, if you were asked about your happiness with your last vacation you are unlikely to recall earlier vacations and contrast your most recent vacation with it.  Indeed, Suh et al. (1996) showed that life-satisfaction judgments are influenced by recent events and that older events do not have an effect. They found no evidence for contrast effects when participants were not asked to recall events from the distant past.  So, this research shows what can happen in a specific context where participants were to recall extreme negative or positive from their past, but without prompting by an experimenter this context hardly ever would occur.  Thus, this study has no ecological or external validity for the question how participants actually make life-satisfaction judgments.


These experimental results are consistent with correlational data (Elder 1974) indicating that U.S. senior citizens, the “children of the Great Depression,” are more likely to report high subjective well-being the more they suffered under adverse economic conditions when they were adolescents. 


This finding again does not mean that elderly US Americans who suffered more during the Great Depression were actively thinking about the Great Depression when they answered questions about their well-being. It is more likely that they may have lower aspirations and expectations from life (see Easterlin). This means that we can interpret this result in many ways. One explanation would be that well-being judgments are subjective and that cultural and historic events can shape individuals’ evaluation standards of their lives.




In combination, the reviewed research illustrates that the same life event may affect judgments of SWB in opposite directions, depending on its use in the construction of the target “my life now” and of a relevant standard of comparison.


Again, the word “may” makes this statement true. Many things may happen, but that tells us very little about what actually is happening when respondents report on their well-being.  How past negative events can become positive events (a divorce was terrible, but it feels like a blessing after being happily remarried, etc.) and positive events can become negative events (e.g., the dream of getting tenure comes true, but doing research for life happens to be less fulfilling than one anticipated) is an interesting topic for well-being research, but none of these evaluative reversals undermine the usefulness of well-being measures. In fact, they are needed to reveal that subjective evaluations have changed and that past evaluations may have carry over effects on future evaluations.


It therefore comes as no surprise that the relationship between life events and judgments of SWB is typically weak. Today’s disaster can become tomorrow’s standard, making it impossible to predict SWB without a consideration of the mental processes that determine the use of accessible information.


Actually, the relationship between life-events and well-being is not weak.  Lottery winners are happier and accident victims are unhappier.  And cross-cultural research shows that people do not simply get used to terrible life circumstances.  Starving is painful. It does not become a normal standard for well-being reports on day 2 or 3.  Most of the time, past events simply lose importance and are replaced by new events and well-being measures are meant to cover a certain life period rather than an individual’s whole life from birth to death.  And because subjective evaluations are not just objective reports of life-events, they depend on mental processes. The problem is that a research program that uses experimental manipulations does not tell us about the mental processes that are underlying life-satisfaction judgments when participants are not manipulated.




Counterfactual thinking can influence affect and subjective well-being in several ways (see Roese 1997; Roese and Olson 1995b).


Yes, it can, it may, and it might, but the real question is whether it does influence well-being reports and if so, how it influences these reports.


For example, winners of Olympic bronze medals reported being more satisfied than silver medalists (Medvec, Madey, and Gilovich 1995), presumably because for winners of bronze medals, it is easier to imagine having won no medal at all (a “downward counterfactual”), while for winners of silver medals, it is easier to imagine having won the gold medal (an “upward counterfactual”).


This is not an accurate summary of the article that contained three studies.  Study 1 used ratings of video clips of Olympic medalists immediately after the event (23 silver & 18 bronze medalists).  The study showed a strong effect that bronze medalists were happier than silver medalists, F(1,72) = 18.98.  The authors also noted that in some events the silver medal means that an athlete lost a finals match, whereas in other events they just placed second in a field of 8 or more athletes.  An analysis that excluded final matches showed weaker evidence for the effect, F(1,58) = 6.70.  Most important, this study did not include subjective reports of satisfaction as claimed in the review article. Study 2 examined interviews of 13 silver and 9 bronze medalists.  Participants in Study 2 rated interviews of silver medal athletes to contain more counter-factual statements (e.g., I almost), t(20) = 2.37, p< ,03.  Importantly, no results regarding satisfaction are reported. Study 3 actually recruited athletes for a study and had a larger sample size (N = 115). Participants were interviewed by the experimenters after they won a silver or bronze medal at an athletic completion (not the Olympics).   The description of the procedure is presented verbatim here.


Procedure. The athletes were approached individually following their events and asked to rate their thoughts about their performance on the same 10-point scale used in Study 2. Specifically, they were asked to rate the extent to which they were concerned with thoughts of “At least I. . .” (1) versus “/ almost” (10). Special effort was made to ensure that the athletes understood the scale before making their ratings. This was accomplished by mentioning how athletes might have different thoughts following an athletic competition, ranging from “I almost did better” to “at least I did this well.”


What is most puzzling about this study is why the experiments seemingly did not ask questions about emotions or satisfaction with performance.  It would have taken only a couple of questions to obtain reports that speak to the question of the article whether winning a silver medal is subjectively better than winning a bronze medal.  Alas, these questions are missing. The only result from Study 3 is “as predicted, silver medalists’ thoughts following the competition were more focused on “I almost” than were bronze medalists’.  Silver medalists described their thoughts with a mean rating of 6.8 (SD = 2.2), whereas bronze medalists assigned their thoughts an average rating of 5.7 (SD = 2.7), t(113) = 2.4, p < .02.


In sum, there is no evidence in this study that winning an Olympic silver medal or any other silver medal for that matter makes athletes less happy than winning a bronze medal. The misrepresentation of the original study by Schwarz and Strack is another example of unscientific practices that can lead to the fabrication of false facts that are difficult to correct and can have a lasting negative effect on the creation of a cumulative science.


In summary, judgments of SWB can be profoundly influenced by mental constructions of what might have been.


This statement is blatantly false. The cited study on medal winners does not justify this claim and thre is no scientific basis for the claim that these effects are profound.


In combination, the discussion in the preceding sections suggests that nearly any aspect of one’s life can be used in constructing representations of one’s “life now” or a relevant standard, resulting in many counterintuitive findings.


A collection of selective findings that were obtained using different experimental procedures does not mean that well-being reports obtained under naturalistic conditions produce many counterintuitive findings, nor is there any evidence that they do produce many counterintuitive findings.  This statement lacks any empirical foundation and is inconsistent with other findings in the well-being literature.


Common sense suggests that misery that lasts for years is worse than misery that lasts only for a few days.


Indeed. Extended periods of severed depression can drive some people to attempt suicide. A week with the flu does not. Consistent with this common sense observation, well-being reports of depressed people are much lower than those of other people, once more showing that well-being reports often produce results that are consistent with intuitions.


Recent research suggests, however, that people may largely neglect the duration of the episode, focusing instead on two discrete data points, namely, its most intense hedonic moment (“peak”) and its ending (Fredrickson and Kahneman 1993; Varey and Kahneman 1992). Hence, episodes whose worst (or best) moments and endings are of comparable intensity are evaluated as equally (un)pleasant, independent of their duration (for a more detailed discussion, see Kahneman, this volume).


Yes, but this research focusses on brief episodes with a single emotional event.  It is interesting that duration of episodes seems to matter very little, but life is a complex series of events and episodes. Having sex for 20 minutes or 30 minutes may not matter, but having sex regularly, at least once a week, does seem to matter for couples’ well-being.  As Diener et al. (1985) noted, it is the frequency, not the intensity (or duration) of positive and negative events in people’s lives that matters.


Although the data are restricted to episodes of short duration, it is tempting to speculate about the possible impact of duration neglect on the evaluation of more extended episodes.


Yes, interesting, but this statement clearly indicates that the research on duration neglect is not directly relevant for well-being reports.


Moreover, retrospective evaluations should crucially depend on the hedonic value experienced at the end of the respective episode.


This is a prediction not a fact. I have actually examined this question and found that frequency of positive and negative events has a stronger influence on satisfaction judgments with a day than how respondents felt at the end of the day when they reported daily satisfaction.




As our selective review illustrates, judgments of SWB are not a direct function of one’s objective conditions of life and the hedonic value of one’s experiences.


First, it is great that the authors acknowledge here that their review is selective.  Second, we do not need a review to know that subjective well-being is not a direct function of objective life conditions. The whole point of subjective well-being reports is to allow respondents to evaluate these events from their own subjective point of view.  And finally, at no point has this selective review shown that these reports do not depend on the hedonic value of one’s experiences. In fact, measures of hedonic experiences are strong predictors of life-satisfaction judgments (Schimmack et al., 2002; Lucas et al., 1996; Zou et al., 2012).


Rather they crucially depend on the information that is accessible at the time of judgment and how this information is used in constructing mental representations of the to-be-evaluated episode and a relevant standard.


This factual statement cannot be supported by a selective review of the literature. You cannot say, my selective review of creationist literature shows that evolution theory is wrong.  You can say that a selective review of creationist literature would suggest that evolution theory is wrong, but you cannot say that it is wrong. To make scientific statements about what is (highly probable to be) true and what is (highly probable to be) false, you need to conduct a review of the evidence that is not selective and not biased.


As a result of these construal processes, judgments of SWB are highly malleable and difficult to predict on the basis of objective conditions. 


This is not correct.  Evaluations do not directly depend on objective conditions. This is not a feature of well-being reports but a feature of evaluations.  At the same time, the construal processes that related objective events to subjective well-being are systematic, predictable, and depend on chronically accessible and stable information.  Well-being reports are highly correlated with objective characteristics of nations, bereavement, unemployment, and divorce have negative effects on well-being and winning the lottery, marriage, and remarriage have positive effects on well-being.  Schwarz and Strack are fabricating facts. This is not considered fraud. Only data manipulation and fabricating data is considered scientific fraud, but this does not mean that fabricated facts are less harmful than fabricated data.  Science can only provide a better understanding if it is based on empirically verified and replicable facts. Simply stating ‘judgments of SWB are difficult to predict” without providing any evidence for this claim is unscientific.




The causal impact of comparison processes has been well supported in laboratory experiments that exposed respondents to relevant comparison standards…For example, Strack and his colleagues (1990) observed that the mere presence of a handicapped confederate was sufficient to increase reported SWB under self-administered questionnaire conditions, presumably because the confederate served as a salient standard of comparison….As this discussion indicates, the impact of social comparison processes on SWB is more complex than early research suggested. As far as judgments of global SWB are concerned, we can expect that exposure to someone who is less well off will usually result in more positive-and to someone who is better off in more negative assessments of one’s own life.  However, information about the other’s situation will not always be used as a comparison standard.

The whole section about social comparison does not really address the question of the influence of social comparison effects on well-being reports.  Only a single study with a small sample is used to provide evidence that respondents may engage in social comparison processes when they report their well-being.  The danger of this occurring in a naturalistic context is rather slim.  Even in face-to-face interviews, the respondent is likely to have answered several questions about themselves and it seems far-fetched that they would suddenly think about the interviewer as a relevant comparison standard, especially if the interviewer does not have a salient characteristic like a disability that may be considered relevant. Once more the authors generalize from one very specific laboratory experiment to the naturalistic context in which SWB reports are normally made without considering the possibility that the experimental results are highly contextual sensitive and do not reveal how respondents normally judge their lives.

[Standards Provided by the Social Environment]

In combination, these examples draw attention to the possibility that salient comparison standards in one’s immediate environment, as well as socially shared norms, may constrain the impact of fortuitous temporary influences. At present, the interplay of chronically and temporarily accessible standards on judgments of SWB has received little attention. The complexities that are likely to result from this interplay provide a promising avenue for future research.

Here the authors acknowledge that their program of research is limited and fails to address how respondents use chronically accessible information. They suggest that this is a promising avenue for future research, but they fail to acknowledge why they haven’t conducted studies that start to address this question. The reason is that their research program with experimental manipulations of the situation doesn’t allow to study the use of chronically accessible information.  The use of information that by definition comes to mind spontaneously independent of researchers’ experimental manipulations is a blind-spot of the experimental approach.

[Interindividual Standards Implied by the Research Instrument]

Finally, we extend our look at the influences of the research instrument by addressing a frequently overlooked source of temporarily accessible comparison information…As numerous studies have indicated (for a review, see Schwarz 1996), respondents assume that the list of response alternatives reflects the researcher’s knowledge of the distribution of the behavior: they assume that the “average” or “usual” behavioral frequency is represented by values in the middle range of the scale, and that the extremes of the scale correspond to the extremes of the distribution. Accordingly, they use the range of the response alternatives as a frame of reference in estimating their own behavioral frequently, resulting in different estimates of their own behavioral frequency, as shown in table 4.2. More important for our present purposes, they further extract comparison information from their low location on the scale…Similar findings have been obtained with regard to the frequency of physical symptoms and health satisfaction (Schwarz and Scheuring 1992), the frequency of sexual behaviors and marital satisfaction (Schwarz and Scheuring 1988), and various consumer behaviors (Menon, Raghubir, and Schwarz 1995).

One study is in German and not available.  I examined the study by Schwarz and Scheurig (1988) in European Journal of Social Psychology.   Study 1 had four conditions with n = 12 or 13 per cell (N = 51).  The response format varied frequencies so that having sex or masturbating once a week was either a high or low frequency occurrence.  Subsequently, participants reported their relationship satisfaction. The relationship satisfaction ratings were analyzed with an ANOVA.  “Analysis of variance indicates a marginally reliable interaction of both experimental variables, F(1,43) = 2.95, p < 0.10, and no main effects.”  The result is not significant by conventional standards and the degrees of freedom show that some participants were excluded from this analysis without further mentioning of this fact.  Study 2 manipulated the response format for frequency of sex and masturbation within subject. That is, all subjects were asked to rate frequencies of both behaviors in four different combinations. There were n = 16 per cell, N = 64. No ANOVA is reported presumably because it was not significant. However, a PLANNED contrast between the high sex/low masturbation and the low/sex high masturbation group showed a just significant result, t(58) = 2.17, p = .034. Again, the degrees of freedom do not match sample size. In conclusion, the evidence that subtle manipulations of response formats can lead to social comparison processes that influence well-being reports is not conclusive. Replication studies with larger samples would be needed to show that these effects are replicable and to determine how strong these effects are.

In combination, they illustrate that response alternatives convey highly salient comparison standards that may profoundly affect subsequent evaluative judgments.

Once, more the word “may” makes the statement true in a trivial sense that many things may happen. However, there is no evidence that these effects actually have profound effects on well-being reports, and the existing studies show statistically weak evidence and provide no information about the magnitude of these effects.

Researchers are therefore well advised to assess information about respondents’ behaviors or objective conditions in an open-response format, thus avoiding the introduction of comparison information that respondents would not draw on in the absence of the research instrument.

There is no evidence that this would improve the validity of frequency reports and research on sexual frequency shows similar results with open and closed measures of sexual frequency (Muise et al., 2016).


In summary, the use of interindividual comparison information follows the principle of cognitive accessibility that WC have highlighted in our discussion of intraindividual comparisons. Individuals often draw on the comparison information that is rendered temporarily accessible by the research instrument or the social context in which they form the judgment, although chronically accessible standards may attenuate the impact of temporarily accessible information.

The statement that people often rely on interpersonal comparison standards is not justified by the research.  By design, experiments that manipulate one type of information and make it salient cannot determine how often participants use this type of information when it is not made salient.


In the preceding sections, we considered how respondents use information about their own lives or the lives of others in comparison-based evaluation strategies. However, judgments of well-being are a function not only of what one thinks about but also of how one feels at the time of judgment.

Earlier, the authors stated that respondents are likely to use a minimum of information that is deemed sufficient. “Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987)”  Now we are supposed to believe that they use intrapersonal and interpersonal information that is temporarily and chronically accessible and their feelings.  That is a lot of information and it is not clear how all of this information is combined into a single judgment. A more parsimonious explanation for the host of findings is that each experiment carefully created a context that made respondents use the information that the experimenters wanted respondents to use to confirm the hypothesis that they use this information. The problem is that this only shows that a particular source of information may be used in one particular context. It does not mean that all of these sources of information are used and need to be integrated into a single judgment under naturalistic conditions. The program of research simply fails to address the question which information respondents actually use when they are asked to judge their well-being in a normal context.

A wide range of experimental data confirms this intuition. Finding a dime on a copy machine (Schwarz 1987), spending time in a pleasant rather than an unpleasant room (Schwarz ct al. 1987, Expcrimcnt 2), or watching the German soccer team win rather than lose a championship game (Schwarz et al. 1987, Experiment 1) all resulted in increased reports of happiness and satisfaction with one’s life as a whole…Experimental evidence supports this assumption. For example, Schwarz and Clore (1983, Experiment 2) called respondents on sunny or rainy days and assessed reports of SWB in telephone interviews. As expected, respondents reported being in a better mood, and being happier and more satisfied with their life as a whole, on sunny rather than on rainy days. Not so, however, when respondents’ attention was subtly drawn to the weather as a plausible cause of their current feelings.

The problem is that all of the cited studies were conducted by Schwarz and that other studies that produced different results are not mentioned.  The famous weather study has recently been called into question.  However, the weather effect on life-satisfaction judgments is not ideal because weather effects on mood are not very strong either.  Respondents in sunny California do not report higher life-satisfaction than respondents in Ohio (Schkade & Kahneman), and several large scale studies have now failed to replicate the famous weather effect on well-being reports (Lucas & Lawless, 2013; Schmiedeberg, 2014).

On theoretical grounds, we may assume that people are more likely to use the simplifying strategy of consulting their affective state the more burdensome it would be to form a judgment on the basis of comparison information.

Here it is not clear why it would be burdensome to make global life-satisfaction judgments. The previous chapters suggested that respondents have access to large amount of chronically and temporarily information that they apparently used in the previous studies. Suddenly, it is claimed that retrieving relevant information is too hard and mood is used. It is not clear why respondents would consider their current mood sufficient to evaluate their lives, especially if inconsistent accessible information also comes to mind.

Note in this regard that evaluations of general life satisfaction pose an extremely complex task that requires a large number of comparisons along many dimensions with ill-defined criteria and the subsequent integration of the results of these comparisons into one composite judgment. Evaluations of specific life domains, on the other hand, are often less complex.

If evaluations of specific life domains are less complex and global questions are just an average of specific domains, it is not clear why it would be so difficult to evaluate satisfaction in a few important life domains (health, family, work) and integrate this information.  The hypothesis that mood is only used as a heuristic for global well-being reports also suggests that it would be possible to avoid the use of this heuristic by asking participants to report satisfaction with specific life domains. As these questions are supposed to be easier to answer, participants would not use mood. Moreover, preceding items are less likely to make information accessible that is relevant for a specific life domain.  For example, a dating question is irrelevant for satisfaction with academic or health satisfaction.  Thus, participants are most likely to draw on chronically accessible information that is relevant for answering a question about satisfaction with specific domains. It follows that averages of domain satisfaction judgments would be more valid than global judgments, if participants were relying on mood to judge global judgments. For example, finding a dime would make people judge their lives more positively, but not their health, social relationships, and income.  Thus, many of the alleged problems with global well-being reports could be avoided by asking for domain specific reports and then aggregate them (Andrews & Whithey, 1976; Zou et al., 2013).

If judgments of general well-being are based on respondents’ affective state, whereas judgments of domain satisfaction are based on comparison processes, it is conceivable that the same event may influence evaluations of one’s life as a whole and evaluations of specific domains in opposite directions. For example, an extremely positive event in domain X may induce good mood, resulting in reports of increased global SWB. However, the same event may also increase the standard of comparison used in evaluating domain X, resulting in judgments of decreased satisfaction with this particular domain. Again, experimental evidence supports this conjecture. In one study (Schwarz ct al. 1987, Experiment 2), students were tested in cither a pleasant or an unpleasant room, namely, a friendly office or a small, dirty laboratory that was overheated and noisy, with flickering lights and a bad smell. As expected, participants reported lower general life satisfaction in the unpleasant room than in the pleasant room, in line with the moods induced by the experimental rooms. In contrast, they reported higher housing satisfaction in the unpleasant than in the pleasant room, consistent with the assumption that the rooms served as salient standards of comparison.

The evidence here is a study with 22 female students assigned to two conditions (n = 12 and 10 per condition).  The 2 x 2 ANOVA with room (pleasant vs. unpleasant) and satisfaction judgment (life vs. housing) produced a significant interaction of measure and room, F(1,20) = 7.25, p = .014.  The effect for life-satisfaction was significant, F(1,20) = 8.02, p = .010 (reported as p < .005), and not significant for housing satisfaction, F(1,20) = 1.97, p = .18 (reported as p < .09 one-tailed).

This weak evidence in a single study with a very small sample is used to conclude that life-satisfaction judgments and domain satisfaction judgments may diverge.  However, numerous studies have shown high correlations between average domain satisfaction judgments and global life-satisfaction judgments (Andrews & Whithey, 1976; Schimmack & Oishi, 2005; Zou et al., 2013).  This finding cannot occur if respondents use mood for life-satisfaction judgments and other information for domain satisfaction judgments.  Yet readers are not informed about this finding that undermines Schwarz and Stracks’ model of well-being reports and casts doubt on the claim that the same information has opposite effects on global life-satisfaction judgments and domain specific judgments. This may happen in highly artificial laboratory conditions, but it does not happen often in normal survey contexts.

The Relative Salience of Mood and Competing Information

If recalling a happy or sad life event elicits a happy or sad mood at the time of recall, however, respondents are likely to rely on their feelings rather than on recalled content as a source of information. This overriding impact of current feelings is likely to result in mood-congruent reports of SWB, independent of the mental construal variables discussed earlier. The best evidence for this assumption comes from experiments that manipulated the emotional involvement that subjects experienced while thinking about past life events.

This section introduces a qualification of the earlier claim that recall of events in the remote past leads to a contrast effect.  Here the claim is that recalling a positive event from the remote past (a happy time with a deceased spouse) will not lead to a contrast effect (intensify dissatisfaction of a bereaved person), if the recall of the event triggers an actual emotional experiences (My life these days is good because I feel good when I think about the good times in the past).  The problem with this theory is that it is inconsistent with the earlier claims that people will discount their current feelings if they think they are irrelevant. If respondents do not use mood to judge their lives when they attribute it to the weather, it is not clear why they would use their feelings if they are triggered by recall of an emotional event from their past?  Why would a widower evaluate his current life as a widower more favorably when he is recalling the good times with his wife?

Even if this were a reliable finding, it would be practically irrelevant for actual ratings of life-satisfaction because respondents are unlikely to recall specific events in sufficient detail to elicit strong emotional reactions.  The studies that demonstrated the effect instructed participants to do so, but under normal circumstances participants make judgments very quickly often without recall of detailed, specific emotional episodes.  In fact, even the studies that showed these effects showed only weak evidence that recall of emotional events had notable effects on mood (Strack et al.. 1985).


Self-presentation and social desirability concerns may arise at the reporting stage, and respondents may edit their private judgment before they communicate it

True. All subjective ratings are susceptible to reporting styles. This is why it is important to corroborate self-ratings of well-being with other evidence such as informant ratings of well-being.  However, the problem of reporting biases would be irrelevant, if the judgment without these biases is already valid. A large literature on reporting biases in general shows that these biases account for a relatively small amount of the total variance in ratings. Thus, the key question remains whether the remaining variance provides meaningful information about respondents’ subjective evaluations of their lives or whether this variance reflects highly unreliable and context-dependent information that has no relationship to individuals’ subjective well-being.


Figure 4.2 summarizes the processes reviewed in this chapter. If respondents are asked to report their happiness and satisfaction with their “life as a whole,” they are likely to base their judgment on their current affective state; doing so greatly simplifies the judgmental task.

As noted before, this would imply that global well-being reports are highly unstable and strongly correlated with measures of current mood, but the empirical evidence does not support these predictions.  Current mood has a small effect on global well-being reports (Eid & Diener, 2004) and they are highly stable (Schimmack & Oishi, 2005) and predicted by personality traits even when these traits are measured a decade before the well-being reports (Costa & McCrae, 1980).

If the informational value of their affective state is discredited, or if their affective state is not pronounced and other information is more salient, they are likely to use a comparison strategy. This is also the strategy that is likely to be used for evaluations of less complex specific life domains.

Schwarz and Strack’s model would allow for weak mood effects. We only have to make the plausible assumption that respondents often have other information to judge their lives and that they find this information more relevant than their current feelings.  Therefore, this first stage of the judgment model is consistent with evidence that well-being judgments are only weakly correlated with mood and highly stable over time.

When using a comparison strategy, individuals draw on the information that is chronically or temporarily most accessible at that point in time. 

Apparently the term “comparison strategy” is now used to refer to the retrieval of any information rather than an active comparison that takes place during the judgment process.  Moreover, it is suddenly equally plausible that participants draw on chronically accessible information or on temporarily accessible information.  While the authors did not review evidence that would support the use of chronically accessible information, their model clearly allows for the use of chronically accessible information.

Whether information that comes to mind is used in constructing a representation of the target  “my life now” or a representation of a relevant standard depends on the variables that govern the use of information in mental construal (Schwarz and Bless 1992a; Strack 1992). 

This passage suggests that participants have to go through the process of evaluating their live each time when they are asked to make a well-being report. They have to construct what their live is like, what they want from life, and make a comparison. However, it is also possible that they can draw on previous evaluations of life domains (e.g., I hate my job, I am healthy, I love my wife, etc.). As life-satisfaction judgments are made rather quickly within a few seconds, it seems more plausible that some pre-established evaluations are retrieved than to assume that complex comparison processes are being made at the time of judgments.

If the accessibility of information is due to temporary influences, such as preceding questions in a questionnaire, the obtained judgment is unstable over time and a different judgment will be obtained in a different context.

This statement makes it obvious that retest correlations provide direct evidence on the use of temporarily accessible information.  Importantly, low retest stability could be caused by several factors (e.g. random responding).  So, we cannot verify that participants rely on temporarily accessible information when retest correlations are low. However, we can use high retest stability to falsify the hypothesis that respondents rely heavily on temporarily accessible information because the theory makes the opposite prediction.  It is therefore highly relevant that retest correlations show high temporal consistency in global well-being reports.  Based on this solid empirical evidence we can infer that responses are not heavily influenced by temporarily accessible information (Schimmack & Oishi, 2005).

On the other hand, if the accessibility of information reflects chronic influences such as current concerns or life tasks, or stable characteristics of the social environment, the judgment is likely to be less context dependent.

This implies that high retest correlations are consistent with the use of chronically accessible information, but high retest correlations do not prove that participants use chronically accessible information. It is also possible that stable variance is due to reporting styles. Thus, other information is needed to test the use of chronically accessible information. For example, agreement in well-being reports by several raters (self, spouse, parent, etc.) cannot be attributed to response styles and shows that different raters rely on the same chronically accessible information to provide well-being reports (Schneider & Schimmack, 2012).

The size of context-dependent assimilation effects increases with the amount and extremity of the temporarily accessible information that is included in the representation of the target. 

This part of the model would explain why experiments and naturalistic studies often produce different results. Experiments make temporarily accessible information extremely salient, which may lead participants to use it. In contrast, such extremely salient information is typically absent in naturalistic studies, which explains why chronically accessible information is used. The results are only inconsistent if results from experiments with extreme manipulations are generalized to normal contexts without these extreme conditions.


Our review emphasizes that reports of well-being are subject to a number of transient influences. 

This is correct. The review emphasized evidence from the authors’ experimental research that showed potential threats to the validity of well-being judgments. The review did not examine how serious these threats are for the validity of well-being judgments.

Although the information that respondents draw on reflects the reality in which they live, which aspects of this reality they consider and how they use these aspects in forming a judgment is profoundly influenced by features of the research instrument.

This statement is blatantly false.  The reviewed evidence suggests that the testing situation (a confederate, a room) or an experimental manipulation (recall positive or negative events) can influence well-being reports. There was very little evidence that the research instrument influenced well-being reports and there was no evidence that these effects are profound.

[Implications for Survey Research]

The reviewed findings have profound methodological implications.

This is wrong. The main implication is that researchers have to consider a variety of potential threats to the validity of well-being judgments. All of these threats can be reduced and many survey studies do take care to avoid some of these potential problems.

First, the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information.

As noted earlier, this was only true in two studies by the authors. Other studies do not replicate this finding.

Moreover, questionnaire design variables, like the presence or absence of a joint lead-in to related questions, determine how respondents use the information that comes to mind. As a result, mean reported well-being may differ widely, as seen in many of the reviewed examples

The dramatic shifts in means are limited to experimental studies that manipulated lead-ins to demonstrate these effects. National representative surveys show very similar means year after year.

Moreover, the correlation between an objective condition of life (such as dating frequency) and reported SWB can run anywhere from r = – .l to r = .6, depending on the order in which the same questions are asked (Strack et al. 1988), suggesting dramatically different substantive conclusions.

Moreover?  This statement just repeats the first false claim that question order has profound effects on life-satisfaction judgments.

Second, the impact of information that is rendered accessible by preceding questions is attenuated the more the information is chronically accessible (see Schwarz and Bless 1992a).

So, how can we see pronounced item-order effects for marital satisfaction if marital satisfaction is a highly salient and chronically accessible aspects of married people’s lives? So, this conclusion directly undermines the previous claim that item-order has profound effects.

Third, the stability of reports of SWB over time (that is, their test-retest reliability) depends on the stability of the context in which they are assessed. The resulting stability or change is meaningful when it reflects the information that respondents spontaneously consider because the same, or different, concerns are on their mind at different points in time. 

There is no support for this claim. If participants draw on chronically accessible information, which the authors’ model allows, the judgments do not depend on the stability of the context because chronically accessible information is by definition context-independent.

Fourth, in contrast to influences of the research instrument, influences of respondents’ mood at the time of judgment are less likely to result in systematic bias. The fortuitous events that affect one respondent’s mood are unlikely to affect the mood of many others.

This is true, but it would still undermine the validity of the judgments.  If participants rely on their current mood, variation in these responses will be unreliable and unreliable measures are by definition invalid. Moreover, the average mood of participants during the time of a survey is also not a valid measure of average well-being. So, even though mood effects may not be systematic, they would undermine the validity of well-being reports. Fortunately, there is no evidence that mood has a strong influence on these judgments, while there is evidence that participants draw on chronically accessible information from important life domains (Schimmack & Oishi, 2005).

Hence, mood effects are likely to introduce random variation.

Yes, this is a correct prediction, but evidence contradicts this prediction, and the correct conclusion is that mood does not introduce a lot of random variation in well-being reports because it is not heavily used by respondents to evaluate their lives or specific aspects of their lives.

Fifth, as our review indicates, there is no reason to expect strong relationships between the objective conditions of life and subjective assessments of well-being under most circumstances.

There are many reasons not to expect strong correlations between life-events and well-being reports. One reason is that a single event is only a small part of a whole life and that few life events have such dramatic effects on life-satisfaction that they make any notable contribution to life-satisfaction judgments.  Another reason is that well-being is subjective and the same life event can be evaluated differently by different individuals. For example, the publication of this review in a top journal in psychology would have different effects on my well-being and on the well-being of Schwarz and Strack.

Specifically, strong positive relationships between a given objective aspect of life and judgments of SWB are likely to emerge when most respondents include the relevant aspect in the representation that they form of their life and do not draw on many other aspects. This is most likely to be the case when (a) the target category is wide (“my life as a whole”) rather than narrow (a more limited episode, for example); (b) the relevant aspect is highly accessible; and (c) other information that may be included in the representation of the target is relatively less accessible. These conditions were satisfied, for example, in the Strack, Martin, and Schwarz (1988) dating frequency study, in which a question about dating frequency rendered this information highly accessible, resulting in a correlation of r = .66 with evaluations of the respondent’s life as a whole. Yet, as this example illustrates, we would not like to take the emerging correlation seriously when it reflects only the impact of the research instrument, as indicated by the fact that the correlation was r = – .l if the question order was reversed.

The unrepresentative extreme result from Strack’s study is used again as evidence, when other studies do not show the effect (Schimmack & Oishi, 2005).

Finally, it is worth noting that the context effects reviewed in this chapter limit the comparability of results obtained in different studies. Unfortunately, this comparability is a key prerequisite for many applied uses of subjective social indicators, in particular their use in monitoring the subjective side of social change over time (for examples see Campbell 198 1; Glatzer and Zapf 1984).

This claim is incorrect. The experimental demonstrations of effects under artificial conditions that were needed to manipulate judgment processes do not have direct implications for the way participants actually judge well-being and the authors model allows for chronically accessible information to have a strong influence on these judgments under less extreme and less artificial conditions, and the authors model makes predictions that are disconfirmed by evidence of high stability and low correlations with mood.

Which Measures Are We to Use?

By now, most readers have probably concluded that there is little to be learned from self-reports of global well-being being.

If so, the authors succeeded with their biased presentation of the evidence to convince readers that these reports are highly susceptible to a host of context effects that make the outcome of the judgment process essentially unpredictable. Readers would be surprised to learn that well-being reports of twins who never met are positively correlated (Lykken & Tellgen, 1996).

Although these reports do reflect subjectively meaningful assessments, what is being assessed, and how, seems too context dependent to provide reliable information about a population’s well- being, let alone information that can guide public policy (but see Argyle, this volume, for a more optimistic take).

The claim that well-being reports are too context dependent to provide reliable information about a population’s well-being is false for several reasons.  First, the authors did not show that well-being reports are context dependent. They showed that with very extreme manipulations in highly contrived and unrealistic contexts, judgments moved around statistically significantly in some studies.  They did not show that these shifts are large, as it would require larger samples to estimate effect sizes. They did not show that these effects have a notable influence on well-being reports in actual surveys of populations well-being.  And finally, they already pointed out that some of these effects (e.g., mood effects) would only add random noise, which would lower the reliability of individuals’ well-being reports, but when aggregated across responses would not alter the mean of a sample. And last, but not least, the authors blatantly ignore evidence (reviewed in this volume by Diener and colleagues) that variation across nationally representative samples shows highly reliable variation across populations in different nations that are highly correlated with objective life circumstances that are correlated with nations’ wealth.

In short, Schwarz and Strack’s claims are not scientifically founded and merely express the authors’ pessimistic take on the validity of well-being reports.  This pessimistic view is a direct consequence of a myopic focus on laboratory experiments that were designed to invalidate well-being reports and ignoring evidence from actual well-being surveys that are more suitable to examine the reliability and validity of well-being reports when well-being reports are provided under naturalistic conditions.

As an alternative approach, several researchers have returned to Bentham’s ( 1789/1948) notion of happiness as the balance of pleasure over pain (for examples, SW Kahneman, this volume; Parducci 1995).

This statement ignores the important contribution of Diener (1984) who argued that the concept of well-being may consist of life evaluations as well as the balance of pleasure over pain or Positive Affect and Negative Affect, as these constructs are called in contemporary psychology. As a result of Diener’ s (1984) conception of well-being as a construct with three components, researchers have routinely measured global life-evaluations along with measures of positive and negative affect. A key finding is that these measures are highly correlated, although not perfectly identical (Lucas et al., 1996; Zou et al., 2013).  Schwarz and Strack ignore this evidence, presumably because it would undermine their view that global life-satisfaction judgments are highly context sensitive and that measures of positive and negative affect could produce notably different results.


In conclusion, Schwarz and Strack’s (1999) chapter is a prototypical example of several bad scientific practices.  First, the authors conduct a selective review of the literature that focuses on one specific paradigm and ignores evidence from other approaches.  Second, the review focuses strongly on original studies conducted by the authors themselves and ignores studies by other researchers that produced different results. Third, the original studies are often obtained with small samples and there are no independent replications by other researchers, but the results are discussed as if they are generalizable.  Fourth, life-satisfaction judgments are influenced by a host of factors and any study that focuses on one possible predictor of these judgments is likely to account for only a small amount of the variance. Yet, the literature review does not take effect sizes into account and the theoretical model overemphasizes the causes that were studied and ignores causes that were not studied.  Fifth, the experimental method has the advantage of isolating single causes, but it has the disadvantage that results cannot be generalized to ecologically valid contexts in which well-being reports are normally obtained. Nevertheless, the authors generalize from artificial experiments to the typical survey context without examining whether their predictions are confirmed.  Finally, the authors make broad and profound claims that do not logically follow from their literature review. They suggest that decades of research with global well-being reports can be dismissed because the measures are unreliable, but these claims are inconsistent with a mountain of evidence that shows the validity of these measures that the authors willfully ignore (Diener et al., 2009).

Unfortunately, the claims in this chapter were used by Noble Laureate Daniel Kahneman as arguments to push for an alternative conception and measurement of well-being.  In combination, the unscientific review of the literature and the political influence of a Noble price has had a negative influence on well-being science.  The biggest damage to the field has been the illusion that the processes underlying global well-being reports are well-understood. In fact, we know very little how respondents make these judgments and how accurate these judgments are.  The chapter lists a number of possible threats to the validity of well-being reports, but it is not clear how much these threats actually undermine the validity of well-being reports and what can be done to reduce biases in these measures to improve their validity.  A program that relies exclusively on experimental manipulations that create biases in well-being reports is unable to answer these questions because well-being judgments can be made in numerous ways and results that are obtained in artificial laboratory contexts may or may not generalize to the context that is most relevant, namely when well-being reports are used to measure well-being.

What is needed is a real scientific program of research that examines accuracy and biases in well-being reports and creates well-being measures that maximize accuracy and minimize biases. This is what all other sciences do when they develop measures of theoretical important constructs. It is time for well-being researchers to act like a normal science. To do so, research on well-being reports needs a fresh start and needs an objective and scientific review of the empirical evidence regarding the validity of well-being measures.