Dr. R’s Blog about Replicability

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY:  In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the original study using the same sample size and significance criterion.
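
For readers who prefer a computational version of this definition: the probability that an exact replication (same n, same alpha) is significant again is simply the statistical power of the original design at the true effect size. A minimal sketch in Python (the two-sample t-test, effect size, and sample size below are purely illustrative):

```python
# Minimal sketch: under the definition above, the probability that an exact
# replication (same n, same alpha) yields a significant result again equals
# the power of the design at the true effect size. Values are illustrative.
from scipy import stats

def replication_probability(d, n_per_group, alpha=0.05):
    """Power of a two-sample t-test with true standardized effect size d."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5        # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)   # two-tailed critical value
    # probability of a significant result under the noncentral t distribution
    return stats.nct.sf(t_crit, df, ncp) + stats.nct.cdf(-t_crit, df, ncp)

print(replication_probability(d=0.4, n_per_group=50))  # roughly .5
```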

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

New (September 17, 2016):  Estimating Replicability: abstract and link to a manuscript on statistical methods that can be used to estimate replicability on the basis of published test statistics.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

TOP TEN LIST

1. 2015 Replicability Rankings of over 100 Psychology Journals
Based on reported test statistics in all articles from 2015, the rankings show the typical strength of evidence for a statistically significant result in particular journals.  The method also estimates the file drawer of unpublished non-significant results.  Links to powergraphs provide further information (e.g., whether a journal has too many just-significant results, p < .05 & p > .025).

2. A (preliminary) Introduction to the Estimation of Replicability for Sets of Studies with Heterogeneity in Power (e.g., Journals, Departments, Labs)
This post presents the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal.  The post explains the new method for estimating observed power based on the distribution of test statistics converted into absolute z-scores.  The method has since been extended to cover a wider range of z-scores with a model that allows for heterogeneity in power across tests.  A description of the extended method will be published when extensive simulation studies are completed.
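
For readers who want to see the conversion step in concrete terms, here is a minimal sketch (not the code used for the rankings) of how reported test statistics can be mapped onto absolute z-scores via their two-tailed p-values:

```python
# Sketch (not the authors' code) of the conversion step described above:
# each reported test statistic is turned into a two-tailed p-value and then
# into the absolute z-score that has the same p-value.
from scipy import stats

def t_to_abs_z(t_value, df):
    p = 2 * stats.t.sf(abs(t_value), df)   # two-tailed p-value
    return stats.norm.isf(p / 2)           # |z| with the same p-value

def f_to_abs_z(f_value, df1, df2):
    p = stats.f.sf(f_value, df1, df2)
    return stats.norm.isf(p / 2)

# Example: t(40) = 2.5 and F(1, 60) = 7.3 expressed on a common |z| metric
print(t_to_abs_z(2.5, 40), f_to_abs_z(7.3, 1, 60))
```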

3.  Replicability-Rankings of Psychology Departments
This blog presents rankings of psychology departments on the basis of the replicability of significant results published in 105 psychology journals (see the journal rankings for a list of journals).   Reported success rates in psychology journals are over 90%, but this percentage is inflated by selective reporting of significant results.  After correcting for selection bias, replicability is 60%, but there is reliable variation across departments.

4. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
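
As a rough sketch of the formula in parentheses (using the convention that inflation is the difference between the success rate and the median observed power; the p-values below are invented for illustration):

```python
# Hedged sketch of the formula above, with inflation taken to be the
# difference between the success rate and the median observed power:
#   R-Index = Median Observed Power - (Success Rate - Median Observed Power)
# The p-values below are invented for illustration.
import numpy as np
from scipy import stats

def r_index(p_values, alpha=0.05):
    p = np.asarray(p_values, dtype=float)
    z = stats.norm.isf(p / 2)                   # |z| for each two-tailed result
    z_crit = stats.norm.isf(alpha / 2)          # 1.96 for alpha = .05
    observed_power = stats.norm.sf(z_crit - z)  # post-hoc power of each test
    median_power = np.median(observed_power)
    success_rate = np.mean(p < alpha)           # share of significant results
    inflation = success_rate - median_power
    return median_power - inflation

print(r_index([0.04, 0.03, 0.049, 0.01, 0.20]))
```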

5.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
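
Here is a minimal sketch of the TIVA logic in Python (not the exact implementation used in the post, and the p-values are invented rather than taken from Bem, 2011):

```python
# Sketch of the TIVA logic described above (not the exact implementation):
# convert two-tailed p-values to z-scores, compute their variance, and compare
# it to the expected variance of 1 with a left-tailed chi-square test.
import numpy as np
from scipy import stats

def tiva(p_values):
    z = stats.norm.isf(np.asarray(p_values, dtype=float) / 2)
    k = len(z)
    var_z = np.var(z, ddof=1)                  # sample variance of the z-scores
    chi2_stat = (k - 1) * var_z / 1.0          # expected variance under H0 is 1
    p_left = stats.chi2.cdf(chi2_stat, k - 1)  # small p -> variance is too small
    return var_z, p_left

# Invented p-values (not Bem's data): a set of just-significant results with
# suspiciously little spread in their z-scores.
print(tiva([0.04, 0.03, 0.02, 0.045, 0.035]))
```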

6.  Validation of Meta-Analysis of Observed (post-hoc) Power
This post examines the ability of various estimation methods to estimate the power of a set of studies based on the reported test statistics in these studies.  The results show that most estimation methods work well when all studies have the same effect size (homogeneous case) or when effect sizes are heterogeneous but symmetrically distributed.  However, most methods fail when effect sizes are heterogeneous and have a skewed distribution.  The post does not yet include the more recent method that uses the distribution of z-scores (powergraphs) to estimate observed power because that method was developed after this post was written.

7. Roy Baumeister’s R-Index
Roy Baumeister was a reviewer of my 2012 article that introduced the Incredibility Index to detect publication bias and dishonest reporting practices.  In his review and in a subsequent email exchange, Roy Baumeister admitted that his published article excluded studies that failed to produce results in support of his theory that blood-glucose is important for self-regulation (a theory that is now generally considered to be false), although he disagrees that excluding these studies was dishonest.  The R-Index builds on the Incredibility Index and provides an index of the strength of evidence that corrects for the influence of dishonest reporting practices.  This post reports the R-Index for Roy Baumeister's most cited articles. The R-Index is low and does not justify the nearly perfect support for empirical predictions in these articles. At the same time, the R-Index is similar to R-Indices for other sets of studies in social psychology.  This suggests that dishonest reporting practices are the norm in social psychology and that published articles exaggerate the strength of evidence in support of social psychological theories.

8. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

9.  The R-Index for 18 Multiple-Study Psychology Articles in the Journal SCIENCE.
Francis (2014) demonstrated that nearly all multiple-study articles by psychology researchers published in the prestigious journal SCIENCE showed evidence of dishonest reporting practices (disconfirmatory evidence was missing).  Francis (2014) used a method similar to the Incredibility Index.  One problem of this method is that the result is a probability that is influenced by both the amount of bias and the number of results available for analysis. As a result, an article with 9 studies and moderate bias is treated the same as an article with 4 studies and a lot of bias.  The R-Index avoids this problem by focusing on the amount of bias (inflation) and the strength of evidence.  This blog post shows the R-Index for these 18 articles and reveals that many of them have a low R-Index.

10.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., a 25% probability that the effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to assign lower probabilities to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.
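
A small sketch can make the dependence on the alternative hypothesis concrete. It uses a normal-approximation Bayes factor with a zero-centered normal prior on the effect size, which is not the exact model used in the analyses discussed in the post; the sample size and observed effect are illustrative:

```python
# Sketch (not the analyses discussed above): a normal-approximation Bayes
# factor with the alternative specified as a zero-centered normal prior on
# the standardized effect size d. The only point is that the same data yield
# different Bayes factors depending on how much prior mass the alternative
# puts on large effects. Sample size and observed effect are illustrative.
from scipy import stats

def bf01(d_hat, n, prior_sd):
    """BF for H0 (d = 0) against H1: d ~ Normal(0, prior_sd), one-sample case."""
    se = 1 / n ** 0.5                           # approximate standard error of d_hat
    m0 = stats.norm.pdf(d_hat, 0, se)           # marginal likelihood under H0
    m1 = stats.norm.pdf(d_hat, 0, (se**2 + prior_sd**2) ** 0.5)  # under H1
    return m0 / m1

# A small true effect observed with n = 100 (d_hat = .2):
print(bf01(0.2, 100, prior_sd=1.0))  # wide prior: BF01 > 1, "support" for the null
print(bf01(0.2, 100, prior_sd=0.2))  # prior on small effects: BF01 < 1, favors H1
```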

How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Manuscript under review, copyright belongs to Jerry Brunner and Ulrich Schimmack

How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Jerry Brunner and Ulrich Schimmack
University of Toronto @ Mississauga

Abstract
In the past five years, the replicability of original findings published in psychology journals has been questioned. We show that replicability can be estimated by computing the average power of studies. We then present four methods that can be used to estimate average power for a set of studies that were selected for significance: p-curve, p-uniform, maximum likelihood, and z-curve. We present the results of large-scale simulation studies with both homogeneous and heterogeneous effect sizes. All methods work well with homogeneous effect sizes, but only maximum likelihood and z-curve produce accurate estimates with heterogeneous effect sizes. All methods overestimate replicability for the studies in the Open Science Collaboration's reproducibility project, and we discuss possible reasons for this. Based on the simulation studies, we recommend z-curve as a valid method to estimate replicability. We also validated a conservative bootstrap confidence interval that makes it possible to use z-curve with small sets of studies.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, P-curve, P-uniform, Z-curve, Effect size, Replicability, Simulation.

Link to manuscript:  http://www.utstat.utoronto.ca/~brunner/zcurve2016/HowReplicable.pdf

Link to website with technical supplement:
http://www.utstat.utoronto.ca/~brunner/zcurve2016/
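
To see why a bias correction is needed at all, here is a small illustrative simulation (all parameter values are made up): selecting only the significant results and averaging their naive observed power overestimates the true power.

```python
# Illustrative simulation (parameter values are made up): when only the
# significant results are retained, the naive mean of their observed power
# overestimates the true power, which is why bias-correcting estimators such
# as those compared in the manuscript are needed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_power = 0.5
z_crit = stats.norm.isf(0.025)              # 1.96 for alpha = .05, two-tailed
ncp = z_crit - stats.norm.isf(true_power)   # noncentrality that gives this power

z = rng.normal(ncp, 1, size=100_000)        # z-statistics of many studies
significant = z > z_crit                    # selection for significance
naive = stats.norm.sf(z_crit - z[significant]).mean()

print(significant.mean())   # close to the true power of .5
print(naive)                # clearly larger: observed power is inflated
```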


A Critical Review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”

In this review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”, I present verbatim quotes from their chapter and explain why these statements are misleading or false, and how the authors distort the actual evidence by selectively citing research that supports their claims while hiding evidence that contradicts them. I show that the empirical evidence for the claims made by Schwarz and Strack is weak and biased.

Unfortunately, this chapter has had a strong influence on Daniel Kahneman’s attitude towards life-satisfaction judgments, and his fame as a Nobel laureate has led many people to believe that life-satisfaction judgments are highly sensitive to the context in which these questions are asked and practically useless for the measurement of well-being.  This has led to claims that wealth is not a predictor of well-being, but only a predictor of invalid life-satisfaction judgments (Kahneman et al., 2006), or that the effects of wealth on well-being are limited to low incomes.  None of these claims are valid because they rely on the unsupported assumption that life-satisfaction judgments are invalid measures of well-being.

The original quotes are highlighted in bold followed by my comments.

Much of what we know about individuals’ subjective well-being (SWB) is based on self-reports of happiness and life satisfaction.

True. The reason is that sociologists developed brief, single-item measures of well-being that could be included easily in large surveys such as the World Value Survey, the German Socio-Economic Panel, or the US General Social Survey.  As a result, there is a wealth of information about life-satisfaction judgments that transcends scientific disciplines. The main contribution of social psychologists to this research program that examines how social factors influence human well-being has been to dismiss the results based on claims that the measure of well-being is invalid.

As Angus Campbell (1981) noted, the “use of these measures is based on the assumption that all the countless experiences people go through from day to day add to . . . global feelings of well-being, that these feelings remain relatively constant over extended periods, and that people can describe them with candor and accuracy”

Half true.  Like all self-report measures, the validity of life-satisfaction judgments depends on respondents’ ability and willingness to provide accurate information.  However, it is not correct to suggest that life-satisfaction judgments assume that feelings remain constant over extended periods of time or that respondents have to rely on feelings to answer questions about their satisfaction with life.  There is a long tradition in the well-being literature of distinguishing cognitive measures of well-being, like Cantril’s ladder, from affective measures that focus on affective experiences in the recent past, like Bradburn’s affect balance scale.  The key assumption underlying life-satisfaction judgments is that respondents have chronically accessible information about their lives or can accurately estimate the frequency of positive and negative feelings. It is not necessary that the feelings are stable.

These assumptions have increasingly been drawn into question, however, as the empirical work has progressed.

It is not clear which assumptions have been drawn into question.  Are people unwilling to report their well-being, are they unable to do so, or are feelings not as stable as they are assumed to be? Moreover, the statement ignores a large literature that has demonstrated the validity of well-being measures going back to the 1930s (see Diener et al., 2009; Schneider & Schimmack, 2009, for a meta-analysis).

First, the relationship between individuals’ experiences and objective conditions of life and their subjective sense of well-being is often weak and sometimes counter-intuitive.  Most objective life circumstances account for less than 5 percent of the variance in measures of SWB, and the combination of the circumstances in a dozen domains of life does not account for more than 10 percent (Andrews and Withey 1976; Kammann 1982; for a review, see Argyle, this volume).

 

First, it is not clear what weak means. How strong should the correlation between objective conditions of life and subjective well-being be?  For example, should marital status be a strong predictor of happiness? Maybe it matters more whether people are happily married or unhappily married than whether they are married or single.  Second, there is no explanation for the claim that these relationships are counter-intuitive.  Employment, wealth, and marriage are positively related to well-being, as most people would expect. The only finding in the literature that may be considered counter-intuitive is that having children does not notably increase well-being and sometimes decreases well-being. However, this does not mean well-being measures are false; it may mean that people’s intuitions about the effects of life events on well-being are wrong. If intuitions were always correct, we would not need scientific studies of the determinants of well-being.

 

Second, measures of SWB have low test-retest reliabilities, usually hovering around .40, and not exceeding .60 when the same question is asked twice during the same one-hour interview (Andrews and Withey 1976; Glatzer 1984).

 

This argument ignores that responses to a single self-report item often have a large amount of random measurement error, unless participants can recall their previous answer.  The typical reliability of a single-item self-report measure is about r = .6 ± .2.  There is nothing unique about the results reported here for well-being measures. Moreover, the authors blatantly ignore evidence that scales with multiple items, like Diener’s Satisfaction with Life Scale, have retest correlations above r = .8 over a one-month period (see Schimmack & Oishi, 2005, for a meta-analysis).  Thus, this statement is misleading and factually incorrect.

 

Moreover, these measures are extremely sensitive to contextual influences.

 

This claim is inconsistent with the high retest correlation over periods of one month. Moreover, survey researchers have conducted numerous studies in which they examined the influence of the survey context on well-being measures and a meta-analysis of these studies shows only a small effect of previous items on these judgments and the pattern of results is not consistent across studies (see Schimmack & Oishi, 2005 for a meta-analysis).

 

Thus, minor events, such as finding a dime (Schwarz 1987) or the outcome of soccer games (Schwarz et al. 1987), may profoundly affect reported satisfaction with one’s life as a whole.

 

As I will show, the chapter makes many statements about what may happen.  For example, finding a dime may profoundly affect a well-being report, or it may have no effect at all on these judgments.  These statements are correct because well-being reports can be made in many different ways. The real question is how these judgments are made when well-being measures are used to measure well-being. Experimental studies that manipulate the situation cannot answer this question because they purposefully create situations in which respondents use mood (when mood is manipulated) or temporarily accessible information (when relevant information is made salient).  The processes underlying judgments in these experiments may reveal influences on life-satisfaction judgments in a real survey context, or they may reveal processes that do not occur under normal circumstances.

 

Most important, however, the reports are a function of the research instrument and are strongly influenced by the content of preceding questions, the nature of the response alternatives, and other “technical” aspects of questionnaire design (Schwarz and Strack 1991a, 1991b).

 

We can get different answers to different questions.  The item “So far, I have gotten everything I wanted in life” may be answered differently than the item “I feel good about my life, these days.”  If so, it is important to examine which of these items is a better measure of well-being.  It does not imply that all well-being items are flawed.  The same logic applies to the response format.  If some response formats produce different results than others, it is important to determine which response formats are better for the measurement of well-being.  Last, but not least, the claim that well-being reports are “strongly influenced by the content of preceding questions” is blatantly false.  A meta-analysis shows that strong effects were only observed in two studies by Strack, but that other studies find much weaker or no effects (see Schimmack & Oishi, 2005, for a meta-analysis).

 

Such findings are difficult to reconcile with the assumption that subjective social indicators directly reflect stable inner states of well-being (Campbell 1981) or that the reports are based on careful assessments of one’s objective conditions in light of one’s aspirations (Glatzer and Zapf 1984). Instead, the findings suggest that reports of SWB are better conceptualized as the result of a judgment process that is highly context-dependent.

 

Indeed. A selective and biased list of evidence is inconsistent with the hypothesis that well-being reports are valid measures of well-being, but this only shows that the authors misrepresent the evidence, not that well-being reports lack validity.  Their validity was carefully examined in Andrews and Withey’s (1976) book, which the authors cite without mentioning the evidence it presents for the usefulness of well-being reports.

 

[A PREVIEW]

 

Not surprisingly, individuals may draw on a wide variety of information when asked to assess the subjective quality of their lives.

 

Indeed. This means that it is impossible to generalize from an artificial context created in an experiment to the normal conditions of a well-being survey because respondents may use different information in the experiment than in the naturalistic context. The experiment may lead respondents to use information that they normally would not use.

 

[USING INFORMATION ABOUT ONE’S OWN LIFE: INTRAINDIVIDUAL COMPARISONS]

 

Comparison-based evaluative judgments require a mental representation of the object of judgment, commonly called a target, as well as a mental representation of a relevant standard to which the target can be compared.

 

True. In fact, Cantril’s ladder explicitly asks respondents to compare their actual life to the best possible life they could have and the worst possible life they could have.  We can think about these possible lives as imaginary intrapersonal comparisons.

 

When asked, “Taking all things together, how would you say things are these days?” respondents are ideally assumed to review the myriad of relevant aspects of their lives and to integrate them into a mental representation of their life as a whole.”

 

True, this is the assumption underlying the use of well-being reports as measures of well-being.

 

In reality, however, individuals rarely retrieve all information that may be relevant to a judgment

 

This is also true. It is impossible to retrieve ALL of the relevant information. But it is possible that respondents retrieve most of the relevant information or enough relevant information to make these judgments valid. We do not require 100% validity for measures to be useful.

 

Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).

 

This is also plausible. The question is what the criterion for sufficient certainty would be for well-being judgments and whether this level of certainty is reached before enough relevant information has been retrieved. For example, if I have to report how satisfied I am with my life overall and I think first about my marriage, would I stop there, or would I recognize that my overall life is more than my marriage and also think about my work?  Depending on the answer to this question, well-being judgments may be more or less valid.

 

Hence, the judgment is based on the information that is most accessible at that point in time. In general, the accessibility of information depends on the recency and frequency of its use (for a review, see Higgins 1996).

 

This also makes sense.  A sick person may think about their health, a person in a happy marriage may think about their loving wife, and a person with financial problems may think about their trouble paying bills.  Any life domain that is particularly salient in a person’s life is also likely to be salient when they are confronted with a life-satisfaction question. However, we still do not know which information people will use and how much information they will use before they consider their judgment sufficiently accurate to provide an answer. Would they use just one salient, temporarily accessible piece of information, or would they continue to look for more information?

 

Information that has just been used-for example, to answer a preceding question in the questionnaire-is particularly likely to come to mind later on, although only for a limited time.

 

Wait a second.  Higgins emphasized that accessibility is driven by recency and frequency (!) of use. Individuals who are going through a divorce or cancer treatment have probably thought frequently about this aspect of their lives.  A single question about their satisfaction with their recreational activities may not make them judge their lives based on their hobbies. Thus, it does not follow from Higgins’s work on accessibility that preceding items have a strong influence on well-being judgments.

 

This temporarily accessible information is the basis of most context effects in survey measurement and results in variability in the judgment when the same question is asked at different times (see Schwarz and Strack 1991b; Strack 1994a; Sudman, Bradburn, and Schwarz 1996, chs. 3 to 5; Tourangeau and Rasinski 1988)

 

Once more, the evidence for these temporary accessibility effects is weak, and it is not clear why well-being judgments would be highly stable over time if they were driven by irrelevant information that happens to be temporarily accessible.  In fact, the evidence is more consistent with Higgins’s suggestion that frequency of use influences well-being judgments.  Life domains that are salient to individuals are likely to influence life-satisfaction judgments because they are chronically accessible, even if other information is temporarily accessible or primed by preceding questions.

 

Other information, however, may come to mind because it is used frequently-for example, because it relates to the respondent’s current concerns (Klinger 1977) or life tasks (Cantor and Sanderson, this volume). Such chronically accessible information reflects important aspects of respondents’ lives and provides for some stability in judgments over time.

 

Indeed, but look at the wording. “This temporarily accessible information IS the basis of most context effects in survey measurement” vs. “Other information, however, MAY come to mind.”  The wording is not balanced and it does not match the evidence that most of the variation in well-being reports across individuals is stable over time and only a small proportion of the variance changes systematically over time. The wording is an example of how some scientists create the illusion of a balanced literature review while pushing their biased opinions.

 

As an example, consider experiments on question order. Strack, Martin, and Schwarz (1988) observed that dating frequency was unrelated to students’ life satisfaction when a general satisfaction question preceded a question about the respondent’s dating frequency, r = –.12.  Yet reversing the question order increased the correlation to r = .66.  Similarly, marital satisfaction correlated with general life satisfaction r = .32 when the general question preceded the marital one in another study (Schwarz, Strack, and Mai 1991). Yet reversing the question order again increased this correlation to r = .67.

 

The studies that are cited here are not representative. They show the strongest item-order effects, and the effects are much stronger than the meta-analytic average (Schimmack & Oishi, 2005). Both studies were conducted by Strack. Thus, these examples are at best illustrations of what might happen under very specific conditions that differ from other conditions in which the effect was much smaller. Moreover, it is not clear why dating frequency should be a strong positive predictor of life-satisfaction. Why is my life better when I have a lot of dates than when I am in a steady relationship? And we would not expect a married respondent with lots of dates to be happy with their marriage. The difference between r = .32 and r = .66 is large, but it was obtained with small samples, and it is common for small samples to overestimate effect sizes. In fact, large survey studies show much weaker effects. In short, by focusing on these two examples, the authors create the illusion that strong effects of preceding items are common and that these studies are just an example of these effects. In reality, these are the only two studies with extremely and unusually strong effects, and they are not representative of the literature. The selective use of evidence is another example of unscientific practices that undermine a cumulative science.

 

Findings of this type indicate that preceding questions may bring information to mind that respondents would otherwise not consider.

 

Yes, it may happen, but we do not know under what specific circumstances it happens.  At present, the only predictor of these strong effects is that the studies were conducted by Fritz Strack. Nobody else has reported such strong effects.

 

If this information is included in the representation that the respondent forms of his or her life, the result is an assimilation effect, as reflected in increased correlations. Thus, we would draw very different inferences about the impact of dating frequency or marital satisfaction on overall SWB, depending on the order in which the questions are asked.

 

Now the authors extrapolate from extreme examples and discuss possible theoretical implications as if this were a consistent and replicable finding.  “We would draw different inferences.”  True. If this were a replicable finding and we asked about specific life domains first, we would end up with false inferences about the importance of dating and marriage for life-satisfaction. However, it is irrelevant what follows logically from a false assumption (if Daniel Kahneman had not won the Nobel Prize, it would be widely accepted that money buys some happiness). Second, it is possible to ask the global life-satisfaction question first, without making information about specific aspects of life temporarily salient.  This simple procedure would ensure that well-being reports are more strongly influenced by chronically accessible information that reflects people’s life concerns.  After all, participants may draw on chronically accessible information or temporarily accessible information, and if no relevant information was made temporarily accessible, respondents will use chronically accessible information.

 

Theoretically, the impact of a given piece of accessible information increases with its extremity and decreases with the amount and extremity of other information that is temporarily or chronically accessible at the time of judgment (see Schwarz and Bless 1992a). To test this assumption, Schwarz, Strack, and Mai ( 1991) asked respondents about their job satisfaction, leisure time satisfaction, and marital satisfaction prior to assessing their general life satisfaction, thus rendering a more varied set of information accessible. In this case, the correlation between marital satisfaction and life satisfaction increased from r = .32 (in the general-marital satisfaction order) to r = .46, yet this increase was less pronounced than the r = .67 observed when marital satisfaction was the only specific domain addressed.

 

This finding also suggests that strong effects of temporarily accessible information are highly context dependent. Just asking for satisfaction with several life-domains reduces the item order effect and with the small samples in Schwarz et al. (1991), the difference between r = .32 and r = .46 is not statistically significant, meaning it could be a chance finding.  So, their own research suggests that temporarily accessible information may typically have a small effect on life-satisfaction and this conclusion would be consistent with the evidence in the literature.
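
For readers who want to check claims like this, the standard Fisher r-to-z comparison of two independent correlations looks like the sketch below; the sample sizes are placeholders, since they are not stated here.

```python
# Sketch: the standard Fisher r-to-z test for the difference between two
# independent correlations. The sample sizes below are placeholders (the post
# does not state them); with groups of this size the difference between
# r = .32 and r = .46 is nowhere near significant.
from math import atanh, sqrt
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = atanh(r1), atanh(r2)            # Fisher r-to-z transform
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))      # two-tailed p-value

print(compare_correlations(0.32, 50, 0.46, 50))   # hypothetical n = 50 per condition
```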

 

In light of these findings, it is important to highlight some limits for the emergence of question-order effects. First, question-order effects of the type discussed here are to be expected only when answering a preceding question increases the temporary accessibility of information that is not chronically accessible anyway…  Hence, chronically accessible current concerns would limit the size of any emerging effect, and the more they do so, the more extreme the implications of these concerns are.

 

Here the authors acknowledge that there are theoretical reasons why item-order effects should typically not have a strong influence on well-being reports.  One reason is that some information, such as marital satisfaction, is likely to be used even if marriage is not made salient by a preceding question.  It is therefore not clear why marital satisfaction would produce a big increase from r = .32 to r = .67, as this would imply that numerous respondents do not consider their marriage when they make the judgment.  Chronic accessibility would also explain why other studies found much weaker item-order effects with marital satisfaction and higher correlations between marital satisfaction and life-satisfaction than r = .32.  However, it is interesting that this important theoretical point is offered only as a qualification after presenting evidence from two studies that did show strong item-order effects. If the argument had been presented first, the question would arise why these particular studies produced strong item-order effects, and it would be evident that it is impossible to generalize from these specific studies to well-being reports in general.

 

[CONVERSATIONAL NORMS]

 

“Complicating things further, information rendered accessible by a preceding question may not always be used.”

 

How is this complicating things further?  If there are ways to communicate to respondents that they should not be influenced by previous items (e.g., “Now on to another topic” or “take a moment to think about the most important aspects of your life”) and this makes context effects disappear, why don’t we just use the proper conversational norms to avoid these undesirable effects? And some surveys actually do this and we would therefore expect that they elicit valid reports of well-being that are not based on responses to previous questions in the survey.

 

In the above studies (Strack et al. 1988; Schwarz et al. 1991), the conversational norm of nonredundancy was evoked by a joint lead-in that informed respondents that they would now be asked two questions pertaining to their well-being. Following this lead-in, they first answered the specific question (about dating frequency or marital satisfaction) and subsequently reported their general life satisfaction. In this case, the previously observed correlations of r = .66 between dating frequency and life satisfaction, or of r = .67 between marital satisfaction and life satisfaction, dropped to r = –.15 and .18, respectively. Thus, the same question order resulted in dramatically different correlations, depending on the elicitation of the conversational norm of nonredundancy.

 

The only evidence for these effects comes from a couple of studies by the authors.  Even if these results hold, they suggest that it should be possible to use conversational norms to get the same results for both item orders if the norms signal that participants should use all relevant chronically accessible information.  However, the authors did not conduct such a study. One reason may be that the prediction would be that there is no effect, and researchers are only interested in using manipulations that show effects so that they can reject the null-hypothesis. Another explanation could be that Schwarz and Strack’s program of research on well-being reports was built on the heuristics-and-biases program in social psychology, which is only interested in showing biases and ignores evidence for accuracy (Funder, 1987). The only results deemed relevant and worthy of publication are experiments that successfully created a bias in judgments. The problem with this approach is that it cannot reveal that these judgments are also accurate and can be used as valid measures of well-being.

 

[SUMMARY]

 

Judgments are based on the subset of potentially applicable information that is chronically or temporarily accessible at the time.

 

Yes, it is not clear what else the judgments could be based on.

 

Accessible information, however, may not be used when its repeated use would violate conversational norms of nonredundancy.

 

Interestingly this statement would imply that participants are not influenced by subtle information (priming). The information has to be consciously accessible to determine whether it is relevant and only accessible information that is considered relevant is assumed to influence judgments.  This also implies that making information accessible that is not considered relevant will not have an influence on well-being reports. For example, asking people about their satisfaction with the weather or the performance of a local sports team does not lead to a strong influence of this information on life-satisfaction judgments because most people do not consider this information relevant (Schimmack et al., 2002). Once more, it is not clear how well-being reports can be highly context dependent, if information is carefully screened for relevance and responses are only made when sufficient relevant information was retrieved.

 

[MENTAL CONSTRUALS OF ONE’S LIFE AND A RELEVANT STANDARD: WHAT IS, WAS, WILL BE, AND MIGHT HAVE BEEN]

 

Suppose that an extremely positive (or negative) life event comes to mind. If this event is included in the temporary representation of the target “my life now,” it results in a more positive (negative) assessment of SWB, reflecting an assimilation effect, as observed in an increased correlation in the studies discussed earlier. However, the same event may also be used in constructing a standard of comparison, resulting in a contrast effect: compared to an extremely positive (negative) event, one’s life in general may seem relatively bland (or pretty benign). These opposite influences of the same event are sometimes referred to as endowment (assimilation) and contrast effects (Tversky and Griffin 1991).

 

This is certainly a possibility, but it is not necessarily limited to temporarily accessible information.  A period in an individual’s life may be evaluated relative to other periods in that person’s life.  In this sense, subjective well-being is genuinely subjective: objectively identical lives can be evaluated differently because past experiences created different ideals or comparison standards (see Cantril’s early work on human concerns).  This may happen for chronically accessible information just as much as for temporarily accessible information, and it does not imply that well-being reports are invalid; it just shows that they are subjective.

 

Strack, Schwarz, and Gschneidinger (1985, Experiment 1) asked respondents to report either three positive or three negative recent life events, thus rendering these events temporarily accessible.  As shown in the top panel of Table 1, these respondents reported higher current life satisfaction after they recalled three positive rather than negative recent events. Other respondents, however, had to recall events that happened at least five years before. These respondents reported higher current life satisfaction after recalling negative rather than positive past events.

 

This finding shows that contrast effects can occur.  However, it is important to note that these contrast effects were created by the experimental manipulation.  Participants were asked to recall events from five years ago.  In the naturalistic scenario, where participants are simply asked to report “how is your life these days,” participants are unlikely to suddenly recall events from five years ago.  Similarly, if you were asked about your happiness with your last vacation, you would be unlikely to recall earlier vacations and contrast your most recent vacation with them.  Indeed, Suh et al. (1996) showed that life-satisfaction judgments are influenced by recent events and that older events have no effect. They found no evidence for contrast effects when participants were not asked to recall events from the distant past.  So, this research shows what can happen in a specific context in which participants were asked to recall extremely negative or positive events from their past, but without prompting by an experimenter this context would hardly ever occur.  Thus, this study has no ecological or external validity for the question of how participants actually make life-satisfaction judgments.

 

These experimental results are consistent with correlational data (Elder 1974) indicating that U.S. senior citizens, the “children of the Great Depression,” are more likely to report high subjective well-being the more they suffered under adverse economic conditions when they were adolescents. 

 

This finding again does not mean that elderly US Americans who suffered more during the Great Depression were actively thinking about the Great Depression when they answered questions about their well-being. It is more likely that they may have lower aspirations and expectations from life (see Easterlin). This means that we can interpret this result in many ways. One explanation would be that well-being judgments are subjective and that cultural and historic events can shape individuals’ evaluation standards of their lives.

 

[SUMMARY]

 

In combination, the reviewed research illustrates that the same life event may affect judgments of SWB in opposite directions, depending on its use in the construction of the target “my life now” and of a relevant standard of comparison.

 

Again, the word “may” makes this statement true. Many things may happen, but that tells us very little about what actually is happening when respondents report on their well-being.  How past negative events can become positive events (a divorce was terrible, but it feels like a blessing after being happily remarried, etc.) and positive events can become negative events (e.g., the dream of getting tenure comes true, but doing research for life happens to be less fulfilling than one anticipated) is an interesting topic for well-being research, but none of these evaluative reversals undermine the usefulness of well-being measures. In fact, they are needed to reveal that subjective evaluations have changed and that past evaluations may have carry over effects on future evaluations.

 

It therefore comes as no surprise that the relationship between life events and judgments of SWB is typically weak. Today’s disaster can become tomorrow’s standard, making it impossible to predict SWB without a consideration of the mental processes that determine the use of accessible information.

 

Actually, the relationship between life-events and well-being is not weak.  Lottery winners are happier and accident victims are unhappier.  And cross-cultural research shows that people do not simply get used to terrible life circumstances.  Starving is painful. It does not become a normal standard for well-being reports on day 2 or 3.  Most of the time, past events simply lose importance and are replaced by new events and well-being measures are meant to cover a certain life period rather than an individual’s whole life from birth to death.  And because subjective evaluations are not just objective reports of life-events, they depend on mental processes. The problem is that a research program that uses experimental manipulations does not tell us about the mental processes that are underlying life-satisfaction judgments when participants are not manipulated.

 

[WHAT MIGHT HAVE BEEN: COUNTERFACTUALS]

 

Counterfactual thinking can influence affect and subjective well-being in several ways (see Roese 1997; Roese and Olson 1995b).

 

Yes, it can, it may, and it might, but the real question is whether it does influence well-being reports and if so, how it influences these reports.

 

For example, winners of Olympic bronze medals reported being more satisfied than silver medalists (Medvec, Madey, and Gilovich 1995), presumably because for winners of bronze medals, it is easier to imagine having won no medal at all (a “downward counterfactual”), while for winners of silver medals, it is easier to imagine having won the gold medal (an “upward counterfactual”).

 

This is not an accurate summary of the article, which contained three studies.  Study 1 used ratings of video clips of Olympic medalists immediately after the event (23 silver & 18 bronze medalists).  The study showed a strong effect that bronze medalists appeared happier than silver medalists, F(1,72) = 18.98.  The authors also noted that in some events the silver medal means that an athlete lost a finals match, whereas in other events they just placed second in a field of 8 or more athletes.  An analysis that excluded final matches showed weaker evidence for the effect, F(1,58) = 6.70.  Most important, this study did not include subjective reports of satisfaction, as claimed in the review article. Study 2 examined interviews of 13 silver and 9 bronze medalists.  Participants in Study 2 rated interviews of silver medal athletes as containing more counterfactual statements (e.g., "I almost"), t(20) = 2.37, p < .03.  Importantly, no results regarding satisfaction are reported. Study 3 actually recruited athletes for a study and had a larger sample size (N = 115). Participants were interviewed by the experimenters after they won a silver or bronze medal at an athletic competition (not the Olympics).   The description of the procedure is presented verbatim here.

 

Procedure. The athletes were approached individually following their events and asked to rate their thoughts about their performance on the same 10-point scale used in Study 2. Specifically, they were asked to rate the extent to which they were concerned with thoughts of "At least I . . ." (1) versus "I almost . . ." (10). Special effort was made to ensure that the athletes understood the scale before making their ratings. This was accomplished by mentioning how athletes might have different thoughts following an athletic competition, ranging from "I almost did better" to "at least I did this well."

 

What is most puzzling about this study is that the experimenters seemingly did not ask questions about emotions or satisfaction with performance.  It would have taken only a couple of questions to obtain reports that speak to the question of the article, namely whether winning a silver medal is subjectively better than winning a bronze medal.  Alas, these questions are missing. The only result from Study 3 is that, "as predicted, silver medalists' thoughts following the competition were more focused on 'I almost' than were bronze medalists'.  Silver medalists described their thoughts with a mean rating of 6.8 (SD = 2.2), whereas bronze medalists assigned their thoughts an average rating of 5.7 (SD = 2.7), t(113) = 2.4, p < .02."

 

In sum, there is no evidence in this study that winning an Olympic silver medal or any other silver medal for that matter makes athletes less happy than winning a bronze medal. The misrepresentation of the original study by Schwarz and Strack is another example of unscientific practices that can lead to the fabrication of false facts that are difficult to correct and can have a lasting negative effect on the creation of a cumulative science.

 

In summary, judgments of SWB can be profoundly influenced by mental constructions of what might have been.

 

This statement is blatantly false. The cited study on medal winners does not justify this claim, and there is no scientific basis for the claim that these effects are profound.

 

In combination, the discussion in the preceding sections suggests that nearly any aspect of one’s life can be used in constructing representations of one’s “life now” or a relevant standard, resulting in many counterintuitive findings.

 

A collection of selective findings that were obtained using different experimental procedures does not mean that well-being reports obtained under naturalistic conditions produce many counterintuitive findings, nor is there any evidence that they do produce many counterintuitive findings.  This statement lacks any empirical foundation and is inconsistent with other findings in the well-being literature.

 

Common sense suggests that misery that lasts for years is worse than misery that lasts only for a few days.

 

Indeed. Extended periods of severe depression can drive some people to attempt suicide; a week with the flu does not. Consistent with this common-sense observation, well-being reports of depressed people are much lower than those of other people, once more showing that well-being reports often produce results that are consistent with intuitions.

 

Recent research suggests, however, that people may largely neglect the duration of the episode, focusing instead on two discrete data points, namely, its most intense hedonic moment (“peak”) and its ending (Fredrickson and Kahneman 1993; Varey and Kahneman 1992). Hence, episodes whose worst (or best) moments and endings are of comparable intensity are evaluated as equally (un)pleasant, independent of their duration (for a more detailed discussion, see Kahneman, this volume).

 

Yes, but this research focuses on brief episodes with a single emotional event.  It is interesting that the duration of such episodes seems to matter very little, but life is a complex series of events and episodes. Having sex for 20 minutes or 30 minutes may not matter, but having sex regularly, at least once a week, does seem to matter for couples’ well-being.  As Diener et al. (1985) noted, it is the frequency, not the intensity (or duration), of positive and negative events in people’s lives that matters.

 

Although the data are restricted to episodes of short duration, it is tempting to speculate about the possible impact of duration neglect on the evaluation of more extended episodes.

 

Yes, interesting, but this statement clearly indicates that the research on duration neglect is not directly relevant for well-being reports.

 

Moreover, retrospective evaluations should crucially depend on the hedonic value experienced at the end of the respective episode.

 

This is a prediction, not a fact. I have actually examined this question and found that the frequency of positive and negative events during a day has a stronger influence on respondents’ satisfaction judgments about that day than how they felt at the end of the day.

 

[SUMMARY]

 

As our selective review illustrates, judgments of SWB are not a direct function of one’s objective conditions of life and the hedonic value of one’s experiences.

 

First, it is great that the authors acknowledge here that their review is selective.  Second, we do not need a review to know that subjective well-being is not a direct function of objective life conditions. The whole point of subjective well-being reports is to allow respondents to evaluate these events from their own subjective point of view.  And finally, at no point has this selective review shown that these reports do not depend on the hedonic value of one’s experiences. In fact, measures of hedonic experiences are strong predictors of life-satisfaction judgments (Schimmack et al., 2002; Lucas et al., 1996; Zou et al., 2012).

 

Rather they crucially depend on the information that is accessible at the time of judgment and how this information is used in constructing mental representations of the to-be-evaluated episode and a relevant standard.

 

This factual statement cannot be supported by a selective review of the literature. You cannot say, my selective review of creationist literature shows that evolution theory is wrong.  You can say that a selective review of creationist literature would suggest that evolution theory is wrong, but you cannot say that it is wrong. To make scientific statements about what is (highly probable to be) true and what is (highly probable to be) false, you need to conduct a review of the evidence that is not selective and not biased.

 

As a result of these construal processes, judgments of SWB are highly malleable and difficult to predict on the basis of objective conditions. 

 

This is not correct.  Evaluations do not directly depend on objective conditions, but this is a feature of evaluations in general, not a special problem of well-being reports.  At the same time, the construal processes that relate objective events to subjective well-being are systematic, predictable, and depend on chronically accessible and stable information.  Well-being reports are highly correlated with objective characteristics of nations; bereavement, unemployment, and divorce have negative effects on well-being; and winning the lottery, marriage, and remarriage have positive effects on well-being.  Schwarz and Strack are fabricating facts. This is not considered fraud; only manipulating and fabricating data count as scientific fraud, but this does not mean that fabricated facts are less harmful than fabricated data.  Science can only provide a better understanding if it is based on empirically verified and replicable facts. Simply stating that "judgments of SWB are difficult to predict" without providing any evidence for this claim is unscientific.

 

[USING INFORMATION ABOUT OTHERS: SOCIAL COMPARISONS]

 

The causal impact of comparison processes has been well supported in laboratory experiments that exposed respondents to relevant comparison standards…For example, Strack and his colleagues (1990) observed that the mere presence of a handicapped confederate was sufficient to increase reported SWB under self-administered questionnaire conditions, presumably because the confederate served as a salient standard of comparison….As this discussion indicates, the impact of social comparison processes on SWB is more complex than early research suggested. As far as judgments of global SWB are concerned, we can expect that exposure to someone who is less well off will usually result in more positive-and to someone who is better off in more negative assessments of one’s own life.  However, information about the other’s situation will not always be used as a comparison standard.

The whole section about social comparison does not really address the question of the influence of social comparison on well-being reports.  Only a single study with a small sample is used to provide evidence that respondents may engage in social comparison processes when they report their well-being.  The danger of this occurring in a naturalistic context is rather slim.  Even in face-to-face interviews, the respondent is likely to have answered several questions about themselves, and it seems far-fetched that they would suddenly treat the interviewer as a relevant comparison standard, especially if the interviewer does not have a salient characteristic, like a disability, that may be considered relevant. Once more, the authors generalize from one very specific laboratory experiment to the naturalistic context in which SWB reports are normally made, without considering the possibility that the experimental results are highly context sensitive and do not reveal how respondents normally judge their lives.

[Standards Provided by the Social Environment]

In combination, these examples draw attention to the possibility that salient comparison standards in one’s immediate environment, as well as socially shared norms, may constrain the impact of fortuitous temporary influences. At present, the interplay of chronically and temporarily accessible standards on judgments of SWB has received little attention. The complexities that are likely to result from this interplay provide a promising avenue for future research.

Here the authors acknowledge that their program of research is limited and fails to address how respondents use chronically accessible information. They suggest that this is a promising avenue for future research, but they fail to acknowledge why they have not conducted studies that address this question. The reason is that their research program of experimentally manipulating the situation does not make it possible to study the use of chronically accessible information.  The use of information that, by definition, comes to mind spontaneously, independent of researchers’ experimental manipulations, is a blind spot of the experimental approach.

[Interindividual Standards Implied by the Research Instrument]

Finally, we extend our look at the influences of the research instrument by addressing a frequently overlooked source of temporarily accessible comparison information…As numerous studies have indicated (for a review, see Schwarz 1996), respondents assume that the list of response alternatives reflects the researcher’s knowledge of the distribution of the behavior: they assume that the “average” or “usual” behavioral frequency is represented by values in the middle range of the scale, and that the extremes of the scale correspond to the extremes of the distribution. Accordingly, they use the range of the response alternatives as a frame of reference in estimating their own behavioral frequency, resulting in different estimates of their own behavioral frequency, as shown in table 4.2. More important for our present purposes, they further extract comparison information from their low location on the scale…Similar findings have been obtained with regard to the frequency of physical symptoms and health satisfaction (Schwarz and Scheuring 1992), the frequency of sexual behaviors and marital satisfaction (Schwarz and Scheuring 1988), and various consumer behaviors (Menon, Raghubir, and Schwarz 1995).

One study is in German and not available. I examined the study by Schwarz and Scheuring (1988) in the European Journal of Social Psychology. Study 1 had four conditions with n = 12 or 13 per cell (N = 51). The response format varied so that having sex or masturbating once a week was either a high-frequency or a low-frequency occurrence. Subsequently, participants reported their relationship satisfaction. The relationship satisfaction ratings were analyzed with an ANOVA: “Analysis of variance indicates a marginally reliable interaction of both experimental variables, F(1,43) = 2.95, p < 0.10, and no main effects.” The result is not significant by conventional standards, and the degrees of freedom show that some participants were excluded from this analysis without any mention of this fact. Study 2 manipulated the response format for frequency of sex and masturbation within subjects. That is, all subjects were asked to rate frequencies of both behaviors in four different combinations. There were n = 16 per cell, N = 64. No ANOVA is reported, presumably because it was not significant. However, a PLANNED contrast between the high sex/low masturbation and the low sex/high masturbation group showed a just significant result, t(58) = 2.17, p = .034. Again, the degrees of freedom do not match the sample size. In conclusion, the evidence that subtle manipulations of response formats can lead to social comparison processes that influence well-being reports is not conclusive. Replication studies with larger samples would be needed to show that these effects are replicable and to determine how strong they are.
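For readers who want to check these numbers, here is a small sketch (my addition, not part of the original review or chapter) that recomputes the p-values from the reported test statistics:

```python
# A quick check of the reported p-values (my addition, not part of the original review).
from scipy import stats

# Study 1: "marginally reliable" interaction, F(1, 43) = 2.95
p_interaction = stats.f.sf(2.95, dfn=1, dfd=43)
print(f"F(1,43) = 2.95 -> p = {p_interaction:.3f}")   # ~ .09, i.e., p > .05

# Study 2: planned contrast, t(58) = 2.17, two-tailed
p_contrast = 2 * stats.t.sf(2.17, df=58)
print(f"t(58) = 2.17 -> p = {p_contrast:.3f}")        # ~ .034, just below .05
```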

In combination, they illustrate that response alternatives convey highly salient comparison standards that may profoundly affect subsequent evaluative judgments.

Once more, the word “may” makes the statement true in a trivial sense: many things may happen. However, there is no evidence that response alternatives actually have profound effects on well-being reports; the existing studies provide statistically weak evidence and no information about the magnitude of these effects.

Researchers are therefore well advised to assess information about respondents’ behaviors or objective conditions in an open-response format, thus avoiding the introduction of comparison information that respondents would not draw on in the absence of the research instrument.

There is no evidence that this would improve the validity of frequency reports, and research on sexual frequency shows similar results with open and closed response formats (Muise et al., 2016).

[SUMMARY]

In summary, the use of interindividual comparison information follows the principle of cognitive accessibility that we have highlighted in our discussion of intraindividual comparisons. Individuals often draw on the comparison information that is rendered temporarily accessible by the research instrument or the social context in which they form the judgment, although chronically accessible standards may attenuate the impact of temporarily accessible information.

The statement that people often rely on interpersonal comparison standards is not justified by the research.  By design, experiments that manipulate one type of information and make it salient cannot determine how often participants use this type of information when it is not made salient.

[THE IMPACT OF MOOD STATES]

In the preceding sections, we considered how respondents use information about their own lives or the lives of others in comparison-based evaluation strategies. However, judgments of well-being are a function not only of what one thinks about but also of how one feels at the time of judgment.

Earlier, the authors stated that respondents are likely to use a minimum of information that is deemed sufficient: “Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).” Now we are supposed to believe that respondents use intrapersonal and interpersonal information that is temporarily and chronically accessible, as well as their feelings. That is a lot of information, and it is not clear how all of it is combined into a single judgment. A more parsimonious explanation for the host of findings is that each experiment carefully created a context that made respondents use the information that the experimenters wanted them to use, thereby confirming the hypothesis that they use this information. The problem is that this only shows that a particular source of information may be used in one particular context. It does not mean that all of these sources of information are used and need to be integrated into a single judgment under naturalistic conditions. The program of research simply fails to address the question of which information respondents actually use when they are asked to judge their well-being in a normal context.

A wide range of experimental data confirms this intuition. Finding a dime on a copy machine (Schwarz 1987), spending time in a pleasant rather than an unpleasant room (Schwarz et al. 1987, Experiment 2), or watching the German soccer team win rather than lose a championship game (Schwarz et al. 1987, Experiment 1) all resulted in increased reports of happiness and satisfaction with one’s life as a whole…Experimental evidence supports this assumption. For example, Schwarz and Clore (1983, Experiment 2) called respondents on sunny or rainy days and assessed reports of SWB in telephone interviews. As expected, respondents reported being in a better mood, and being happier and more satisfied with their life as a whole, on sunny rather than on rainy days. Not so, however, when respondents’ attention was subtly drawn to the weather as a plausible cause of their current feelings.

The problem is that all of the cited studies were conducted by Schwarz, and other studies that produced different results are not mentioned. The famous weather study has recently been called into question. The weather paradigm is also not ideal for testing the model because weather effects on mood are not very strong in the first place. Respondents in sunny California do not report higher life-satisfaction than respondents in Ohio (Schkade & Kahneman, 1998), and several large-scale studies have now failed to replicate the famous weather effect on well-being reports (Lucas & Lawless, 2013; Schmiedeberg, 2014).

On theoretical grounds, we may assume that people are more likely to use the simplifying strategy of consulting their affective state the more burdensome it would be to form a judgment on the basis of comparison information.

Here it is not clear why it would be burdensome to make global life-satisfaction judgments. The previous sections suggested that respondents have access to a large amount of chronically and temporarily accessible information that they apparently used in the previous studies. Suddenly, it is claimed that retrieving relevant information is too hard and mood is used instead. It is not clear why respondents would consider their current mood sufficient to evaluate their lives, especially if inconsistent accessible information also comes to mind.

Note in this regard that evaluations of general life satisfaction pose an extremely complex task that requires a large number of comparisons along many dimensions with ill-defined criteria and the subsequent integration of the results of these comparisons into one composite judgment. Evaluations of specific life domains, on the other hand, are often less complex.

If evaluations of specific life domains are less complex and global questions are just an average of specific domains, it is not clear why it would be so difficult to evaluate satisfaction in a few important life domains (health, family, work) and integrate this information. The hypothesis that mood is only used as a heuristic for global well-being reports also suggests that it would be possible to avoid the use of this heuristic by asking participants to report satisfaction with specific life domains. As these questions are supposed to be easier to answer, participants would not use mood. Moreover, preceding items are less likely to make information accessible that is relevant for a specific life domain. For example, a dating question is irrelevant for academic or health satisfaction. Thus, participants are most likely to draw on chronically accessible information that is relevant for answering a question about satisfaction with a specific domain. It follows that averages of domain satisfaction judgments would be more valid than global judgments if participants were relying on mood for the global judgments. For example, finding a dime would make people judge their lives more positively, but not their health, social relationships, and income. Thus, many of the alleged problems with global well-being reports could be avoided by asking for domain-specific reports and then aggregating them (Andrews & Withey, 1976; Zou et al., 2013).

If judgments of general well-being are based on respondents’ affective state, whereas judgments of domain satisfaction are based on comparison processes, it is conceivable that the same event may influence evaluations of one’s life as a whole and evaluations of specific domains in opposite directions. For example, an extremely positive event in domain X may induce good mood, resulting in reports of increased global SWB. However, the same event may also increase the standard of comparison used in evaluating domain X, resulting in judgments of decreased satisfaction with this particular domain. Again, experimental evidence supports this conjecture. In one study (Schwarz et al. 1987, Experiment 2), students were tested in either a pleasant or an unpleasant room, namely, a friendly office or a small, dirty laboratory that was overheated and noisy, with flickering lights and a bad smell. As expected, participants reported lower general life satisfaction in the unpleasant room than in the pleasant room, in line with the moods induced by the experimental rooms. In contrast, they reported higher housing satisfaction in the unpleasant than in the pleasant room, consistent with the assumption that the rooms served as salient standards of comparison.

The evidence here is a study with 22 female students assigned to two conditions (n = 12 and 10 per condition).  The 2 x 2 ANOVA with room (pleasant vs. unpleasant) and satisfaction judgment (life vs. housing) produced a significant interaction of measure and room, F(1,20) = 7.25, p = .014.  The effect for life-satisfaction was significant, F(1,20) = 8.02, p = .010 (reported as p < .005), and not significant for housing satisfaction, F(1,20) = 1.97, p = .18 (reported as p < .09 one-tailed).
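To see how little a sample of 22 constrains the effect size, the following sketch (my illustration, treating the life-satisfaction comparison as a simple two-group design with n = 12 and n = 10) converts the reported F into Cohen’s d with an approximate 95% confidence interval:

```python
# Rough effect size estimate for the life-satisfaction comparison (n = 12 vs. 10),
# treating it as a simple two-group design: F(1, 20) = 8.02.
import math

F, n1, n2 = 8.02, 12, 10
t = math.sqrt(F)                                       # with 1 numerator df, F = t^2
d = t * math.sqrt(1/n1 + 1/n2)                         # Cohen's d, roughly 1.2
se_d = math.sqrt(1/n1 + 1/n2 + d**2 / (2*(n1 + n2)))   # large-sample approximation
print(f"d = {d:.2f}, 95% CI [{d - 1.96*se_d:.2f}, {d + 1.96*se_d:.2f}]")
# The interval spans from a modest effect to a huge effect; the study cannot
# tell us how strong the effect actually is.
```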

This weak evidence from a single study with a very small sample is used to conclude that life-satisfaction judgments and domain satisfaction judgments may diverge. However, numerous studies have shown high correlations between average domain satisfaction judgments and global life-satisfaction judgments (Andrews & Withey, 1976; Schimmack & Oishi, 2005; Zou et al., 2013). This finding could not occur if respondents used mood for life-satisfaction judgments and other information for domain satisfaction judgments. Yet readers are not informed about this finding, which undermines Schwarz and Strack’s model of well-being reports and casts doubt on the claim that the same information has opposite effects on global life-satisfaction judgments and domain-specific judgments. This may happen in highly artificial laboratory conditions, but it does not happen often in normal survey contexts.

[The Relative Salience of Mood and Competing Information]

If recalling a happy or sad life event elicits a happy or sad mood at the time of recall, however, respondents are likely to rely on their feelings rather than on recalled content as a source of information. This overriding impact of current feelings is likely to result in mood-congruent reports of SWB, independent of the mental construal variables discussed earlier. The best evidence for this assumption comes from experiments that manipulated the emotional involvement that subjects experienced while thinking about past life events.

This section introduces a qualification of the earlier claim that recall of events in the remote past leads to a contrast effect. Here the claim is that recalling a positive event from the remote past (a happy time with a deceased spouse) will not lead to a contrast effect (intensifying the dissatisfaction of a bereaved person) if the recall of the event triggers an actual emotional experience (my life these days is good because I feel good when I think about the good times in the past). The problem with this theory is that it is inconsistent with the earlier claim that people will discount their current feelings if they think they are irrelevant. If respondents do not use mood to judge their lives when they attribute it to the weather, why would they use their feelings when those feelings are triggered by recall of an emotional event from their past? Why would a widower evaluate his current life more favorably when he is recalling the good times with his wife?

Even if this were a reliable finding, it would be practically irrelevant for actual ratings of life-satisfaction because respondents are unlikely to recall specific events in sufficient detail to elicit strong emotional reactions. The studies that demonstrated the effect instructed participants to do so, but under normal circumstances participants make judgments very quickly, often without recalling detailed, specific emotional episodes. In fact, even the studies that showed these effects provided only weak evidence that recall of emotional events had notable effects on mood (Strack et al., 1985).

[REPORTING THE JUDGMENT]

Self-presentation and social desirability concerns may arise at the reporting stage, and respondents may edit their private judgment before they communicate it

True. All subjective ratings are susceptible to reporting styles. This is why it is important to corroborate self-ratings of well-being with other evidence such as informant ratings of well-being.  However, the problem of reporting biases would be irrelevant, if the judgment without these biases is already valid. A large literature on reporting biases in general shows that these biases account for a relatively small amount of the total variance in ratings. Thus, the key question remains whether the remaining variance provides meaningful information about respondents’ subjective evaluations of their lives or whether this variance reflects highly unreliable and context-dependent information that has no relationship to individuals’ subjective well-being.

[A JUDGMENT MODEL OF SUBJECTIVE WELL-BEING]

Figure 4.2 summarizes the processes reviewed in this chapter. If respondents are asked to report their happiness and satisfaction with their “life as a whole,” they are likely to base their judgment on their current affective state; doing so greatly simplifies the judgmental task.

As noted before, this would imply that global well-being reports are highly unstable and strongly correlated with measures of current mood, but the empirical evidence does not support these predictions. Current mood has a small effect on global well-being reports (Eid & Diener, 2004), and these reports are highly stable (Schimmack & Oishi, 2005) and are predicted by personality traits even when the traits were measured a decade before the well-being reports (Costa & McCrae, 1980).

If the informational value of their affective state is discredited, or if their affective state is not pronounced and other information is more salient, they are likely to use a comparison strategy. This is also the strategy that is likely to be used for evaluations of less complex specific life domains.

Schwarz and Strack’s model would allow for weak mood effects. We only have to make the plausible assumption that respondents often have other information to judge their lives and that they find this information more relevant than their current feelings.  Therefore, this first stage of the judgment model is consistent with evidence that well-being judgments are only weakly correlated with mood and highly stable over time.

When using a comparison strategy, individuals draw on the information that is chronically or temporarily most accessible at that point in time. 

Apparently the term “comparison strategy” is now used to refer to the retrieval of any information rather than to an active comparison that takes place during the judgment process. Moreover, it is suddenly equally plausible that participants draw on chronically accessible information or on temporarily accessible information. Although the authors did not review evidence that would support the use of chronically accessible information, their model clearly allows for it.

Whether information that comes to mind is used in constructing a representation of the target  “my life now” or a representation of a relevant standard depends on the variables that govern the use of information in mental construal (Schwarz and Bless 1992a; Strack 1992). 

This passage suggests that participants have to go through the process of evaluating their life each time they are asked to make a well-being report. They have to construct what their life is like and what they want from life, and then make a comparison. However, it is also possible that they can draw on previous evaluations of life domains (e.g., I hate my job, I am healthy, I love my wife, etc.). As life-satisfaction judgments are made rather quickly, within a few seconds, it seems more plausible that some pre-established evaluations are retrieved than that complex comparison processes take place at the time of judgment.

If the accessibility of information is due to temporary influences, such as preceding questions in a questionnaire, the obtained judgment is unstable over time and a different judgment will be obtained in a different context.

This statement makes it obvious that retest correlations provide direct evidence about the use of temporarily accessible information. Importantly, low retest stability could be caused by several factors (e.g., random responding), so we cannot verify that participants rely on temporarily accessible information when retest correlations are low. However, we can use high retest stability to falsify the hypothesis that respondents rely heavily on temporarily accessible information, because the theory makes the opposite prediction. It is therefore highly relevant that global well-being reports show high retest correlations. Based on this solid empirical evidence, we can infer that responses are not heavily influenced by temporarily accessible information (Schimmack & Oishi, 2005).
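The logic can be illustrated with a small simulation (my own sketch, with made-up variance proportions): the larger the share of a judgment that reflects occasion-specific, temporarily accessible information, the lower the maximum possible retest correlation.

```python
# Sketch: retest correlations put an upper bound on how much of a judgment can be
# occasion-specific (temporarily accessible) information.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
trait = rng.normal(size=n)          # chronically accessible (stable) component

def retest_r(temp_share):
    """Simulate two measurement occasions; temp_share = proportion of variance
    due to occasion-specific information."""
    w_stable, w_temp = np.sqrt(1 - temp_share), np.sqrt(temp_share)
    t1 = w_stable * trait + w_temp * rng.normal(size=n)
    t2 = w_stable * trait + w_temp * rng.normal(size=n)
    return np.corrcoef(t1, t2)[0, 1]

for share in (0.2, 0.5, 0.8):
    print(f"{share:.0%} occasion-specific variance -> retest r ~ {retest_r(share):.2f}")
# If judgments were dominated by temporarily accessible information, retest
# correlations would have to be low; high observed stability contradicts this.
```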

On the other hand, if the accessibility of information reflects chronic influences such as current concerns or life tasks, or stable characteristics of the social environment, the judgment is likely to be less context dependent.

This implies that high retest correlations are consistent with the use of chronically accessible information, but high retest correlations do not prove that participants use chronically accessible information. It is also possible that stable variance is due to reporting styles. Thus, other information is needed to test the use of chronically accessible information. For example, agreement in well-being reports by several raters (self, spouse, parent, etc.) cannot be attributed to response styles and shows that different raters rely on the same chronically accessible information to provide well-being reports (Schneider & Schimmack, 2012).

The size of context-dependent assimilation effects increases with the amount and extremity of the temporarily accessible information that is included in the representation of the target. 

This part of the model would explain why experiments and naturalistic studies often produce different results. Experiments make temporarily accessible information extremely salient, which may lead participants to use it. In contrast, such extremely salient information is typically absent in naturalistic studies, which explains why chronically accessible information is used. The results are only inconsistent if results from experiments with extreme manipulations are generalized to normal contexts without these extreme conditions.

[METHODOLOGICAL IMPLICATIONS]

Our review emphasizes that reports of well-being are subject to a number of transient influences. 

This is correct. The review emphasized evidence from the authors’ experimental research that showed potential threats to the validity of well-being judgments. The review did not examine how serious these threats are for the validity of well-being judgments.

Although the information that respondents draw on reflects the reality in which they live, which aspects of this reality they consider and how they use these aspects in forming a judgment is profoundly influenced by features of the research instrument.

This statement is blatantly false.  The reviewed evidence suggests that the testing situation (a confederate, a room) or an experimental manipulation (recall positive or negative events) can influence well-being reports. There was very little evidence that the research instrument influenced well-being reports and there was no evidence that these effects are profound.

[Implications for Survey Research]

The reviewed findings have profound methodological implications.

This is wrong. The main implication is that researchers have to consider a variety of potential threats to the validity of well-being judgments. All of these threats can be reduced and many survey studies do take care to avoid some of these potential problems.

First, the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information.

As noted earlier, this was only true in two studies by the authors. Other studies do not replicate this finding.

Moreover, questionnaire design variables, like the presence or absence of a joint lead-in to related questions, determine how respondents use the information that comes to mind. As a result, mean reported well-being may differ widely, as seen in many of the reviewed examples

The dramatic shifts in means are limited to experimental studies that manipulated lead-ins to demonstrate these effects. National representative surveys show very similar means year after year.

Moreover, the correlation between an objective condition of life (such as dating frequency) and reported SWB can run anywhere from r = –.1 to r = .6, depending on the order in which the same questions are asked (Strack et al. 1988), suggesting dramatically different substantive conclusions.

Moreover?  This statement just repeats the first false claim that question order has profound effects on life-satisfaction judgments.

Second, the impact of information that is rendered accessible by preceding questions is attenuated the more the information is chronically accessible (see Schwarz and Bless 1992a).

So how can we see pronounced item-order effects for marital satisfaction if marital satisfaction is a highly salient and chronically accessible aspect of married people’s lives? This conclusion directly undermines the previous claim that item order has profound effects.

Third, the stability of reports of SWB over time (that is, their test-retest reliability) depends on the stability of the context in which they are assessed. The resulting stability or change is meaningful when it reflects the information that respondents spontaneously consider because the same, or different, concerns are on their mind at different points in time. 

There is no support for this claim. If participants draw on chronically accessible information, which the authors’ model allows, the judgments do not depend on the stability of the context because chronically accessible information is by definition context-independent.

Fourth, in contrast to influences of the research instrument, influences of respondents’ mood at the time of judgment are less likely to result in systematic bias. The fortuitous events that affect one respondent’s mood are unlikely to affect the mood of many others.

This is true, but it would still undermine the validity of the judgments.  If participants rely on their current mood, variation in these responses will be unreliable and unreliable measures are by definition invalid. Moreover, the average mood of participants during the time of a survey is also not a valid measure of average well-being. So, even though mood effects may not be systematic, they would undermine the validity of well-being reports. Fortunately, there is no evidence that mood has a strong influence on these judgments, while there is evidence that participants draw on chronically accessible information from important life domains (Schimmack & Oishi, 2005).

Hence, mood effects are likely to introduce random variation.

Yes, this is a correct prediction, but evidence contradicts this prediction, and the correct conclusion is that mood does not introduce a lot of random variation in well-being reports because it is not heavily used by respondents to evaluate their lives or specific aspects of their lives.

Fifth, as our review indicates, there is no reason to expect strong relationships between the objective conditions of life and subjective assessments of well-being under most circumstances.

There are many reasons not to expect strong correlations between life-events and well-being reports. One reason is that a single event is only a small part of a whole life and that few life events have such dramatic effects on life-satisfaction that they make any notable contribution to life-satisfaction judgments.  Another reason is that well-being is subjective and the same life event can be evaluated differently by different individuals. For example, the publication of this review in a top journal in psychology would have different effects on my well-being and on the well-being of Schwarz and Strack.

Specifically, strong positive relationships between a given objective aspect of life and judgments of SWB are likely to emerge when most respondents include the relevant aspect in the representation that they form of their life and do not draw on many other aspects. This is most likely to be the case when (a) the target category is wide (“my life as a whole”) rather than narrow (a more limited episode, for example); (b) the relevant aspect is highly accessible; and (c) other information that may be included in the representation of the target is relatively less accessible. These conditions were satisfied, for example, in the Strack, Martin, and Schwarz (1988) dating frequency study, in which a question about dating frequency rendered this information highly accessible, resulting in a correlation of r = .66 with evaluations of the respondent’s life as a whole. Yet, as this example illustrates, we would not like to take the emerging correlation seriously when it reflects only the impact of the research instrument, as indicated by the fact that the correlation was r = –.1 if the question order was reversed.

The unrepresentative extreme result from Strack’s study is used again as evidence, when other studies do not show the effect (Schimmack & Oishi, 2005).

Finally, it is worth noting that the context effects reviewed in this chapter limit the comparability of results obtained in different studies. Unfortunately, this comparability is a key prerequisite for many applied uses of subjective social indicators, in particular their use in monitoring the subjective side of social change over time (for examples see Campbell 1981; Glatzer and Zapf 1984).

This claim is incorrect. The experimental demonstrations of effects under the artificial conditions that were needed to manipulate judgment processes do not have direct implications for the way participants actually judge well-being. The authors’ model allows chronically accessible information to have a strong influence on these judgments under less extreme and less artificial conditions, and the model makes predictions that are disconfirmed by evidence of high stability and low correlations with mood.

[Which Measures Are We to Use?]

By now, most readers have probably concluded that there is little to be learned from self-reports of global well-being.

If so, the authors succeeded with their biased presentation of the evidence in convincing readers that these reports are highly susceptible to a host of context effects that make the outcome of the judgment process essentially unpredictable. Readers would be surprised to learn that the well-being reports of twins who never met are positively correlated (Lykken & Tellegen, 1996).

Although these reports do reflect subjectively meaningful assessments, what is being assessed, and how, seems too context dependent to provide reliable information about a population’s well-being, let alone information that can guide public policy (but see Argyle, this volume, for a more optimistic take).

The claim that well-being reports are too context dependent to provide reliable information about a population’s well-being is false for several reasons. First, the authors did not show that well-being reports are context dependent under normal conditions. They showed that, with very extreme manipulations in highly contrived and unrealistic contexts, judgments moved around statistically significantly in some studies. They did not show that these shifts are large, which would require larger samples to estimate effect sizes. They did not show that these effects have a notable influence on well-being reports in actual surveys of populations’ well-being. Moreover, they already pointed out that some of these effects (e.g., mood effects) would only add random noise, which would lower the reliability of individuals’ well-being reports but would not alter the mean of a sample when responses are aggregated. Last but not least, the authors blatantly ignore evidence (reviewed in this volume by Diener and colleagues) that nationally representative samples show highly reliable variation across nations, and that this variation is highly correlated with objective life circumstances such as nations’ wealth.

In short, Schwarz and Strack’s claims are not scientifically founded and merely express the authors’ pessimistic take on the validity of well-being reports. This pessimistic view is a direct consequence of a myopic focus on laboratory experiments that were designed to invalidate well-being reports, and of ignoring evidence from actual surveys that are more suitable for examining the reliability and validity of well-being reports under the naturalistic conditions in which they are normally provided.

As an alternative approach, several researchers have returned to Bentham’s (1789/1948) notion of happiness as the balance of pleasure over pain (for examples, see Kahneman, this volume; Parducci 1995).

This statement ignores the important contribution of Diener (1984), who argued that the concept of well-being may consist of life evaluations as well as the balance of pleasure over pain, or Positive Affect and Negative Affect, as these constructs are called in contemporary psychology. As a result of Diener’s (1984) conception of well-being as a construct with three components, researchers have routinely measured global life-evaluations along with measures of positive and negative affect. A key finding is that these measures are highly correlated, although not identical (Lucas et al., 1996; Zou et al., 2013). Schwarz and Strack ignore this evidence, presumably because it would undermine their view that global life-satisfaction judgments are highly context sensitive and that measures of positive and negative affect would produce notably different results.

END OF REVIEW: CONCLUSIONS

In conclusion, Schwarz and Strack’s (1999) chapter is a prototypical example of several bad scientific practices. First, the authors conduct a selective review of the literature that focuses on one specific paradigm and ignores evidence from other approaches. Second, the review focuses strongly on original studies conducted by the authors themselves and ignores studies by other researchers that produced different results. Third, the original results were often obtained with small samples and there are no independent replications by other researchers, but the results are discussed as if they are generalizable. Fourth, life-satisfaction judgments are influenced by a host of factors, and any study that focuses on one possible predictor of these judgments is likely to account for only a small amount of the variance. Yet the literature review does not take effect sizes into account, and the theoretical model overemphasizes the causes that were studied and ignores causes that were not studied. Fifth, the experimental method has the advantage of isolating single causes, but it has the disadvantage that results cannot be generalized to ecologically valid contexts in which well-being reports are normally obtained. Nevertheless, the authors generalize from artificial experiments to the typical survey context without examining whether their predictions are confirmed. Finally, the authors make broad and profound claims that do not logically follow from their literature review. They suggest that decades of research with global well-being reports can be dismissed because the measures are unreliable, but these claims are inconsistent with a mountain of evidence for the validity of these measures, which the authors willfully ignore (Diener et al., 2009).

Unfortunately, the claims in this chapter were used by Nobel Laureate Daniel Kahneman as arguments to push for an alternative conception and measurement of well-being. In combination, the unscientific review of the literature and the influence of a Nobel Prize winner have had a negative effect on well-being science. The biggest damage to the field has been the illusion that the processes underlying global well-being reports are well understood. In fact, we know very little about how respondents make these judgments and how accurate these judgments are. The chapter lists a number of possible threats to the validity of well-being reports, but it is not clear how much these threats actually undermine the validity of well-being reports and what can be done to reduce biases in these measures to improve their validity. A program that relies exclusively on experimental manipulations that create biases in well-being reports is unable to answer these questions, because well-being judgments can be made in numerous ways and results that are obtained in artificial laboratory contexts may or may not generalize to the context that is most relevant, namely when well-being reports are used to measure well-being.

What is needed is a real scientific program of research that examines accuracy and biases in well-being reports and creates well-being measures that maximize accuracy and minimize biases. This is what all other sciences do when they develop measures of theoretically important constructs. It is time for well-being researchers to act like a normal science. To do so, research on well-being reports needs a fresh start and an objective, scientific review of the empirical evidence regarding the validity of well-being measures.

Dr. R responds to Finkel, Eastwick, & Reis (FER)’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

My response is organized as a commentary on key sections of the article. These sections are quoted directly, to give readers quick and easy access to FER’s arguments and conclusions, and each quotation is followed by my comments. The quotations are printed in bold.

Here, we extend FER2015’s analysis to suggest that much of the discussion of best research practices since 2011 has focused on a single feature of high-quality science—replicability—with insufficient sensitivity to the implications of recommended practices for other features, like discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.

I see replicability as equivalent to the concept of reliability in psychological measurement. Reliability is necessary for validity, which means that a measure needs to be reliable to produce valid results, and this includes internal validity and external validity. And valid results are needed to create a solid body of research that provides the basis for a cumulative science.

Take life-satisfaction judgments as an example. In a review article, Schwarz and Strack (1999) claimed that life-satisfaction judgments are unreliable and extremely sensitive to context, and that responses can change dramatically as a function of characteristics of the survey questions. Do we think a measure with low reliability can be used to study well-being and to build a cumulative science of well-being? No. It seems self-evident that reliable measures are better than unreliable measures.

The reason why some measures are not reliable is that scores on the measure are influenced by factors that are difficult or too expensive to control. As a result, these factors have an undesirable effect on responses. The effect is not systematic, or it is too difficult to study the systematic effects, and therefore results will randomly change when the same measure is used again and again. We can assess the influence of these random factors by administering the same measurement procedure repeatedly and seeing how much scores change (in the absence of real change).

The same logic applies to replicability. Replicability means that we get the same result if we repeat a study again and again. Just as scores on a psychological measure can change, the results of even exact replication studies will not be the same. The reason is the same: random factors that are outside the control of the experimenter influence the results that are obtained in a single study. Hence, we cannot expect that exact replication studies will always produce the same results. For example, the gender ratio in a psychology class will not be the same year after year, even if there is no real change in the gender ratio of psychology students over time.

So what does it even mean for a result to be replicable, that is, for a replication study to produce the same result as the original study? It depends on the interpretation of the results of an original study. A professor interested in the gender composition of psychology could compute the gender ratio for each year. The exact number would vary from year to year. However, the researcher could also compute a 95% confidence interval around these numbers. This interval specifies the amount of variability that is expected by chance. We may then say that a study is replicable if subsequent studies produce results that are compatible with the 95% confidence interval of the original study. In contrast, low replicability would mean that results vary from study to study. For example, in one year the gender ratio is 70% female (+/- 10% 95% CI), in the next year it is 25% female (again +/- 10%), and the following year it is 99% (+/- 10%). In this case, the gender ratio jumps around dramatically, the result from one study cannot be used to predict gender ratios in other years, and there is no solid empirical foundation for theories of the effect of gender on interest in psychology.
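As a concrete version of the gender-ratio example, here is a minimal sketch (my own, with hypothetical class sizes) that computes the 95% confidence interval for a proportion and checks whether a later year’s ratio falls inside it:

```python
# Hypothetical gender-ratio example: 95% confidence interval for a proportion
# and a check whether next year's ratio is compatible with it.
from statsmodels.stats.proportion import proportion_confint, proportions_ztest

n_year1, women_year1 = 100, 70                     # made-up numbers: 70% women
low, high = proportion_confint(women_year1, n_year1, alpha=0.05)
print(f"Year 1: 70% women, 95% CI [{low:.2f}, {high:.2f}]")      # roughly [.61, .79]

ratio_year2 = 0.74                                 # made-up replication result
print("compatible with year 1" if low <= ratio_year2 <= high else "outside the original CI")

# The same data also reject a 50/50 split:
z, p = proportions_ztest(women_year1, n_year1, value=0.5)
print(f"test against 50%: z = {z:.2f}, p = {p:.4f}")
```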

Using this criterion of replicability, many results in psychology are highly replicable. The problem is that, using this criterion, many results in psychology are also not very informative, because effect sizes tend to be small relative to the width of confidence intervals (Cohen, 1994). With a standardized effect size of d = .4 and a typical confidence interval width of about one d-unit (se = .25), the typical finding in psychology is that the effect size ranges from d = -.1 to d = .9. This means the typical result is consistent with a small effect in the opposite direction from the one in the sample (chocolate eating leads to weight gain, even if my study shows that chocolate eating leads to weight loss) and with very large effects in the same direction (chocolate eating is a highly effective way of losing weight). Most important, the result is also consistent with the null-hypothesis (chocolate eating has no effect on weight, which in this case would be a sensational and important finding that would make Willy Wonka very happy). I hope this example makes the point that it is not very informative to conduct studies of small effect sizes with wide confidence intervals, because we do not learn much from these studies. Mostly, we are not more informed about a research question after looking at the data than we were before we looked at the data.
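The arithmetic behind these numbers is simple. The sketch below (my own, assuming two groups of about 32 participants each, which yields a standard error near .25 for d = .4) reproduces the interval:

```python
# A 'typical' study: d = .4 with about 32 participants per group gives se ~ .25
# and a confidence interval roughly one d-unit wide.
import math

d, n1, n2 = 0.4, 32, 32
se = math.sqrt(1/n1 + 1/n2 + d**2 / (2*(n1 + n2)))     # ~ 0.25
lo, hi = d - 1.96*se, d + 1.96*se
print(f"d = {d}, se = {se:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# -> roughly [-.09, .89]: compatible with a small negative effect, no effect,
#    and a very large positive effect.
```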

Not surprisingly, psychology journals do not publish findings like d = .2 +/- .8. The typical criterion for reporting a newsworthy result is that the confidence interval falls entirely into one of two regions: effect sizes less than zero or effect sizes greater than zero. If the 95% CI falls in one of these two regions, it is possible to say that there is a maximum error rate of only 5% when we infer from a confidence interval in the positive region that the actual effect size is positive, and from a confidence interval in the negative region that the actual effect size is negative. In other words, it wasn’t just random factors that produced a positive effect in a sample when the actual effect size is 0 or negative, and it wasn’t just random factors that produced a negative effect when the actual effect size is 0 or positive. To examine whether the results of a study provide sufficient information to claim that an effect is real and not just due to random factors, researchers compute p-values and check whether the p-value is less than 5%.

If the original study reported a significant result to make inferences about the direction of an effect, and replicability is defined as obtaining the same result, then replicability means obtaining a significant result again in the replication study. The famous statistician Sir Ronald Fisher made replicability a criterion for a good study: “A properly designed experiment rarely fails to give … significance” (Fisher, 1926, p. 504).

What are the implications of replication studies that do not replicate a significant result? These studies are often called failed replication studies, but this term is unfortunate because the study was not a failure. Maybe we should call them unsuccessful replication studies, although I am not sure this term is much better. The problem with unsuccessful replication studies is that there is a host of reasons why a replication study might fail. This means that additional research is needed to uncover why the original study and the replication study produced different results. In contrast, if a series of studies produces significant results, it is highly likely that the result is a real finding and can be used as an empirical foundation for theories. For example, the gender ratio in my PSY230 course is always significantly different from the 50/50 split that we might expect if both genders were equally interested in psychology. This shows that my study, which recorded the gender of students and compared the ratio of men and women against a fixed probability of 50%, meets at least one criterion of a properly designed experiment: it rarely fails to reject the null-hypothesis.

In short, it is hard to argue with the proposition that replicability is an important criterion for a good study.  If study results cannot be replicated, it is not clear whether a phenomenon exists, and if it is not clear whether a phenomenon exists, it is impossible to make theoretical predictions about other phenomena based on this phenomenon.  For example, we cannot predict gender differences in professions that require a psychology degree if we do not have replicable evidence that there is a gender difference in psychology students.

The present analysis extends FER2015’s “error balance” logic to emphasize tradeoffs among features of a high-quality science (among scientific desiderata). When seeking to optimize the quality of our science, scholars must consider not only how a given research practice influences replicability, but also how it influences other desirable features.

A small group of social relationship researchers (Finkel, Eastwick, & Reis; henceforth FER) are concerned about the recent shift in psychology from a scientific discipline that ignored replicability entirely to a field that actually cares about the replicability of results published in original research articles. Although methodologists have criticized psychology for a long time, it was only after Bem (2011) published extraordinarily unbelievable results that psychologists finally started to wonder how replicable published results actually are. In response to this new focus on replicability, several projects have conducted replication studies with shocking results. In FER’s research area, replicability is estimated to be as low as 25%. That is, three-quarters of published results are not replicable and require further research efforts to examine why original studies and replication studies produced inconsistent results. In a large-scale replication study, one of the authors’ original findings failed to replicate, and the replication studies cast doubt on theoretical assumptions about the determinants of forgiveness in close relationships.

FER ask “Should Scientists Consistently Prioritize Replicability Above Other Core Features?”

As FER are substantive researchers with little background in research methodology, it may be understandable that they do not mention important contributions by methodologists like Jacob Cohen. Cohen’s answer is clear: less is more, except for sample size. This statement makes it clear that replicability is necessary for a good study. According to Cohen, a study design can be perfect in many ways (e.g., an elaborate experimental manipulation of real-world events with highly valid outcome measures), but if the sample size is small (e.g., N = 3), the study simply cannot produce results that can be used as an empirical foundation for theories. If a study cannot reject the null-hypothesis with some degree of confidence, it is impossible to say whether there is a real effect or whether the result was just caused by random factors.

Unfazed by their lack of knowledge about research methodology, FER take a different view.

In our view, the field’s discussion of best research practices should revolve around how we prioritize the various features of a high-quality science and how those priorities may shift across our discipline’s many subfields and research contexts.

Similarly, requiring very large sample sizes increases replicability by reducing false-positive rates and increases cumulativeness by reducing false-negative rates, but it also reduces the number of studies that can be run with the available resources, so conceptual replications and real-world extensions may remain unconducted.

So, who is right? Should researchers follow Cohen’s advice and conduct a small number of studies with large samples, or is it better to conduct a large number of studies with small samples? If resources are limited and a researcher can collect data from 500 participants in one year, should the researcher conduct one study with N = 500, five studies with N = 100, or 25 studies with N = 20? FER suggest that we have a trade-off between replicability and discoveries.

Also, large sample size norms and requirements may limit the feasibility of certain sorts of research, thereby reducing discovery.

This is true if we count both true and false discoveries as discoveries (FER do not make a distinction). Bem (2011) discovered that human minds can time travel. This was a fascinating discovery, yet it was a false discovery. Bem (2001) himself advocated the view that all discoveries are valuable, even false discoveries (“Let’s err on the side of discovery.”). Maybe FER learned about research methods from Bem’s chapter. Most scientists and lay people, however, value true discoveries over false discoveries. Many people would feel cheated if the Moon landing had actually been faked, for example, or if billions spent on cancer drugs were not helping to fight cancer (it really was just eating garlic). So, the real question is whether many studies with small samples produce more true discoveries than a single study with a large sample.

This question was examined by LeBel, Campbell, and Loving (2015), who concluded largely in favor of Cohen’s recommendation that a slow approach with fewer studies and high replicability is advantageous for a cumulative science.

For example, LCL2016’s Table 3 shows that the N-per-true discovery decreases from N=1,742 when the original research is statistically powered at 25% to N=917 when the original research is statistically powered at 95%.

FER criticize that LCL focused on the efficient use of resources for replication studies and ignored the efficient use of resources for original research. As researchers often conduct more than one study on a particular research question, the distinction between original researchers and replication researchers is artificial. Ultimately, researchers may conduct a number of studies. The studies can be totally new, conceptual replications of previous studies, or exact replications of previous studies. A significant result will always be used to claim a discovery. When a non-significant result contradicts a previous significant result, additional research is needed to examine whether the original result was a false discovery or whether the replication result was a false negative.

FER observe that “original researchers will be more efficient (smaller N-per-true discovery) when they prioritize lower-powered studies. That is, when assuming that an original researcher wishes to spend her resources efficiently to unearth many true effects, plans never to replicate her own work, and is insensitive to the resources required to replicate her studies, she should run many weakly powered studies.”

FER may have discovered why some researchers, including themselves, pursue a strategy of conducting many studies with relatively low power: it produces many discoveries that can be published. These researchers also produce many non-significant results that do not lead to a discovery, but the absolute number of true discoveries is still likely to be greater than the single true discovery of a researcher who conducted only one study. The problem is that they are also likely to make more false discoveries than the researcher who conducts only one study. They simply make more discoveries, true and false, and replication studies are needed to examine which of these discoveries are true and which are false. When other researchers conduct replication studies and fail to replicate an effect, further resources are needed to examine why the replication study produced a non-significant result. However, this is not a problem for discoverers who are only in the business of testing new and original hypotheses and reporting those that produced a significant result, leaving it to other researchers to examine which of these discoveries are true or false. These researchers were rewarded handsomely in the years before Bem (2011), because nobody wanted to be in the business of conducting replication studies. As a result, all discoveries produced by original researchers were treated as if they would replicate, and researchers with a high number of discoveries were treated as researchers with more true discoveries. There simply was no distinction between true and false discoveries, and it made sense to err on the side of discovery.

Given the conflicting efficiency goals between original researchers and replicators, whose goals shall we prioritize?

This is a bizarre question. The goal of science is to uncover the truth and to create theories that rest on a body of replicable, empirical findings. Apparently, this is not the goal of original researchers. Their goal is to make as many discoveries as possible and to leave it to replicators to test which of these discoveries are replicable. This division of labor is not very appealing, and few scientists want to be the maid of original scientists and clean up their mess when they do cooking experiments in the kitchen. Original researchers should routinely replicate their own results, and when they do so with small studies, they suddenly face the replicators’ problem: they end up with non-significant results and have to conduct further studies to uncover the reasons for these discrepancies. FER seem to agree.

We must prioritize the field’s efficiency goals rather than either the replicator’s or the original researcher’s in isolation. The solid line in Figure 2 illustrates N-per-true-discovery from the perspective of the field—when the original researcher’s 5,000 participants are added to the pool of participants used by the replicator. This line forms a U-shaped pattern, suggesting that the field will be more efficient (smaller N-per-true-discovery) when original researchers prioritize moderately powered studies.

This conclusion is already implied in Cohen’s power calculations. The reason is that studies with very low power have a low chance of producing a significant result. As a result, resources are wasted on these studies, and it would have been better not to conduct them, especially when we take into account that each study requires a new ethics approval, training of personnel, data analysis time, and so on. All of these costs multiply with the number of studies that are conducted to get a significant result. At the other extreme, power increases as a log-function of sample size. This means that once power has reached a certain level, it requires more and more resources to increase power even further. Moreover, 80% power means that 8 out of 10 studies are significant, and 90% power means that 9 out of 10 studies are significant. The extra cost of increasing power to 90% may not warrant the increase in success rate from 8 to 9 out of 10 studies. For this reason, Cohen did not really suggest that infinite sample sizes are optimal. Instead, he suggested that researchers should aim for 80% power; that is, 4 out of 5 studies that examine a real effect show a significant result.
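For readers who want to see the numbers, here is a brief sketch (my addition, using the power module in statsmodels for a two-group comparison with d = .4) that shows the diminishing returns of pushing power beyond 80%:

```python
# Sample size per group needed for d = .4, alpha = .05, two-tailed, at various power levels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for power in (0.5, 0.8, 0.9, 0.95):
    n = analysis.solve_power(effect_size=0.4, alpha=0.05, power=power,
                             alternative='two-sided')
    print(f"power = {power:.0%}: n ~ {n:.0f} per group (total N ~ {2*n:.0f})")
# Going from 80% to 90% power requires roughly a third more participants for one
# additional significant result in ten attempts.
```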

However, FER’s simulations come to a different conclusion.  Their Figure suggests that studies with 30% power are just as good as studies with 70% power and could be even better than studies with 80% power.

For example, if a hypothesis is 75% likely to be true, which might be the case if the finding had a strong theoretical foundation, the most efficient use of field-wide N appears to favor power of ~25% for d=.41 and ~40% for d=.80.

The problem with taking these results seriously is that the criterion N-per-true-discovery does not take into account the costs of a type-I error. Conducting many studies with small samples and low power can produce a larger number of significant results than a smaller number of studies with large samples, simply due to the larger number of studies. However, it also implies a higher rate of false positives. Thus, it is important to take the seriousness of a type-I error or a type-II error into account.

So, let’s use a scenario in which original results need to be replicated. In fact, many journals require at least two, if not more, significant results to provide evidence for an effect. The researcher who conducts many studies with low power has a problem, because the probability of obtaining two significant results in a row is only power squared. Even if a single significant result is reported, other researchers need to replicate this finding, and many of these replication studies will fail until eventually a replication study with a significant result corroborates the original finding.

In a simulation with d = .4 and an equal proportion of null-hypotheses and real effects, a researcher with 80% power (N = 200, d = .4, alpha = .05, two-tailed) needs about 900 participants for every true discovery. A researcher with 20% power (N = 40, d = .4, alpha = .05, two-tailed) needs about 1,800 participants for every true discovery.

When the rate of true null-hypotheses decreases, the number of true discoveries increases and it becomes easier to make true discoveries. Nevertheless, the advantage of high-powered studies remains: they require about half as many participants per true discovery as low-powered studies (N = 665 vs. 1,157).
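The exact simulation procedure is not spelled out above, so here is a minimal sketch that reproduces numbers in this ballpark under explicit assumptions: a two-sample t-test with alpha = .05 (two-tailed), a replication of the same size that is run only after a significant original result, and a "true discovery" that requires both the original study and the replication to be significant. The function name and the analytic shortcut are my own; this is an illustration, not the original simulation code.

n.per.true.discovery <- function(N, d, alpha = .05, p.true = .5) {
  # N = total sample size per study (two groups of N/2)
  # p.true = proportion of tested hypotheses that are true effects of size d
  pow <- power.t.test(n = N / 2, delta = d, sig.level = alpha)$power
  # expected participants per tested hypothesis: original study plus a replication
  # that is only run after a significant original result
  exp.n <- p.true * (N + pow * N) + (1 - p.true) * (N + alpha * N)
  # true discoveries (original and replication both significant) per tested hypothesis
  disc <- p.true * pow^2
  exp.n / disc
}
n.per.true.discovery(N = 200, d = .4)                # ~ 880, close to the "about 900" above
n.per.true.discovery(N = 40,  d = .4)                # ~ 1,670, in the ballpark of the "about 1,800" above
n.per.true.discovery(N = 200, d = .4, p.true = .75)  # ~ 665 when 75% of tested hypotheses are true
n.per.true.discovery(N = 40,  d = .4, p.true = .75)  # ~ 1,155, close to the 1,157 above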

The reason for the discrepancy between my results and FER’s results is that they do not take replicability into account. This is ironic because their title suggests that they are going to write about replicability, when they actually ignore that results from small studies with low power have low replicability. If we only try to get a result once, it can be more efficient to do so with small, underpowered studies, because random sampling error will often dramatically inflate effect sizes and produce a significant result. However, this inflation is not replicable, and replication studies are likely to produce non-significant results and cast doubt on the original finding. In other words, FER ignore the defining characteristic of replicability: that replication studies of the same effect should produce significant results again. Their argument is therefore fundamentally flawed. Low-powered studies are less replicable, and original studies that are not replicable make it impossible to create a cumulative science.

The problems of underpowered studies multiply in a research environment that rewards publication of discoveries, whether they are true or false, and provides no incentives for researchers to publish non-significant results, even if these non-significant results challenge the significant results of an original article. Rather than treating these unsuccessful replications as a warning sign that the original results might have been false positives, the non-significant result is treated as evidence that the replication study must have been flawed; after all, the original study found the effect, and the replication study might just have had low power even though the effect exists. As a result, false positive results can poison theory development, because theories have to explain findings that are actually false positives, and researchers continue to conduct unsuccessful replication studies because they are unaware that other researchers have already failed to replicate an original false positive result. These problems have been discussed at length in recent years, but FER blissfully ignore these arguments and discussions.

Since 2011, psychological science has witnessed major changes in its standard operating procedures—changes that hold great promise for bolstering the replicability of our science. We have come a long way, we hope, from the era in which editors routinely encouraged authors to jettison studies or variables with ambiguous results, the file drawer received only passing consideration, and p<.05 was the statistical holy of holies. We remain, as in FER2015, enthusiastic about such changes. Our goal is to work alongside other meta-scientists to generate an empirically grounded, tradeoff-based framework for improving the overall quality of our science.

That sounds good, but it is not clear what FER bring to the table.

We must focus greater attention on establishing which features are most important in a given research context, the extent to which a given research practice influences the alignment of a collective knowledge base with each of the relevant features, and, all things considered, which research practices are optimal in light of the various tradeoffs involved. Such an approach will certainly prioritize replicability, but it will also prioritize other features of a high-quality science, including discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.

What is lacking here is a demonstration that it is possible to prioritize internal validity, external validity, consequentiality, and cumulativeness without replicability. How do we build on results that emerge only in one out of two, three, or five studies, let alone 1 out of 10 studies?  FER create the illusion that we can make more true discoveries by conducting many small studies with low power. This is true only in the limited sense of needing fewer participants for an initial discovery. But their own criterion of cumulativeness implies that we are not interested in a single finding that may or may not replicate. To build on original findings, others should be able to redo a study and get a significant result again. This is what Fisher had in mind and what Neyman and Pearson formalized into power analysis.

FER also overlook a much simpler solution to balance the rate of original discovery and replicability. Namely, researchers can increase the type-I error rate from the conventional 5% criterion to 20% (or more). As the type-I error rate increases, power increases. At the same time, readers are properly warned that the results are only suggestive, definitely require further research, and cannot be treated as evidence that needs to be incorporated into a theory. Meanwhile, researchers with large samples do not have to waste their resources on rejecting H0 with alpha = .05 and 99.9% power. They can use their resources to make more definitive statements about their data and reject H0 with a p-value that corresponds to 5 standard deviations of a standard normal distribution (the 5-sigma rule in particle physics).
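To make this tradeoff concrete, here is a small sketch with illustrative numbers (not taken from FER or from the article under discussion): relaxing alpha buys a low-powered design considerable power, and the 5-sigma criterion corresponds to a very small p-value.

power.t.test(n = 20, delta = .4, sig.level = .05)$power   # ~ .23 with the conventional criterion
power.t.test(n = 20, delta = .4, sig.level = .20)$power   # ~ .49 with a relaxed criterion
pnorm(-5)                                                 # one-tailed p-value at 5 sigma, ~ 2.9e-07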

No matter what the solution to the replicability crisis in psychology is, the solution cannot be a continuation of the old practice to conduct numerous statistical tests on a small sample and then report only the results that are statistically significant at p < .05.  It is unfortunate that FER’s article can be easily misunderstood as suggesting that using small samples and testing for significance with p < .05 and low power can be a viable research strategy in some specific context.  I think they failed to make their case and to demonstrate in which research context this strategy benefits psychology.


The decline effect in social psychology: Evidence and possible explanations

The decline effect predicts that effects become weaker over time. It has been proposed as a viable explanation for the replication crisis (Lehrer, 2010). However, evidence for the decline effect has been elusive (Schooler, 2011). One major problem, at least in psychology, is that researchers rarely conduct exact replication studies of original studies. In recent years, however, psychologists have started to conduct Registered Replication Reports, in which an original study is replicated by several labs as closely as possible. This makes it possible to examine the decline effect, which predicts that original studies have larger effect sizes than replication studies.

One problem is that studies often have small samples and large sampling error, which makes it difficult to interpret observed effect sizes. One solution to this problem is to focus on the rank of the original effect size relative to the effect sizes in the replication studies. According to the decline effect, effect sizes in original studies should be higher than effect sizes in replication studies. In the most extreme case, the original study would have the largest effect size. If there were 20 studies with identical population effect sizes, the probability that the original study reports the strongest effect is only 1/20 = .05.

Method

I ordered all effect sizes from the original study and the replication studies in decreasing order and then recorded the rank of the original study. R-Code: which(c(1:length(d))[order(d, decreasing = TRUE)] == 1)  # d[1] = effect size of the original study.
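For readers who want to try this, here is the same computation applied to a hypothetical vector of effect sizes (not the actual RRR data), in which the first element is the original study:

d <- c(0.80, 0.10, -0.05, 0.20, 0.00)                     # hypothetical effect sizes; d[1] = original study
which(c(1:length(d))[order(d, decreasing = TRUE)] == 1)   # rank of the original study (here: 1)
rank(-d)[1]                                               # equivalent shortcut when there are no ties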

Results

The results are shown in Table 1. For 5 out of 6 RRRs, the original study reported the largest effect size.  In all of these RRRs, all of the replication studies failed to replicate a significant effect.  Only the second verbal overshadowing RRR produced conclusive evidence for an effect. Yet, the effect size reported in the original study was still the third largest out of 24 studies.  These results provide strong support for the decline effect.

To examine whether this pattern of results could have occurred by chance, I computed the probability of this outcome under the null-hypothesis that all studies have the same population effect size. The chance that the original study has the largest effect size is 1/n, with n = number of studies. These probabilities are very low. For the verbal overshadowing RRR2, the probability that the original study ranks third or better among 24 studies is .12 (1 – (23*22*21)/(24*23*22)). A meta-analysis of the six probabilities with Stouffer’s method provides strong evidence against the null-hypothesis, z = 3.8, p < .0001.

Table 1

Verbal Overshadowing RRR1:  1 out of 33,  p = .03
Verbal Overshadowing RRR2:  3 out of 24,  p = .12
Ego-Depletion:              1 out of 24,  p = .04
Imperfect Action:           1 out of 13,  p = .08
Commitment/Forgiveness:     1 out of 17,  p = .06
Facial Feedback:            1 out of 18,  p = .06
Combined (Stouffer):        1 out of 14,122,  p = .00007
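The Stouffer combination reported above can be reproduced from the probabilities in Table 1 (using the exact fractions rather than the rounded p-values):

p <- c(1/33, 3/24, 1/24, 1/13, 1/17, 1/18)   # probabilities from Table 1
z <- qnorm(1 - p)                            # convert one-tailed p-values into z-scores
z.comb <- sum(z) / sqrt(length(z))           # Stouffer's combined z
z.comb                                       # ~ 3.8
pnorm(z.comb, lower.tail = FALSE)            # ~ .00007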

Discussion

A test of the decline effect with the data from all Registered Replication Reports provides strong evidence for the hypothesis that effect sizes of original studies are larger than those of replication studies and decrease over time.

The same holds for ego-depletion. Initially, performing a difficult task led to a reduction in effort on a second task. But collective consciousness about this effect means that participants are now aware of it and compensate for it by working harder. This theory is consistent with the fact that the decline effect is pervasive in social psychology, but not in other sciences. For example, the effect of eating cheesecake on weight gain has unfortunately not decreased, as the obesity epidemic shows. Also, computers are getting faster, not slower. Thus, not all cause-effect relationships decline over time.

It is only cause-effect relationships of mental processes that collective consciousness can moderate. Thus, the collective consciousness hypothesis suggests that the replication crisis in psychology is not really a crisis of replication; the decline of effects is a real phenomenon. The original studies did make a real discovery, but ironically the discovery made the effect disappear.

Limitations

This study has a number of limitations, and there are alternative explanations for the finding that seminal articles report stronger effect sizes. One possibility is regression to the mean (Fiedler). Regression to the mean implies that an observed effect size in a small sample will not replicate with the same effect size; the next study is more likely to produce a result that is closer to the mean. The problem with this hypothesis is that it does not explain why the mean of the replication studies is often very close to zero. Thus, it fails to explain the mysterious disappearance of effects and the elusive nature of findings in social psychology that make the decline effect so interesting.

Another possible explanation is publication bias. Maybe researchers are simply publishing results that are consistent with their theories and do not publish disconfirming evidence (Sterling, 1959). However, this explanation does not account for the fact that, at the time of the original studies, other studies reported successful results. In fact, many of the RRR studies were taken from articles that reported several successful studies. The failure to replicate the effect occurred only several years later, when there was sufficient time for collective consciousness to make the effect disappear.

Finally, Schooler (personal communication, 2012) proposed an interesting theory. Astrophysicists have calculated that it is very likely that other intelligent life evolved in other parts of the universe long before human evolution. Like humans now, these intelligent life forms were getting increasingly bored with their limited reality and started building artificially simulated virtual worlds to entertain themselves. At some point, agents in these games were given the illusion of self-consciousness: the illusion that they are real agents with their own goals, feelings, and thoughts. According to this theory, we are not real agents, but virtual agents in a computer game of a much more intelligent life form. Although the simulation software works very well, there are some bugs and glitches that make the simulation behave in strange ways. Often the simulated agents do not notice this, but clever experiments by parapsychologists (Bem, 2011) can sometimes reveal these inconsistencies. Many of the discoveries in social psychology are also caused by these glitches. The effects can be observed for some time, but then a software update makes them disappear. This theory would also explain why original results disappeared in replication studies.

Future Research

It is difficult to distinguish empirically between the collective consciousness hypothesis and the simulated-world hypothesis. However, the two theories make different predictions about findings that do not enter collective consciousness. A researcher could conduct a study, but not analyze the data, and replicate the study 10 years later. Only then would the results of the two studies be analyzed. The collective consciousness hypothesis predicts that there will be no decline effect; the simulated-world hypothesis predicts that the decline effect will emerge. Of course, a single original study is most likely to show no effect, because it is very difficult to find original effects that are subject to the decline effect. Thus, the test requires many studies that will not show any effect, but when original studies do show an effect, it will be very interesting to see whether they replicate. If they do not replicate, this provides evidence for the simulated-world hypothesis that we are just simulated agents in a computer game of a life-form much more intelligent than we think we are. So, I propose that social psychologists carefully plan a series of time-lagged replication studies to answer the most fundamental question of humanity: Do we really exist because we think we do, or is it all a big illusion?



Fritz Strack’s self-serving biases in his personal account of the failure to replicate his most famous study.

[please hold pencil (pen does not work) like this while reading this blog post]

In “Sad Face: Another classic finding in psychology—that you can smile your way to happiness—just blew up. Is it time to panic yet?” by Daniel Engber, Fritz Strack gets to tell his version of the importance of his original study and of what it means that a recent attempt to replicate his original results in 17 independent replication studies failed. In this blog post, I provide my commentary on Fritz Strack’s story to reveal inconsistencies, omissions of important facts, and false arguments used to discount the results of the replication studies.

PART I:  Prior to the Replication of Strack et al. (1988)

In 2011, many psychologists lost confidence in social psychology as a science. One social psychologist had fabricated data at midnight in his kitchen. Another presented incredible results suggesting that people can foresee random events in the future. And finally, a researcher failed to replicate a famous study in which subtle reminders of elderly people made students walk more slowly. A New Yorker article captured the mood of the time: it wasn’t clear which findings one should believe and which would replicate under close scrutiny. In response, psychologists created a new initiative to replicate original findings across many independent labs. A first study produced encouraging results. Many classic findings in psychology (like the anchoring effect) replicated, sometimes even with stronger effect sizes than in the original study. However, some studies didn’t replicate. Especially results from a small group of social psychologists who had built their careers around the idea that small manipulations can have strong effects on participants’ behavior without participants’ awareness (such as the elderly priming study) did not replicate well. The question was which results from this group of social psychologists, who study unconscious or implicit processes, would replicate.

Quote “The experts were reluctant to step forward. In recent months their field had fallen into scandal and uncertainty: An influential scholar had been outed as a fraud; certain bedrock studies—even so-called “instant classics”—had seemed to shrivel under scrutiny. But the rigidity of the replication process felt a bit like bullying. After all, their work on social priming was delicate by definition: It relied on lab manipulations that had been precisely calibrated to elicit tiny changes in behavior. Even slight adjustments to their setups, or small mistakes made by those with less experience, could set the data all askew. So let’s say another lab—or several other labs—tried and failed to copy their experiments. What would that really prove? Would it lead anyone to change their minds about the science?”

The small group of social psychologists felt under attack. They had published hundreds of articles and become famous for demonstrating the influence of unconscious processes that, by definition, are overlooked by people when they try to understand their own behavior, because these processes operate in secrecy, undetected by conscious introspection. What if all of these amazing discoveries were not real? Of course, the researchers were aware that not all studies worked. After all, they often encountered failures to find these effects in their own labs. It often required several attempts to get the right conditions to produce results that could be published. If a group of researchers would just go into the lab and do the study once, how would we know that they did everything right? Given ample evidence of failure in their own labs, nobody from this group wanted to step forward and replicate their own study or subject their study to a one-shot test.

Quote “Then on March 21, Fritz Strack, the psychologist in Wurzburg, sent a message to the guys. “Don’t get me wrong,” he wrote, “but I am not a particularly religious person and I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’ ” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.” So if the skeptics wanted something to examine—a test case to stand in for all of social-psych research—then let them try his work.”

Fritz Strack was not afraid of failure.  He volunteered his most famous study for a replication project.

Quote “ In 1988, Strack had shown that movements of the face lead to movements of the mind. He’d proved that emotion doesn’t only go from the inside out, as Malcolm Gladwell once described it, but from the outside in.”

It is not exactly clear why Strack picked his 1988 study for replication. The article included two studies. The first study produced a result that is called marginally significant. That is, it did not meet the standard criterion of evidence, a p-value less than .05 (two-tailed), but the p-value was very close to .05 and less than .10 (or .05 one-tailed). This finding alone would not justify great confidence in the replicability of the original finding. Moreover, a small study with so much noise makes it impossible to estimate the true effect size with any precision. The observed effect size in the study was large, but this could have been due to luck (sampling error). In a replication study, the effect size could be a lot smaller, which would make it difficult to get a significant result.

The key finding of this study was that manipulating participants’ facial muscles appeared to influence their feelings of amusement in response to funny cartoons without participants’ awareness that their facial muscles contributed to the intensity of the experience.  This finding made sense in the context of a long tradition of theories that assumed feedback from facial muscles plays an important role in the experience of emotions. 

Strack seemed to be confident that his results would replicate because many other articles also reported results that seemed to support the facial feedback hypothesis.  His study became famous because it used an elaborate cover story to ensure that the effect occurred without participants’ awareness.

Quote: “In lab experiments, facial feedback seemed to have a real effect…But Strack realized that all this prior research shared a fundamental problem: The subjects either knew or could have guessed the point of the experiments. When a psychologist tells you to smile, you sort of know how you’re expected to feel.”

Strack was not the first to do so. 

Quote: “In the 1960s, James Laird, then a graduate student at the University of Rochester, had concocted an elaborate ruse: He told a group of students that he wanted to record the activity of their facial muscles under various conditions, and then he hooked silver cup electrodes to the corners of their mouths, the edges of their jaws, and the space between their eyebrows. The wires from the electrodes plugged into a set of fancy but nonfunctional gizmos… Subjects who had put their faces in frowns gave the cartoons an average rating of 4.4; those who put their faces in smiles judged the same set of cartoons as being funnier—the average jumped to 5.5.”

 

A change by 1.1 points on a rating scale is a huge effect and consistent results across different studies would suggest that the effect can be easily replicated.   The point of Strack’s study was not to demonstrate the effect, but to improve the cover story that made it difficult for participants to guess the real purpose of the study.

“Laird’s subterfuge wasn’t perfect, though. For all his careful posturing, it wasn’t hard for the students to figure out what he was up to. Almost one-fifth of them said they’d figured out that the movements of their facial muscles were related to their emotions. Strack and Martin knew they’d have to be more crafty. At one point on the drive to Mardi Gras, Strack mused that maybe they could use thermometers. He stuck his finger in his mouth to demonstrate.  Martin, who was driving, saw Strack’s lips form into a frown in the rearview mirror. That would be the first condition. Martin had an idea for the second one: They could ask the subjects to hold thermometers—or better, pens—between their teeth. This would be the stroke of genius that produced a classic finding in psychology.”

So in a way, Strack et al.’s study was a conceptual replication study of Laird’s study that used a different manipulation of facial muscles. And the replication study was successful.

“The results matched up with those from Laird’s experiment. The students who were frowning, with their pens balanced on their lips, rated the cartoons at 4.3 on average. The ones who were smiling, with their pens between their teeth, rated them at 5.1. What’s more, not a single subject in the study noticed that her face had been manipulated. If her frown or smile changed her judgment of the cartoons, she’d been totally unaware.”

However, even though the effect size was still large, a 0.8-point difference in ratings, the effect was only marginally significant. A second study by Strack et al. also produced only a marginally significant result. Thus, we may start to wonder why the researchers were not able to produce stronger evidence for the effect that would meet the conventional criterion for claiming a discovery, p < .05 (two-tailed). And why did this study become a classic without stronger evidence that the effect is real and that it is really as large as the effect sizes reported in these studies? The effect size may not matter for basic research studies that merely want to demonstrate that the effect exists, but it is important for applications to the real world. Even if an effect is large under strictly controlled laboratory conditions, the effect is going to be much smaller in real-world situations, where many of the factors that are controlled in the laboratory also influence emotional experiences. This might also explain why people normally do not notice the contribution of their facial expressions to their experiences. Relative to their mood, the funniness of a joke, the presence of others, and a dozen more contextual factors that influence our emotional experiences, feedback from facial muscles may make a very small contribution. Strack seems to agree.

Quote “It was theoretically trivial,” says Strack, but his procedure was both clever and revealing, and it seemed to show, once and for all, that facial feedback worked directly on the brain, without the intervention of the conscious mind. Soon he was fielding calls from journalists asking if the pen-in-mouth routine might be used to cure depression. He laughed them off. There are better, stronger interventions, he told them, if you want to make a person happy.”

Strack may have been confident that his study would replicate because other publications used his manipulation and also reported significant results.  And researchers even proposed that the effect is strong enough to have practical implications in the real world.  One study even suggested that controlling facial expressions can reduce prejudice.

Quote: “Strack and Martin’s method would eventually appear in a bewildering array of contexts—and be pushed into the realm of the practical. If facial expressions could influence a person’s mental state, could smiling make them better off, or even cure society’s ills? It seemed so. In 2006, researchers at the University of Chicago showed that you could make people less racist by inducing them to smile—with a pen between their teeth—while they looked at pictures of black faces.”

The result is so robust that replicating it is a piece of cake, a walk in the park, and works even in classroom demonstrations.

“Indeed, the basic finding of Strack’s research—that a facial expression can change your feelings even if you don’t know that you’re making it—has now been reproduced, at least conceptually, many, many times. (Martin likes to replicate it with the students in his intro to psychology class.)”

Finally, Strack may have been wrong when he laughed off questions about curing depression with controlling facial muscles.  Apparently, it is much harder to commit suicide if you put a pen in your mouth to make yourself smile.

Quote: “In recent years, it has even formed the basis for the treatment of mental illness. An idea that Strack himself had scoffed at in the 1980s now is taken very seriously: Several recent, randomized clinical trials found that injecting patients’ faces with Botox to make their “frown lines” go away also helped them to recover from depression.”

So, here you have it. If you ignore publication bias and treat the mountain of confirmatory evidence with a 100% success rate in journals as credible evidence, there is little doubt that the results would replicate. Of course, by the same standard of evidence there was no reason to doubt that other priming studies would replicate, and few did doubt it until a group of skeptical researchers tried to replicate the results and failed to do so.

Quote: “Strack found himself with little doubt about the field. “The direct influence of facial expression on judgment has been demonstrated many, many times,” he told me. “I’m completely convinced.” That’s why he volunteered to help the skeptics in that email chain three years ago. “They wanted to replicate something, so I suggested my facial-feedback study,” he said. “I was confident that they would get results, so I didn’t know how interesting it would be, but OK, if they wanted to do that? It would be fine with me.”

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PART II:  THE REPLICATION STUDY

The replication project was planned by EJ Wagenmakers, who made his name as a critic of research practices in social psychology in response to Bem’s (2011) incredible demonstration of feelings that predict random future events. Wagenmakers believes that many published results are not credible because the studies failed to test theoretical predictions. Social psychologists would run many studies and publish the results when they discovered a significant result with p < .05 (at least one-tailed). When the results did not become significant, the study was considered a failure and was not reported. This practice makes it difficult to predict which results are real and will replicate and which results are not real and will not replicate. Wagenmakers estimated that the facial feedback study had a 30% chance of replicating.

Quote “Personally, I felt that this one actually had a good chance to work,” he said. How good a chance? I gave it a 30-percent shot.” [Come again.  A good chance is 30%?]

A 30% probability may be justified because a replication project by the Open Science Collaboration found that only 25% of social psychological results were successfully replicated. However, that project used only slightly larger samples than the original studies. In the replication of the facial feedback hypothesis, 17 labs with larger samples than the original studies and nearly 2,000 participants in total were going to replicate the original study. The increase in sample size increases the chances of producing a significant result even if the effect size of the original study was vastly inflated. If a result is not significant with 2,000 participants, it becomes possible to say that the effect may actually not exist, or that the effect size is so small as to be practically meaningless and certainly of no relevance for the treatment of depression. Thus, the prediction that there is only a 30% chance of success implies that Wagenmakers was very skeptical about the original results and expected a drastic reduction in the effect size.

Quote “In a sense, he was being optimistic. Replication projects have had a way of turning into train wrecks. When researchers tried to replicate 100 psychology experiments from 2008, they interpreted just 39 of the attempts as successful. In the last few years, Perspectives on Psychological Science has been publishing “Registered Replication Reports,” the gold standard for this type of work, in which lots of different researchers try to re-create a single study so the data from their labs can be combined and analyzed in aggregate. Of the first four of these to be completed, three ended up in failure.”

There were good reasons to be skeptical. First, the facial feedback theory is controversial. There are two camps in psychology. One camp assumes that emotions are generated in the brain in direct response to cognitive appraisals of the environment. The other has argued that emotional experiences are based on bodily feedback. The controversy goes back to James versus Cannon and led to the famous Lazarus-Zajonc debate in the 1980s, at the beginning of modern emotion research. There is also the problem that it is statistically improbable that Strack et al. (1988) would obtain marginally significant results twice in a row in two independent studies. Sampling error makes p-values move around, and the chance of getting p < .10 and p > .05 twice in a row is slim. This suggests that the evidence was partially obtained with a healthy dose of sampling error and that a replication study would produce weaker effect sizes.
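How slim is slim? Here is a rough sketch with hypothetical numbers (not the actual design of Strack et al., 1988): if the true effect size is set to a value that gives the study only modest power, the chance that a two-sample t-test lands in the marginally significant window (.05 < p < .10, two-tailed, effect in the predicted direction) is small, and the chance of this happening twice in a row is that probability squared.

n <- 46; d <- .4; df <- 2 * n - 2       # hypothetical group size and true effect size
ncp <- d * sqrt(n / 2)                  # noncentrality parameter of the t-test
p.marginal <- pt(qt(.975, df), df, ncp) - pt(qt(.95, df), df, ncp)
p.marginal       # chance of one marginally significant result, ~ .13 in this example
p.marginal^2     # chance of two in a row, ~ .02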

Quote: The work on facial feedback, though, had never been a target for the doubters; no one ever tried to take it down. Remember, Strack’s original study had confirmed (and then extended) a very old idea. His pen-in-mouth procedure worked in other labs.

Strack also hedged his bets and offered some reasons why the replication project might not produce a straight replication of his findings; for example, he claims that the original study did not produce a huge effect.

Quote “He acknowledged that the evidence from the paper wasn’t overwhelming—the effect he’d gotten wasn’t huge. Still, the main idea had withstood a quarter-century of research, and it hadn’t been disputed in a major, public way. “I am sure some colleagues from the cognitive sciences will manage to come up with a few nonreplications,” he predicted. But he thought the main result would hold.”

But that is wrong. The study did produce a surprisingly huge effect. It just didn’t produce strong evidence that this effect was caused by facial feedback rather than by problems with the random assignment of participants to conditions. The sample sizes were so small that the large effect was only a bit more than 1.5 times its standard error, which is just enough to claim a discovery with p < .05 one-tailed, but not 2 times its standard error, which is needed to claim a discovery with p < .05 two-tailed. So the reported effect size was huge, but the strength of evidence was not. Taking the reported effect size at face value, one would predict that only every other study would produce a significant result and that the other studies would fail to replicate the result. So even if all 17 laboratories successfully replicated the study and the true effect size was as large as the effect size reported by Strack et al., only about half of the labs would be able to claim a successful replication. As sample sizes were a bit larger in the replication studies, the percentage would be a bit higher, but clearly nobody should expect that all labs individually produce at least marginally significant results. In fact, it is improbable that Strack would have obtained two (marginally) significant results in his own two reported studies.
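The claim that only every other study would produce a significant result follows from treating the observed effect as the true effect. A minimal sketch (with a hypothetical number of degrees of freedom): if the observed t-value sits exactly at the one-tailed .05 criterion, the probability that an exact replication reaches that criterion again is about 50%.

df <- 90                               # hypothetical degrees of freedom
t.obs <- qt(.95, df)                   # observed t exactly at the one-tailed .05 criterion
1 - pt(qt(.95, df), df, ncp = t.obs)   # probability of success in an exact replication, ~ .50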

After several years of planning, collecting data, and analyzing the data, the results were reported. Not a single lab produced a significant result. More importantly, even a combined analysis of data from close to 2,000 participants showed no effect; the effect size was close to zero. In other words, there was no evidence that facial feedback had any influence on ratings of amusement in response to cartoons. This is what researchers call an epic fail. The study did not merely fail to reach significance with a smaller effect size estimate; the effect just doesn’t appear to be there, although even with 2,000 participants it is not possible to say that the effect is exactly zero. The results leave open the possibility that a very small effect exists, but an even larger sample would be needed to test this hypothesis. At the same time, the results are not inconsistent with the original results, because the original study had so much noise that the population effect size could have been close to zero.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PART III: Response to the Replication Failure

We might think that Strack was devastated by the failure to replicate the most famous result of his research career. However, he is rather unmoved by these results.

Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.

This is a bit odd, because earlier Strack assured us that he is not religious and trusts the scientific method: “I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.” So here we have two original studies with weak evidence for an effect and 17 studies with no evidence for the effect. If we combine the information from all 19 studies, we have no evidence for an effect. To believe in an effect even though 19 studies fail to provide scientific evidence for it seems a bit religious, although I would make a distinction between genuinely religious individuals, who know that they believe in something, and wanna-be scientists, who believe that they know something. How does Strack justify his belief in an effect that just failed to replicate? He refers to an article (a take-down) by himself that, according to his own account, shows fundamental problems with the idea that failed replication studies provide meaningful information. Apparently, only original studies provide meaningful information, and when replication studies fail to replicate the results of original studies, there must be a problem with the replication studies.

Quote: “Two years ago, while the replication of his work was underway, Strack wrote a takedown of the skeptics’ project with the social psychologist Wolfgang Stroebe. Their piece, called “The Alleged Crisis and the Illusion of Exact Replication,” argued that efforts like the RRR reflect an “epistemological misunderstanding,”

Accordingly, Bem (2011) did successfully demonstrate that humans (at least extraverted humans) can predict random events in the future and that learning after an exam can retroactively improve performance on the completed exam. The fact that replication studies failed to replicate these results only reflects the epistemological misunderstanding that we can learn anything from replication studies by skeptics. So what is the problem with replication studies?

Quote: “Since it’s impossible to make a perfect copy of an old experiment. People change, times change, and cultures change, they said. No social psychologist ever steps in the same river twice. Even if a study could be reproduced, they added, a negative result wouldn’t be that interesting, because it wouldn’t explain why the replication didn’t work.”

We cannot reproduce exactly the same conditions of the original experiment. But why is that important? The same paradigm was allegedly used to reduce prejudice and cure depression, in studies that are wildly different from the original studies. It worked even then. So why did it not work when the original study was replicated as closely as possible? And why would we care, in 2016, about a study that worked (marginally) with 92 undergraduate students at the University of Illinois in the 1980s? We don’t. For humans in 2016, the results of a study in 2015 are more relevant. Maybe it worked back then, maybe it didn’t. We will never know, but we do now know that it typically doesn’t work in 2015. Maybe it will work again in 2017. Who knows? But we cannot claim that there has been good support for the facial feedback theory ever since Darwin came up with it.

But Strack goes further.  When he looks at the results of the replication studies, he does not see what the authors of the replication studies see. 

Quote: “So when Strack looks at the recent data he sees not a total failure but a set of mixed results.”

All 17 studies find no effect, and all studies are consistent with the hypothesis that there is no effect; the 95% confidence interval includes 0, which is also true for Strack’s original two studies. How can somebody see mixed results in this consistent pattern of results?

Quote:  Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed?

He simply divides the studies post hoc into studies that produced a positive result and studies that produced a negative result. There is no justification for this, because none of these studies differ significantly from each other, and the overall test shows that there is no heterogeneity; that is, the results are consistent with the hypothesis that the true population effect size is 0 and that all of the variability in effect sizes across studies is just the random noise that is expected in studies with modest sample sizes.

Quote: “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. How could he turn his back on all that evidence?”

And with this final quote, Strack leaves the realm of scientific discourse and proper interpretation of empirical facts. He is willing to disregard the results of a scientific test of the facial feedback hypothesis that he initially agreed to. It is now clear why he agreed to it: he never considered it a real test of his theory. No matter what the results were, he would maintain his belief in a couple of marginally significant results that are statistically improbable. Social psychologists have of course studied how humans respond to negative information that challenges their self-esteem and world views. Unlike facial feedback, these results are robust and not surprising. Humans are prone to dismiss inconvenient evidence and to construct sometimes ridiculous arguments in order to prop up cherished false beliefs. As such, Strack’s response to the failure of his most famous article is a successful demonstration that some findings in social psychology are replicable; it just so happens that Strack’s own finding is not one of them.

Strack comes up with several objections to the replication studies that show his ignorance of the whole project. For example, he claims that many participants may have guessed the purpose of the study because it is now a textbook finding. However, the researchers who conducted the replication studies made sure that the study was run before the finding was covered in class, and some universities do not cover it at all. Moreover, just like Laird, the replication teams excluded participants who guessed the purpose of the study. Many more participants were excluded because they didn’t hold the pen properly. Of course, this should strengthen the effect, because the manipulation should not work when the wrong facial muscles are activated.

Strack even claims that the whole project lacked a research question.

Quote: “Strack had one more concern: “What I really find very deplorable is that this entire replication thing doesn’t have a research question.” It does “not have a specific hypothesis, so it’s very difficult to draw any conclusions,” he told me.”

This makes no sense. Participants were randomly allocated to two conditions and a dependent variable was measured. The hypothesis was that holding the pen in a way that elicits a smile leads to higher ratings of amusement than holding the pen in a way that elicits a frown. The empirical question was whether this manipulation would have an effect, and this was assessed with a standard test of statistical significance. The answer was that there was no evidence for the effect. The research question was the same as in the original study. If this is not a research question, then the original study also had no research question.

And finally, Strack makes the unscientific claim that it simply cannot be true that the reported studies all got it wrong.

Quote: The RRR provides no coherent argument, he said, against the vast array of research, conducted over several decades, that supports his original conclusion. “You cannot say these [earlier] studies are all p-hacked,” Strack continued, referring to the battery of ways in which scientists can nudge statistics so they work out in their favor. “You have to look at them and argue why they did not get it right.”

Scientific journals select studies that produced significant results. As a result, all prior studies were published because they produced a significant (or at least marginally significant) result. Given this selection for significance, there is no error control. The number of successful replications in the published literature tells us nothing about the truth of a finding. We do not have to claim that all studies were p-hacked. We can just say that all studies were selected to be significant, and that is true and well known. As a result, we do not know which results will replicate until we conduct replication studies and do not select for significance. This is what the RRR did. As a result, it provides the first unbiased and real empirical test of the facial feedback hypothesis, and this test failed. That is science. Ignoring it is not.

Daniel Engber’s closer inspection of the original article reveals further problems.

Quote: For the second version, Strack added a new twist. Now the students would have to answer two questions instead of one: First, how funny was the cartoon, and second, how amused did it make them feel? This was meant to help them separate their objective judgments of the cartoons’ humor from their emotional reactions. When the students answered the first question—“how funny is it?,” the same one that was used for Study 1—it looked as though the effect had disappeared. Now the frowners gave the higher ratings, by 0.17 points. If the facial feedback worked, it was only on the second question, “how amused do you feel?” There, the smilers scored a full point higher. (For the RRR, Wagenmakers and the others paired this latter question with the setup from the first experiment.) In effect, Strack had turned up evidence that directly contradicted the earlier result: Using the same pen-in-mouth routine, and asking the same question of the students, he’d arrived at the opposite answer. Wasn’t that a failed replication, or something like it?”

Strack dismisses this concern as well, but Daniel Engber is not convinced.

Quote:  “Strack didn’t think so. The paper that he wrote with Martin called it a success: “Study 1’s findings … were replicated in Study 2.”… That made sense, sort of. But with the benefit of hindsight—or one could say, its bias—Study 2 looks like a warning sign. This foundational study in psychology contained at least some hairline cracks. It hinted at its own instability. Why didn’t someone notice?

And nobody else should be convinced either. Fritz Strack is a prototypical example of a small group of social psychologists that has ruined social psychology by engaging in a game of publishing results that were consistent with theories of strong and powerful effects of stimuli on people’s behavior outside their awareness. These results were attention-grabbing, just like annual returns of 20% would be eye-catching returns. Many people invested in these claims on the basis of flimsy evidence that doesn’t even withstand scrutiny by a science journalist. And to be clear, only a few of them went as far as to fabricate data. But many others fabricated facts by publishing only studies that supported their claims while hiding evidence from studies that failed to show the effect. Now we see what happens when these claims are subjected to real empirical tests that can succeed or fail: many of them fail. For future generations it is not important why they did what they did or how they feel about it now. What is important is that we realize that many results in textbooks are not based on solid evidence and that social psychology needs to change the way it conducts research if it wants to become a real science that builds on empirically verifiable facts. Strack’s response to the RRR is what it is: a defensive reaction to evidence that his famous article was based on a false positive result.

How Can We Interpret Inferences with Bayesian Hypothesis Tests?

SUMMARY

In this blog post I show how it is possible to translate the results of a Bayesian Hypothesis Test into an equivalent frequentist statistical test that follows Neyman and Pearson’s approach to hypothesis testing, in which hypotheses are specified as ranges of effect sizes (critical regions) and observed effect sizes are used to make inferences about population effect sizes with known long-run error rates.

INTRODUCTION

The blog post also explains why it is misleading to interpret Bayes Factors that favor the null-hypothesis (d = 0) over an alternative hypothesis (e.g., a default Jeffreys prior) as evidence for the absence of an effect. This conclusion is only warranted with infinite sample sizes. With finite sample sizes, especially the small sample sizes that are typical in psychology, Bayes Factors in favor of H0 can only be interpreted as evidence that the population effect size is close to zero, not as evidence that the population effect size is exactly zero. How close to zero the effect sizes consistent with H0 are depends on the sample size and on the criterion value that is used to interpret the results of a study as sufficient evidence for H0.

One problem with Bayes Factors is that, like p-values, they are a continuous measure (of relative likelihood rather than of probability), and the observed value alone is not sufficient to justify an inference or interpretation of the data. This is why psychologists moved from Fisher’s approach to Neyman and Pearson’s approach, which compares an observed p-value to a criterion value specified by convention or by pre-registration. For p-values this criterion value is alpha. If p < alpha, we reject H0: d = 0 in favor of H1: there was a (positive or negative) effect.

Most researchers interpret Bayes Factors relative to some criterion value (e.g., BF > 3, BF > 5, or BF > 10). These criterion values are just as arbitrary as the .05 criterion for p-values, and the only justification for these values that I have seen is that Jeffreys, who invented Bayes Factors, said so. There is nothing wrong with a conventional criterion value, even if Bayesians think there is something wrong with p < .05 but use BF > 3 in just the same way. It is important, however, to understand the implications of using a particular criterion value for an inference. In NHST the criterion value has a clear meaning: in the long run, the rate of false inferences (deciding in favor of H1 when H1 is false) will not be higher than the criterion value. With alpha = .05 as a conventional criterion, the research community decided that it is ok to have a maximum 5% error rate. Unlike p-values, criterion values for Bayes Factors provide no information about error rates. The best way to understand what a Bayes Factor of 3 means is to assume that H0 and H1 are equally probable before we conduct a study; a Bayes Factor of 3 in favor of H0 then makes it 3 times more likely that H0 is true than that H1 is true. If we were gambling on results and the truth were known, we would increase our winning odds from 50:50 to 75:25. With a Bayes Factor of 5, the winning odds increase to 5:1, that is, roughly 83:17.
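A quick way to see where these betting odds come from (assuming, as above, 50:50 prior odds for H0 and H1):

bf0 <- c(3, 5, 10)    # Bayes Factors in favor of H0
bf0 / (bf0 + 1)       # implied posterior probability of H0: .75, .83, .91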

HYPOTHESIS TESTING VERSUS EFFECT SIZE ESTIMATION

p-values and BF also share another shortcoming. Namely, they provide information about the data given a hypothesis (or two hypotheses), but they do not provide information about the size of the effect in the data. We all know that we should not report results as “X influenced Y, p < .05.” The reason is that this statement provides no information about the effect size: the effect size could be tiny, d = 0.02, small, d = .20, or large, d = .80. Thus, it is now required to provide some information about raw or standardized effect sizes and ideally also about the amount of raw or standardized sampling error. For example, standardized effect sizes can be reported as a standardized mean difference with its sampling error (d = .3, se = .15) or as a confidence interval (d = .3, 95% CI = 0 to .6). This is important information about the actual data, but it does not by itself constitute a hypothesis test. Thus, if the results of a study are used to test hypotheses, information about effect sizes and sampling errors has to be evaluated against specified criterion values that determine which hypothesis is consistent with an observed effect size.

RELATING HYPOTHESIS TESTS TO EFFECT SIZE ESTIMATION

In NHST, it is easy to see how p-values are related to effect size estimation. A confidence interval around the observed effect size is constructed by multiplying the amount of sampling error by a factor that is determined by alpha. The 95% confidence interval covers all values around the observed effect size except the most extreme 5% of values in the tails of the sampling distribution. It follows that a significance test that compares the observed effect size against any value outside the confidence interval will produce a p-value smaller than the error criterion.
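Here is a small sketch of this duality, using the illustrative values from above (d = .3, se = .15), a normal approximation, and a hypothetical test value of d = .7 that lies outside the interval:

d <- .3; se <- .15
d + c(-1, 1) * qnorm(.975) * se   # 95% CI, roughly 0 to .6
z <- (d - .7) / se                # test H0: d = .7, a value outside the CI
2 * pnorm(-abs(z))                # p < .05, as the duality implies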

It is not so straightforward to see how Bayes Factors relate to effect size estimates. Rouder et al. (2016) discuss a scenario in which the 95% credibility interval around the most likely effect size of d = .165 ranges from .055 to .275 and excludes zero. Thus, an evaluation of the null-hypothesis, d = 0, in terms of a 95% credibility interval would lead to the rejection of the point-zero hypothesis. We cannot conclude from this evidence that an effect is absent; rather, the most reasonable inference is that the population effect size is likely to be small, d ~ .2. In this scenario, however, Rouder et al. obtained a Bayes Factor of 1. This Bayes Factor does not support H0, but it also does not provide support for H1. How is it possible that two Bayesian methods seem to produce contradictory results? One method rejects H0: d = 0 and the other method shows no more support for H1 than for H0: d = 0.

Rouder et al. provide no answer to this question: “Here we have a divergence. By using posterior credible intervals, we might reject the null, but by using Bayes’ rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point” (p. 536). Moreover, they suggest that Bayes Factors give the correct answer and that the rejection of d = 0 by means of credibility intervals is unwarranted: “…, but by using Bayes’ rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point. Updating with Bayes’ rule directly is the correct approach because it describes appropriate conditioning of belief about the null point on all the information in the data” (p. 536).

The problem with this interpretation of the discrepancy is that Rouder et al. (2009) misinterpret the meaning of a Bayes Factor as if it could be directly interpreted as a test of the null-hypothesis, d = 0. However, in more thoughtful articles the same authors recognize that (a) Bayes Factors only provide relative information about H0 in comparison to a specific alternative hypothesis H1, (b) the specification of H1 influences Bayes Factors, (c) alternative hypotheses that give a high a priori probability to large effect sizes favor H0 when the observed effect size is small, and (d) it is always possible to specify an alternative hypothesis H1 that will not favor H0 by limiting the range of effect sizes to small effect sizes. For example, even with a small observed effect size of d = .165, it is possible to provide strong support for H1 and reject H0 if H1 is specified as Cauchy(0, 0.1) and the sample size is sufficiently large to test H0 against H1.

[Figure 1: Bayes Factors for an observed effect size of d = .165, as a function of sample size and the scaling factor of the Cauchy prior that specifies H1]
Figure 1 shows how Bayes Factors vary as a function of the specification of H1 and as a function of sample size for the same observed effect size of d = .165. It is possible to get a Bayes Factor greater than 3 in favor of H0 with a wide Cauchy(0, 1) prior and a small sample size of N = 100, and a Bayes Factor greater than 3 in favor of H1 with a small scaling factor of .4 or less and a sample size of N = 250. In short, it is not possible to interpret Bayes Factors that favor H0 as evidence for the absence of an effect. The Bayes Factor only tells us that the data are more consistent with H0 than with H1, but it is difficult to interpret this result because H1 is not a clearly specified alternative effect size. What H1 implies changes not only with the specification of the range of effect sizes, but also with sample size. This property is not a design flaw of Bayes Factors. They were designed to provide more and more stringent tests of H0: d = 0 that would eventually support H1 if the sample size is sufficiently large and H0: d = 0 is false. However, if H0 is false and H1 includes many large effect sizes (an ultrawide prior), Bayes Factors will first favor H0, and data collection may stop before the Bayes Factor switches and provides the correct result that the population effect size is not zero. This behavior of Bayes Factors was illustrated by Rouder et al. (2009) with a simulation of a population effect size of d = .02.
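The pattern in Figure 1 can be explored with a few lines of R code. The sketch below is my own illustration of the general approach, not the code behind the figure: it computes the Bayes Factor for a one-sample t-test (an assumption) by integrating the noncentral t likelihood over a Cauchy prior on the standardized effect size.

# BF10 for a one-sample t-test with a Cauchy(0, r) prior on d under H1
bf10 <- function(d.obs, n, r) {
  t  <- d.obs * sqrt(n)
  df <- n - 1
  m1 <- integrate(function(d) dt(t, df, ncp = d * sqrt(n)) * dcauchy(d, 0, r),
                  lower = -Inf, upper = Inf)$value   # marginal likelihood under H1
  m0 <- dt(t, df)                                    # likelihood under H0: d = 0
  m1 / m0
}
1 / bf10(.165, n = 100, r = 1)   # favors H0 (around 3) with a wide Cauchy(0, 1) prior
bf10(.165, n = 250, r = .4)      # favors H1 (around 3) with a narrower prior and a larger sample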

 

[Figure: Bayes Factors as a function of sample size for a small population effect size of d = .02 (simulation by Rouder et al., 2009)]
Here we see that the Bayes Factor favors H0 until sample sizes are above N = 5,000 and only provides the correct information that the point null-hypothesis is false with N = 20,000 or more. To avoid confusion in the interpretation of Bayes Factors and to provide a better understanding of the actual regions of effect sizes that are consistent with H0 and H1, I developed simple R code that translates the results of a Bayesian Hypothesis Test into a Neyman-Pearson hypothesis test.

TRANSLATING RESULTS FROM A BAYESIAN HYPOTHESIS TEST INTO RESULTS FROM A NEYMAN PEARSON HYPOTHESIS TEST

A typical analysis with Bayes Factors creates three regions of observed effect sizes. The first region contains effect sizes that are large enough to produce BF > BF.crit in favor of H1 over H0. The second region contains effect sizes with inconclusive Bayes Factors that meet neither criterion (1/BF.crit < BF(H1/H0) < BF.crit). The third region contains effect sizes between 0 and the largest effect size that still produces BF > BF.crit in favor of H0.
The width and location of these regions depend on the specification of H1 (a wider or narrower distribution of effect sizes under the assumption that an effect is present), the sample size, and the long-run error rate, where an error is defined as a BF > BF.crit that supports H0 when H1 is true and vice versa.
I examined the properties of Bayes Factors for two scenarios. In the first scenario, researchers specify H1 as Cauchy(0, .4). The value of .4 was chosen because .4 is a reasonable estimate of the median effect size in psychological research. I chose a criterion value of BF.crit = 5 to maintain a relatively low error rate.
I used a one sample t-test with n = 25, 100, 200, 500, and 1,000. The same amount of sampling error would be obtained in a two-sample design with 4x the sample size (N = 100, 400, 800, 2,000, and 4,000).
     bf.crit    N       bf0 ci.low border ci.high     alpha
[1,]       5   25  2.974385     NA     NA   0.557        NA
[2,]       5  100  5.296013  0.035 0.1535   0.272 0.1194271
[3,]       5  200  7.299299  0.063 0.1300   0.197 0.1722607
[4,]       5  500 11.346805  0.057 0.0930   0.129 0.2106060
[5,]       5 1000 15.951191  0.048 0.0715   0.095 0.2287873
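The boundary values in this table can be reconstructed with a root search on the Bayes Factor. The sketch below is my own reconstruction for the N = 100 row, assuming a one-sample design, H1 = Cauchy(0, .4), BF.crit = 5, and the same grid integration as the R code at the end of this post, so the numbers may differ slightly from the table.

# Sketch: reconstruct the N = 100 row of the table above
bf01 <- function(d.obs, n, r = .4, limit = 14, prec = 100) {
  se <- 1 / sqrt(n); df <- n - 1
  d  <- seq(-limit * prec, limit * prec) / prec       # grid of population effect sizes
  w  <- dcauchy(d, 0, r)                              # Cauchy(0, r) prior weights for H1
  dt(d.obs / se, df) / (sum(dt(d.obs / se, df, d / se) * w) / sum(w))
}
n <- 100; crit <- 5; se <- 1 / sqrt(n)
bf0     <- bf01(0, n)                                                # BF (H0/H1) when the observed effect size is 0
ci.low  <- uniroot(function(d) bf01(d, n) - crit,     c(0, 1))$root  # largest d.obs that still yields BF01 > crit
ci.high <- uniroot(function(d) 1 / bf01(d, n) - crit, c(0, 1))$root  # smallest d.obs that yields BF10 > crit
border  <- (ci.low + ci.high) / 2                                    # boundary between the H0 and H1 regions
alpha   <- pnorm(ci.low, mean = border, sd = se)                     # error rate if the true effect equals the border
round(c(bf0 = bf0, ci.low = ci.low, border = border, ci.high = ci.high, alpha = alpha), 4)

The remaining rows follow by changing n, and the Cauchy(0,1) table further below can be approximated the same way by setting r = 1 and crit = 3.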
We see that the typical sample size in cognitive psychology for a within-subject design (n = 25) will never produce a result in favor of H0, and it requires an observed effect size of d = .56 to produce a result in favor of H1. This criterion is somewhat higher than the criterion effect size for p < .05 (two-tailed), which is d = .41, and approximately the same as the effect size needed with alpha = .01 (d = .56).
With N = 100, it is possible to obtain evidence for H0. If the observed effect size is exactly 0, BF = 5.296, and the maximum observed effect size that still produces evidence in favor of H0 is d = .035. The minimum effect size needed to support H1 is d = .272. We can think about these two criterion values as the limits of a confidence interval around the effect size in the middle (d = .1535). The width of this interval implies that, in the long run, we would make about 11% errors in favor of H0 and 11% errors in favor of H1 if the population effect size is d = .1535. If we treat d = .1535 as the boundary of an interval null-hypothesis, H0: abs(d) < .1535, we do not make a mistake when the population effect size is less than .1535. So, we can interpret a BF > 5 as evidence for H0: abs(d) < .15, with an 11% error rate. The probability of erroneously supporting H0 when the population effect size is as large as d = .2 (a conventionally small effect) would be less than 11%. In short, we can interpret BF > 5 in favor of H0 as evidence for H0: abs(d) < .15 and BF > 5 in favor of H1 as evidence for H1: abs(d) > .15, with approximate error rates of 10% and a region of inconclusive evidence for observed effect sizes between d = .035 and d = .272.
The results for N = 200, 500, and 1,000 can be interpreted in the same way. An increase in sample size has the following effects: (a) the boundary effect size d.b that separates H0: |d| <= d.b and H1: |d| > d.b shrinks; in the limit it reaches zero and only d = 0 supports H0: |d| <= 0. With N = 1,000, the boundary value is d.b = .0715, the largest observed effect size that still supports H0 is d = .048, and an observed effect size of d = .095 is sufficient to support H1. However, the table also shows that (b) the error rate increases. In larger samples, a BF of 5 in one direction or the other occurs more easily by chance, and the long-run error rate has roughly doubled. Of course, researchers could maintain a fixed error rate by adjusting the BF criterion value, but Bayesian hypothesis tests are not designed to maintain a fixed error rate. If this were a researcher's goal, they could simply specify alpha and use NHST to test H0: |d| < d.crit vs. H1: |d| > d.crit.
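To illustrate that last alternative, a fixed-alpha Neyman Pearson test of H0: |d| <= d.crit against H1: |d| > d.crit only requires a noncentral t critical value. The sketch below is my own illustration; the one-sample design, d.crit = .15, alpha = .05, and the observed effect size are arbitrary example values, not choices made in this post.

# Sketch: test H0: |d| <= d.crit vs. H1: |d| > d.crit with a fixed long-run error rate.
# The type I error rate is controlled at the boundary of H0 (d = d.crit).
n      <- 100
d.crit <- .15                                    # boundary effect size (illustrative)
alpha  <- .05                                    # fixed long-run error rate
d.obs  <- .30                                    # hypothetical observed effect size
se     <- 1 / sqrt(n)
df     <- n - 1
t.obs  <- d.obs / se
t.cut  <- qt(1 - alpha, df, ncp = d.crit / se)   # critical value under the boundary hypothesis
round(c(t.obs = t.obs, t.cut = t.cut), 2)
abs(t.obs) > t.cut                               # TRUE would mean: reject H0 in favor of H1: |d| > d.crit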
In practice, many researchers use a wider prior and a lower criterion value. For example, E.J. Wagenmakers prefers the original Jeffreys prior with a scaling factor of 1 and a criterion value of 3 as noteworthy (but not definitive) evidence.
The next table translates inferences with a Cauchy(0,1) and BF.crit = 3 into effect size regions.
     bf.crit    N       bf0 ci.low border ci.high     alpha
[1,]       3   25  6.500319  0.256 0.3925   0.529 0.2507289
[2,]       3  100 12.656083  0.171 0.2240   0.277 0.2986493
[3,]       3  200 17.812296  0.134 0.1680   0.202 0.3155818
[4,]       3  500 28.080784  0.094 0.1140   0.134 0.3274574
[5,]       3 1000 39.672827  0.071 0.0850   0.099 0.3290325

The main effect of using Cauchy(0,1) to specify H1 is that the border value that distinguishes H0 and H1 is higher. The main effect of using BF.crit = 3 as a criterion value is that it is easier to provide evidence for H0 or H1 at the expense of having a higher error rate.

It is now possible to provide evidence for H0 with a small sample of n = 25 in a one-sample t-test. However, when we translate this finding into ranges of effect sizes, we see that the boundary between H0 and H1 is d = .39, and any observed effect size below d = .256 yields a BF in favor of H0. So, it would be misleading to interpret a BF of 3 in a sample of n = 25 as evidence for the point null-hypothesis d = 0.  It only shows that the observed effect size is more consistent with an effect size of 0 than with the effect sizes specified by H1, which places a lot of weight on large effect sizes.  As sample sizes increase, the meaning of BF > 3 in favor of H0 changes. With N = 1,000, any observed effect size larger than d = .071 no longer provides evidence for H0.  In the limit, with an infinite sample size, only d = 0 would provide evidence for H0 and we could infer that H0 is true. However, BF > 3 in a finite sample does not justify this inference.

The translation of BF results into hypotheses about effect size regions makes it clear why BF results in small samples often seem to diverge from hypothesis tests with confidence intervals or credibility intervals.  In small samples, Bayes Factors are sensitive to the specification of H1, and even if it is unlikely that the population effect size is 0 (0 is outside the confidence or credibility interval), the BF may show support for H0 because the observed effect size falls below the criterion value at which the data favor H0 over H1.  This inconsistency does not mean that different statistical procedures lead to different inferences. It only means that BF > 3 in favor of H0 RELATIVE TO H1 cannot be interpreted as a test of the hypothesis d = 0.  It can only be interpreted as evidence for H0 relative to H1, and the specification of H1 influences which effect sizes provide support for H0.

CONCLUSION

Sir Arthur Eddington (cited by Cacioppo & Berntson, 1994) described a hypothetical scientist who sought to determine the size of the various fish in the sea. The scientist began by weaving a 2-in. mesh net and setting sail across the seas, repeatedly sampling catches and carefully measuring, recording, and analyzing the results of each catch. After extensive sampling, the scientist concluded that there were no fish smaller than 2 in. in the sea.

The moral of this story is that a scientist's methods influence their results.  Scientists who use p-values to search for significant results in small samples will rarely discover small effects and may start to believe that most effects are large.  Similarly, scientists who use Bayes Factors with wide priors may delude themselves into believing that they are searching for small and large effects and falsely conclude that effects are either absent or large.  In both cases, scientists make the same mistake.  A small sample is like a net with large holes that can only (reliably) capture big fish.  This is fine if the goal is to capture only big fish, but it is a problem when the goal is to find out whether a pond contains any fish at all.  A wide net with big holes may never lead to the discovery of a fish in the pond, even though the pond contains plenty of small fish.

Researchers therefore have to be careful when they interpret a Bayes Factor, and they should not interpret Bayes Factors in favor of H0 as evidence for the absence of an effect. This fallacy is just as problematic as the fallacy of interpreting a p-value above alpha (p > .05) as evidence for the absence of an effect.  Most researchers are aware that non-significant results do not justify the inference that the population effect size is zero. It may be news to some that a Bayes Factor in favor of H0 suffers from the same problem.  A Bayes Factor in favor of H0 is better considered a finding that rejects the specific alternative hypothesis that was pitted against d = 0.  Falsification of this specific H1 does not justify the inference that H0: d = 0 is true.  Another model that was not tested could still fit the data better than H0.

Bayes Ratios: A Principled Approach to Bayesian Hypothesis Testing

 

This post is a stub that will be expanded and eventually be turned into a manuscript for publication.

 

I have written a few posts before that are critical of Bayesian hypothesis testing with Bayes Factors (Rouder et al., 2009; Wagenmakers et al., 2010, 2011).

The main problem with this approach is that it compares a single effect size (typically 0) with an alternative hypothesis that is a composite of all other effect sizes. The alternative is usually specified as a weighted composite, with a Cauchy distribution providing the weights for the different effect sizes.  This leads to a comparison of H0: d = 0 vs. H1: d ~ Cauchy(0, r), with r being a scaling factor that specifies the median absolute effect size under the alternative hypothesis.

It is well recognized by critics and proponents of this test that the comparison of H0 and H1 favors H0 more and more as the scaling factor is increased.  This makes the test sensitive to the specification of H1.

Another problem is that Bayesian hypothesis testing either uses arbitrary cutoff values (BF > 3) to interpret the results of a study or asks readers to specify their own prior odds of H0 and H1.  I have started to criticize this approach because the use of a subjective prior in combination with an objective specification of the alternative hypothesis can lead to false conclusions.  If I compare H0: d = 0 with H1: d = .2, I am comparing two hypotheses that each consist of a single value.  If I am very uncertain about the results of a study, I can assign an equal prior probability to both effect sizes, and the prior odds of H0/H1 are .5/.5 = 1. In that case, the Bayes Factor can be directly interpreted as the posterior odds of H0 and H1 given the data.

Bayes Ratio (H0/H1) = Prior Odds (H0/H1) * Bayes Factor (H0/H1)

However, if I increase the range of possible effect sizes for H1 because I am uncertain about the actual effect size, the a priori probability of H1 increases, just as my odds of winning increase when I spread my bet over several possible outcomes (lottery numbers, horses in the Kentucky Derby, or numbers in a roulette game).  Betting on effect sizes is no different, and the prior odds in favor of H1 increase the more effect sizes I consider plausible.

I therefore propose to use the prior distribution of effect sizes to specify my uncertainty about what could happen in a study. If I think the null-hypothesis is most likely, I can weight it more than other effect sizes (e.g., with a Cauchy or normal distribution centered at 0).  I can then use this distribution to compute (a) the prior odds of H0 and H1 and (b) the conditional probabilities of the observed test statistic (e.g., a t-value) given H0 and H1.

Instead of interpreting Bayes Factors directly, which is not Bayesian and confuses conditional probabilities of data given hypotheses with conditional probabilities of hypotheses given data, Bayes Factors are multiplied with the prior odds to obtain Bayes Ratios, which many Bayesians consider to be the answer to the real question researchers want answered: How much should I believe in H0 or H1 after I have collected data and computed a test statistic like a t-value?

This approach is more principled and Bayesian than the use of Bayes Factors with arbitrary cut-off values that are easily misinterpreted as evidence for H0 or H1.

One reason why this approach may not have been used before is that H0 is often specified as a point-value (d = 0) and the a priori probability of a single point effect size is 0.  Thus, the prior odds (H0/H1) are zero and the Bayes Ratio is also zero.  This problem can be avoided by restricting H1 to a reasonably small range of effect sizes and by specifying the null-hypothesis as a small range of effect sizes around zero.  As a result, it becomes possible to obtain non-zero prior odds for H0 and to obtain interpretable Bayes Ratios.

The inferences based on Bayes Ratios are not only more principled than those based on Bayes Factors, they are also more in line with the inferences one would draw on the basis of other methods that can be used to test H0 and H1, such as confidence intervals or Bayesian credibility intervals.

For example, imagine a researcher who wants to provide evidence for the null-hypothesis that there are no gender differences in intelligence.   The researcher decided a priori that small differences of less than 1.5 IQ points (0.1 SD) will be considered as sufficient to support the null-hypothesis. He collects data from 50 men and 50 women and finds a mean difference of 3 IQ points in one or the other direction (conveniently, it doesn’t matter in which direction).

The standardized mean difference is d = 3/15 = .2, the sampling error is SE = 2/sqrt(100) = .2, and the t-value is t = .2/.2 = 1.  A t-value of 1 is not statistically significant. Thus, it is clear that the data do not provide evidence against H0 that there are no gender differences in intelligence.  However, do the data provide sufficient positive evidence for the null-hypothesis?  p-values are not designed to answer this question.  The 95%CI around the observed standardized effect size is -.19 to .59.  This confidence interval is wide. It includes 0, but it also includes d = .2 (a small effect size) and d = .5 (a moderate effect size), which would translate into a difference of 7.5 IQ points.  Based on this finding, it would be questionable to interpret the data as support for the null-hypothesis.
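The arithmetic is easy to check in R; the short sketch below simply reproduces the numbers of the example, using the 2/sqrt(N) approximation for the sampling error of d with two groups that is also used in the code at the end of this post.

# Sketch: t-value and 95%CI for the gender-IQ example (two groups of 50, a 3-point difference, SD = 15)
d.obs <- 3 / 15               # standardized mean difference, d = .2
se    <- 2 / sqrt(100)        # approximate sampling error of d for a two-group design with N = 100
t     <- d.obs / se           # t = 1
ci    <- d.obs + c(-1.96, 1.96) * se
round(c(t = t, ci.lower = ci[1], ci.upper = ci[2]), 2)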

With a default specification of the alternative hypothesis as a Cauchy distribution with a scaling factor of 1, the Bayes Factor (H0/H1) favors H0 over H1 by 4.95:1.  The most appropriate interpretation of this finding is that the prior odds should be updated by a factor of about 5:1 in favor of H0, whatever these prior odds are.  However, following Jeffreys, many users who compute Bayes Factors interpret them directly with reference to Jeffreys's criterion values, and a value greater than 3 can be and has been used to suggest that the data provide support for the null-hypothesis.

This interpretation ignores that the a priori distribution of effect sizes allocates only a small probability (p = .07) to H0 and a much larger probability to H1 (p = .93).  When the Bayes Factor is combined with the prior odds (H0/H1) of .07/.93 = .075, the resulting Bayes Ratio shows that support for H0 increased, but that it is still more likely that H1 is true than that H0 is true: .075 * 4.95 = .37.  This conclusion is consistent with the finding that the 95%CI covers the region of effect sizes for H0 (d = -.1 to .1) as well as effect sizes well outside it.
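The prior odds in this step follow directly from the Cauchy prior. The sketch below is my own condensed illustration using the closed-form Cauchy distribution function; the code at the end of this post integrates over a truncated grid, so its prior probability for H0 (.07) is slightly larger than the closed-form value (about .06), and the Bayes Factor of 4.95 is simply taken from the example.

# Sketch: prior odds of the interval null H0: -.1 < d < .1 under a Cauchy(0, 1) prior,
# combined with the Bayes Factor from the example to form a Bayes Ratio.
p.H0       <- pcauchy(.1, 0, 1) - pcauchy(-.1, 0, 1)  # prior probability of the H0 region
prior.odds <- p.H0 / (1 - p.H0)                       # prior odds (H0/H1)
bf01       <- 4.95                                    # Bayes Factor (H0/H1) from the example
round(c(p.H0 = p.H0, prior.odds = prior.odds, bayes.ratio = prior.odds * bf01), 3)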

We can increase the prior odds of H0 by restricting the range of effect sizes that are plausible under H1.  For example, we can limit the maximum effect size to 1, or we can set the scaling parameter of the Cauchy distribution to .5 so that 50% of the distribution falls into the range between d = -.5 and d = .5.

The t-value and 95%CI remain unchanged because they do not require a specification of H1.  By cutting the range of effect sizes for H1 roughly in half (from a scaling parameter of 1 to .5), the Bayes Factor in favor of H0 is also cut roughly in half and no longer exceeds the criterion value of 3, BF(H0/H1) = 2.88.

The change of the alternative hypothesis has the opposite effect on the prior odds. The prior probability of H0 nearly doubles (p = .13), and the prior odds are now .13/.87 = .15.  The resulting Bayes Ratio in favor of H0, .15 * 2.88 = .45, remains similar to, and is in fact slightly stronger than, the Bayes Ratio with the wider specification of effect sizes (BR(H0/H1) = .37).  Both Bayes Ratios lead to the same conclusion, which is also consistent with the observed effect size, d = .2, and the confidence interval around it (d = -.19 to .59): given the small sample size, the observed effect size provides insufficient information to draw firm conclusions about H0 or H1. More data are required to decide empirically which hypothesis is more likely to be true.

The example used an arbitrary observed effect size of d = .2.  Evidently, much larger observed effect sizes would lead to the rejection of H0 with p-values, confidence intervals, Bayes Factors, or Bayes Ratios.  A more interesting question is what the results would look like if the observed effect size provided maximum support for the null-hypothesis, that is, an observed effect size of 0, which also produces a t-value of 0.  With the default Cauchy(0,1) prior, the Bayes Factor in favor of H0 is 9.42, which is close to the next criterion value of BF > 10 that is sometimes used to stop data collection because the results are considered decisive.  However, the Bayes Ratio still favors H1 slightly, BR(H1/H0) = 1.42.  The 95%CI ranges from -.39 to .39 and extends well beyond the criterion range of effect sizes from -.1 to .1.  Thus, the Bayes Ratio shows that even an observed effect size of 0 in a sample of N = 100 provides insufficient evidence to infer that the null-hypothesis is true.

When we increase the sample size to N = 2,000, the 95%CI around d = 0 ranges from -.09 to .09.  This finding means that the data support the null-hypothesis and that inferences based on this approach would be wrong in no more than 5% of our tests (not just those that provide evidence for H0, but all tests that use this approach).  The Bayes Factor also favors H0, with a massive BF(H0/H1) = 711.27.  The Bayes Ratio also favors H0, with a Bayes Ratio of 53.35.  Because Bayes Ratios are the ratio of two complementary probabilities, p(H0) + p(H1) = 1, we can compute the probability of H0 being true with the formula BR(H0/H1) / (BR(H0/H1) + 1), which yields a probability of 98%.  We see that the Bayes Ratio is consistent with the information provided by the confidence interval.  The long-run error frequency for inferring H0 from the data is less than 5%, and the probability of H1 being true given the data is 1 - .98 = .02.
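As a quick check of this conversion (my own sketch, using the Bayes Ratio reported above): because p(H0) and p(H1) are complementary, the posterior probability of H0 is BR(H0/H1) divided by BR(H0/H1) + 1.

# Sketch: converting a Bayes Ratio (H0/H1) into posterior probabilities
br01 <- 53.35                  # Bayes Ratio for an observed effect size of 0 with N = 2,000 (reported above)
p.H0 <- br01 / (br01 + 1)      # posterior probability of H0 given the data
round(c(p.H0 = p.H0, p.H1 = 1 - p.H0), 2)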

Conclusion

Bayesian Hypothesis Testing has received increased interest among empirical psychologists, especially in situations when researchers aim to demonstrate the lack of an effect.  Increasingly, researchers use Bayes-Factors with criterion values to claim that their data provide evidence for the null-hypothesis.  This is wrong for three reasons.

First, it is impossible to test a hypothesis that is specified as one effect size out of an infinite number of alternative effect sizes. Researchers appear to believe that a Bayes Factor in favor of H0 shows that all other effect sizes are implausible. This is not the case because Bayes Factors do not compare H0 to all other effect sizes individually. They compare H0 to a composite hypothesis of all other effect sizes, and Bayes Factors depend on the way the composite is created. Falsification of one composite does not ensure that the null-hypothesis is true (the only viable hypothesis still standing) because other composites can still fit the data better than H0.

Second, the use of Bayes Factors with criterion values also suffers from the problem that it ignores the a priori odds of H0 and H1.  A full Bayesian inference requires taking the prior odds into account and computing posterior odds or Bayes Ratios.  The problem for the point-null hypothesis (d = 0) is that the prior odds of H0 over H1 are 0. The reason is that the prior distribution of effect sizes adds up to 1 (the true effect size has to be somewhere), leaving zero probability for the single point d = 0.  It is still possible to compute Bayes Factors for d = 0 because Bayes Factors use densities. For the computation of Bayes Factors the distinction between densities and probabilities is not important, but for the computation of prior odds the distinction is crucial.  A single effect size has a density on the Cauchy distribution, but it has zero probability.

The fundamental inferential problem of Bayes Factors that compare against H0: d = 0 can be avoided by specifying H0 as a critical region around d = 0.  It is then possible to compute prior odds based on the area under the curve for H0 and the area under the curve for H1. It is also possible to compute Bayes Factors for H0 and H1 when H0 and H1 are specified as complementary regions of effect sizes.  The two ratios can be multiplied to obtain a Bayes Ratio, and Bayes Ratios can be converted into the probability of H0 given the data and the probability of H1 given the data.  The results of this test are consistent with other approaches to testing regional null-hypotheses, and they are robust to misspecifications of the alternative hypothesis that allocate too much weight to large effect sizes.  Thus, I recommend Bayes Ratios for principled Bayesian hypothesis testing.

 

*************************************************************************

R-Code for the analyses reported in this post.

*************************************************************************

#######################
### set input
#######################

### What is the total sample size?
N = 2000

### How many groups?  One sample or two sample?
gr = 2

### what is the observed effect size
obs.es = 0

### Set the range for H0, H1 is defined as all other effect sizes outside this range
H0.range = c(-.1,.1)  #c(-.2,.2) # 0 for classic point null

### What is the limit for the maximum effect size? (d = 14 corresponds to r = .99)
limit = 14

### What is the mode of the a priori distribution of effect sizes?
mode = 0

### What is the variability (SD for normal, scaling parameter for Cauchy) of the a priori distribution of effect sizes?
var = 1

### What is the shape of the a priori distribution of effect sizes
shape = "Cauchy"  # Uniform, Normal, Cauchy;  Uniform needs limit

### End of Input
### R computes Likelihood ratios and Weighted Mean Likelihood Ratio (Bayes Factor)
prec = 100 #set precision, 100 is sufficient for 2 decimal
df = N-gr
se = gr/sqrt(N)
pop.es = mode
if (var > 0) pop.es = seq(-limit*prec,limit*prec)/prec
weights = 1
if (var > 0 & shape == "Cauchy") weights = dcauchy(pop.es,mode,var)
if (var > 0 & shape == "Normal") weights = dnorm(pop.es,mode,var)
if (var > 0 & shape == "Uniform") weights = dunif(pop.es,-limit,limit)
H0.mat = cbind(0,1)
H1.mat = cbind(mode,1)
if (var > 0) H0.mat = cbind(pop.es,weights)[pop.es >= H0.range[1] & pop.es <= H0.range[2],]
if (var > 0) H1.mat = cbind(pop.es,weights)[pop.es < H0.range[1] | pop.es > H0.range[2],]
H0.mat = matrix(H0.mat,,2)
H1.mat = matrix(H1.mat,,2)
H0 = sum(dt(obs.es/se,df,H0.mat[,1]/se)*H0.mat[,2])/sum(H0.mat[,2])
H1 = sum(dt(obs.es/se,df,H1.mat[,1]/se)*H1.mat[,2])/sum(H1.mat[,2])
BF10 = H1/H0
BF01 = H0/H1
Pr.H0 = sum(H0.mat[,2]) / sum(weights)
Pr.H1 = sum(H1.mat[,2]) / sum(weights)
PriorOdds = Pr.H1/Pr.H0
Bayes.Ratio10 = PriorOdds*BF10
Bayes.Ratio01 = 1/Bayes.Ratio10
### R creates output file
text = c()
text[1] = paste0('The observed t-value with d = ',obs.es,' and N = ',N,' is t(',df,') = ',round(obs.es/se,2))
text[2] = paste0('The 95% confidence interval is ',round(obs.es-1.96*se,2),' to ',round(obs.es+1.96*se,2))
text[3] = paste0('Weighted Mean Density(H0: d >= ',H0.range[1],' & d <= ',H0.range[2],') = ',round(H0,5))
text[4] = paste0('Weighted Mean Density(H1: d < ',H0.range[1],' | d > ',H0.range[2],') = ',round(H1,5))
text[5] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H0/H1: ',round(BF01,2))
text[6] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H1/H0: ',round(BF10,2))
text[7] = paste0('The prior odds of H1/H0 are ',round(Pr.H1,2),'/',round(Pr.H0,2),' = ',round(PriorOdds,2))
text[8] = paste0('The Bayes Ratio (H1/H0) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio10,2))
text[9] = paste0('The Bayes Ratio (H0/H1) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio01,2))
### print output
text