
Bayesian Meta-Analysis: The Wrong Way and The Right Way

Carlsson, R., Schimmack, U., Williams, D.R., & Bürkner, P. C. (in press). Bayesian Evidence Synthesis is no substitute for meta-analysis: a re-analysis of Scheibehenne, Jamil and Wagenmakers (2016). Psychological Science.

In short, we show that the reported Bayes factor of 36 in the original article is inflated by pooling across a heterogeneous set of studies, using a one-sided prior, and assuming a fixed effect size. We present an alternative Bayesian multilevel approach that avoids the pitfalls of Bayesian Evidence Synthesis and show that the original set of studies produced at best weak evidence for an effect of social norms on towel reuse.
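For readers who want to see what such a multilevel model looks like in practice, here is a minimal sketch in R using brms (the priors and variable names are illustrative assumptions, not the specification from the paper; dat is assumed to hold one row per study with an effect size yi, its standard error sei, and a study identifier):

library(brms)

# Random-effects (multilevel) meta-analysis: each study has its own true effect,
# drawn from a population distribution, instead of a single fixed effect for all studies.
fit <- brm(
  yi | se(sei) ~ 1 + (1 | study),                       # known sampling SEs, study-level random effects
  data  = dat,
  prior = c(prior(normal(0, 1),   class = Intercept),   # two-sided prior on the average effect
            prior(normal(0, 0.5), class = sd)),         # prior on between-study heterogeneity
  iter  = 4000, cores = 4
)
summary(fit)   # Intercept = average effect; sd(Intercept) = between-study standard deviation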


Peer-Reviews from Psychological Methods

Times are changing. Media are flooded with fake news and journals are filled with fake novel discoveries. The only way to fight bias and fake information is full transparency and openness.
 
Jerry Brunner and I wrote a paper that examines the validity of z-curve, the method underlying powergraphs, and submitted it to Psychological Methods.

As soon as we submitted it, we made the manuscript and the code available. Nobody used the opportunity to comment on the manuscript. Now we have received the official reviews.

We would like to thank the editor and reviewers for spending time and effort on reading (or at least skimming) our manuscript and writing comments. Normally, this effort would be largely wasted because, like many other authors, we are going to ignore most of their well-meaning comments and suggestions and try to publish the manuscript mostly unchanged somewhere else. As the editor pointed out, we are hopeful that our manuscript will eventually be published because 95% of written manuscripts eventually get published. So, why change anything? However, we think the work of the editor and reviewers deserves some recognition, and some readers of our manuscript may find their comments valuable. Therefore, we are happy to share their comments for readers interested in replicability and our method of estimating replicability from test statistics in original articles.

 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Dear Dr. Brunner,

I have now received the reviewers’ comments on your manuscript. Based on their analysis and my own evaluation, I can no longer consider this manuscript for publication in Psychological Methods. There are two main reasons that I decided not to accept your submission. The first deals with the value of your statistical estimate of replicability. My first concern is that you define replicability specifically within the context of NHST by focusing on power and p-values. I personally have fewer problems with NHST than many methodologists, but given the fact that the literature is slowly moving away from this paradigm, I don’t think it is wise to promote a method to handle replicability that is unusable for studies that are conducted outside of it. Instead of talking about replicability as estimating the probability of getting a significant result, I think it would be better to define it in more continuous terms, focusing on how similar we can expect future estimates (in terms of effect sizes) to be to those that have been demonstrated in the prior literature. I’m not sure that I see the value of statistics that specifically incorporate the prior sample sizes into their estimates, since, as you say, these have typically been inappropriately low.

Sure, it may tell you the likelihood of getting significant results if you conducted a replication of the average study that has been done in the past. But why would you do that instead of conducting a replication that was more appropriately powered?

Reviewer 2 argues against the focus on original study/replication study distinction, which would be consistent with the idea of estimating the underlying distribution of effects, and from there selecting sample sizes that would produce studies of acceptable power. Reviewer 3 indicates that three of the statistics you discussed are specifically designed for single studies, and are no longer valid when applied to sets of studies, although this reviewer does provide information about how these can be corrected.

The second main reason, discussed by Reviewer 1, is that although your statistics may allow you to account for selection biases introduced by journals not accepting null results, they do not allow you to account for selection effects prior to submission. Although methodologists will often bring up the file drawer problem, it is much less of an issue than people believe. I read about a survey in a meta-analysis text (I unfortunately can’t remember the exact citation) that indicated that over 95% of the studies that get written up eventually get published somewhere. The journal publication bias against non-significant results is really more an issue of where articles get published, rather than if they get published. The real issue is that researchers will typically choose not to write up results that are non-significant, or will suppress non-significant findings when writing up a study with other significant findings. The latter case is even more complicated, because it is often not just a case of including or excluding significant results, but is instead a case where researchers examine the significant findings they have and then choose a narrative that makes best use of them, including non-significant findings when they are part of the story but excluding them when they are irrelevant. The presence of these author-side effects means that your statistic will almost always be overestimating the actual replicability of a literature.

The reviewers bring up a number of additional points that you should consider. Reviewer 1 notes that your discussion of the power of psychological studies is 25 years old, and therefore likely doesn’t apply. Reviewer 2 felt that your choice to represent your formulas and equations using programming code was a mistake, and suggests that you stick to standard mathematical notation when discussing equations. Reviewer 2 also felt that you characterized researcher behaviors in ways that were more negative than is appropriate or realistic, and that you should tone down your criticisms of these behaviors. As a grant-funded researcher, I can personally promise you that a great many researchers are concerned about power, since you cannot receive government funding without presenting detailed power analyses. Reviewer 2 noted a concern with the use of web links in your code, in that this could be used to identify individuals using your syntax. Although I have no suspicions that you are using this to keep track of who is reviewing your paper, you should remove those links to ensure privacy. Reviewer 1 felt that a number of your tables were not necessary, and both reviewers 2 and 3 felt that there were parts of your writing that could be notably condensed. You might consider going through the document to see if you can shorten it while maintaining your general points. Finally, reviewer 3 provides a great many specific comments that I feel would greatly enhance the validity and interpretability of your results. I would suggest that you attend closely to those suggestions before submitting to another journal.

For your guidance, I append the reviewers’ comments below and hope they will be useful to you as you prepare this work for another outlet.

Thank you for giving us the opportunity to consider your submission.

Sincerely, Jamie DeCoster, PhD
Associate Editor
Psychological Methods

 

Reviewers’ comments:

Reviewer #1:

The goals of this paper are admirable and are stated clearly here: “it is desirable to have an alternative method of estimating replicability that does not require literal replication. We see this method as complementary to actual replication studies.”

However, I am bothered by an assumption of this paper, which is that each study has a power (for example, see the first two paragraphs on page 20). This bothers me for several reasons. First, any given study in psychology will often report many different p-values. Second, there is the issue of p-hacking or forking paths. The p-value, and thus the power, will depend on the researcher’s flexibility in analysis. With enough researcher degrees of freedom, power approaches 100% no matter how small the effect size is. Power in a preregistered replication is a different story. The authors write, “Selection for significance (publication bias) does not change the power values of individual studies.” But to the extent that there is selection done _within_ a study–and this is definitely happening–I don’t think that quoted sentence is correct.

So I can’t really understand the paper as it is currently written, as it’s not clear to me what they are estimating, and I am concerned that they are not accounting for the p-hacking that is standard practice in published studies.

Other comments:

The authors write, “Replication studies ensure that false positives will be promptly discovered when replication studies fail to confirm the original results.” I don’t think “ensure” is quite right, since any replication is itself random. Even if the null is true, there is a 5% chance that a replication will confirm just by chance. Also many studies have multiple outcomes, and if any appears to be confirmed, this can be taken as a success. Also, replications will not just catch false positives, they will also catch cases where the null hypothesis is false but where power is low. Replication may have the _goal_ of catching false positives, but it is not so discriminating.

The Fisher quote, “A properly designed experiment rarely fails to give …significance,” seems very strange to me. What if an experiment is perfectly designed, but the null hypothesis happens to be true? Then it should have a 95% chance of _not_ giving significance.

The authors write, “Actual replication studies are needed because they provide more information than just finding a significant result again. For example, they show that the results can be replicated over time and are not limited to a specific historic, cultural context. They also show that the description of the original study was sufficiently precise to reproduce the study in a way that it successfully replicated the original result.” These statements seem too strong to me. Successful replication is rejection of the null, and this can happen even if the original study was not described precisely, etc.

The authors write, “A common estimate of power is that average power is about 50% (Cohen 1962, Sedlmeier and Gigerenzer 1989). This means that about half of the studies in psychology have less than 50% power.” I think they are confusing the mean with the median here. Also I would guess that 50% power is an overestimate. For one thing, psychology has changed a lot since 1962 or even 1989 so I see no reason to take this 50% guess seriously.

The authors write, “We define replicability as the probability of obtaining the same result in an exact replication study with the same procedure and sample sizes.” I think that by “exact” they mean “pre-registered” but this is not clear. For example, suppose the original study was p-hacked. Then, strictly speaking, an exact replication would also be p-hacked. But I don’t think that’s what the authors mean. Also, it might be necessary to restrict the definition to pre-registered studies with a single test. Otherwise there is the problem that a paper has several tests, and any rejection will be taken as a successful replication.

I recommend that the authors get rid of tables 2-15 and instead think more carefully about what information they would like to convey to the reader here.

Reviewer #2:

This paper is largely unclear, and in the areas where it is clear enough to decipher, it is unwise and unprofessional.

This study’s main claim seems to be: “Thus, statistical estimates of replicability and the outcome of replication studies can be seen as two independent methods that are expected to produce convergent evidence of replicability.” This is incorrect. The approaches are unrelated. Replication of a scientific study is part of the scientific process, trying to find out the truth. The new study is not the judge of the original article, its replicability, or scientific contribution. It is merely another contribution to the scientific literature. The replicator and the original article are equals; one does not have status above the other. And certainly a statistical method applied to the original article has no special status unless the method, data, or theory can be shown to be an improvement on the original article.

They write, “Rather than using traditional notation from Statistics that might make it difficult for non-statisticians to understand our method, we use computer syntax as notation.” This is a disqualifying stance for publication in a serious scholarly journal, and it would be an embarrassment to any journal or author to publish these results. The point of statistical notation is clarity, generality, and cross-discipline understanding. Computer syntax is specific to the language adopted, is not general, and is completely opaque to anyone who uses a different computer language. Yet everyone who understands their methods will have at least seen, and needs to understand, statistical notation. Statistical (i.e., mathematical) notation is the one general language we have that spans the field and different fields. No computer syntax does this. Proofs and other evidence are expressed in statistical notation, not computer syntax in the (now largely unused) S statistical language. Computer syntax, as used in this paper, is also ill-defined in that any quantity defined by a primitive function of the language can change any time, even after publication, if someone changes the function. In fact, the S language, used in this paper, is not equivalent to R, and so the authors are incorrect that R will be more understandable. Not including statistical notation, when the language of the paper is so unclear and self-contradictory, is an especially unfortunate decision. (As it happens I know S and R, but I find the manuscript very difficult to understand without imputing my own views about what the authors are doing. This is unacceptable. It is not even replicable.) If the authors have claims to make, they need to state them in unambiguous mathematical or statistical language and then prove their claims. They do not do any of these things.

It is untrue that “researchers ignore power”. If they do, they will rarely find anything of interest. And they certainly write about it extensively. In my experience, they obsess over power, balancing whether they will find something with the cost of doing the experiment. In fact, this paper misunderstands and misrepresents the concept: Power is not “the long-run probability of obtaining a statistically significant result.” It is the probability that a statistical test will reject a false null hypothesis, as the authors even say explicitly at times. These are very different quantities.

This paper accuses “researchers” of many other misunderstandings. Most of these are theoretically incorrect or empirically incorrect. One point of the paper seems to be “In short, our goal is to estimate average power of a set of studies with unknown population effect sizes that can assume any value, including zero.” But I don’t see why we need to know this quantity or how the authors’ methods contribute to us knowing it. The authors make many statistical claims without statistical proofs, without any clear definition of what their claims are, and without empirical evidence. They use simulation that inquires about a vanishingly small portion of the sample space to substitute for an infinite domain of continuous parameter values; they need mathematical proofs but do not even state their claims in clear ways that are amenable to proof.

No coherent definition is given of the quantity of interest. “Effect size” is not generic and hypothesis tests are not invariant to the definition, even if it is true that they are monotone transformations of each other. One effect size can be “significant” and a transformation of the effect size can be “not significant” even if calculated from the same data. This alone invalidates the authors’ central claims.

The first 11.5 pages of this paper should be summarized in one paragraph. The rest does not seem to contribute anything novel. Much of it is incorrect as well. Better to delete throat clearing and get on with the point of the paper.

I’d also like to point out that the authors have hard-coded URL links to their own web site in the replication code. The code cannot be run without making a call to the authors’ web site, and recording the reviewer’s IP address in the authors’ web logs. Because this enables the authors to track who is reviewing the manuscript, it is highly inappropriate. It also makes it impossible to replicate the authors’ results. Many journals (and all federal grants) have prohibitions on this behavior.

I haven’t checked whether Psychological Methods has this rule, but the authors should know better regardless.

Reviewer #3:

Review of “How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies”

It was my pleasure to review this manuscript. The authors compare four methods of estimating replicability. One undeniable strength of the general approach is that these measures of replicability can be computed before or without actually replicating the study/studies. As such, one can see the replicability measure of a set of statistically significant findings as an index of trust in these findings, in the sense that the measure provides an estimate of the percentage of these studies that is expected to be statistically significant when replicating them under the same conditions and same sample size (assuming the replication study and the original study assess the same true effect). As such, I see value in this approach. However, I have many comments, major and minor, which will enable the authors to improve their manuscript.

Major comments

1. Properties of index.

What I miss, and what would certainly be appreciated by the reader, is a description of the properties of the replicability index. This would include that it has a minimum value equal to 0.05 (or more generally, alpha), when the set of statistically significant studies has no evidential value. Its maximum value equals 1, when the power of the studies included in the set was very large. A value of .8 corresponds to the situation where the statistical power of the original studies was .8, as is often recommended. Finally, I would add that both sample size and true effect size affect the replicability index; a high value of say .8 can be obtained when the true effect size is small in combination with a large sample size (you can consider giving a value of N, here), or with a large true effect size in combination with a small sample size (again, consider giving values).
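For example, a quick calculation with base R’s power.t.test (assuming a two-sample t-test with alpha = .05 and sd = 1; the exact numbers are only meant as an illustration) gives the kind of values I have in mind:

# Two very different studies with the same replicability of about .80
power.t.test(delta = 0.2, power = .80)   # small effect: roughly n = 394 per group
power.t.test(delta = 0.8, power = .80)   # large effect: roughly n = 26 per group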

Consider giving a story like this early, e.g. bottom of page 6.

2. Too long explanations/text

Perhaps it is a matter of taste, but sometimes I consider explanations much too long. Readers of Psychological Methods may be expected to know some basics. To give you an example, the text on page 7 in “Introduction of Statistical Methods for Power estimation” is very long. I believe its four paragraphs can be summarized into just one; particularly the first one can be summarized in one or two sentences. Similarly, the section on “Statistical Power” can be shortened considerably, imo. Other specific suggestions for shortening the text, I mention below in the “minor comments” section. Later on I’ll provide one major comment on the tables, and how to remove a few of them and how to combine several of them.

3. Wrong application of ML, p-curve, p-uniform

This is THE main comment, imo. The problem is that ML (Hedges, 1984), p-curve, and p-uniform enable the estimation of effect size based on just ONE study. Moreover, Simonsohn (p-curve) as well as the authors of p-uniform would argue against estimating the average effect size of unrelated studies. These methods are meant to meta-analyze studies on ONE topic.

4. P-uniform and p-curve section, and ML section

This section needs a major revision. First, I would start the section with describing the logic of the method. Only statistically significant results are selected. Conditional on statistical significance, the methods are based on conditional p-values (not just p-values), and then I would provide the formula on top of page 18. Most importantly, these techniques are not constructed for estimating effect size of a bunch of unrelated studies. The methods should be applied to related studies. In your case, to each study individually. See my comments earlier.

Ln(p), which you use in your paper, is not a good idea here for two reasons: (1) it is most sensitive to heterogeneity (which is also put forward by Van Assen et al., 2014), and (2) applied to single studies it estimates effect size such that the conditional p-value equals 1/e, rather than .5 (resulting in less nice properties).

The ML method, as it was described, focuses on estimating effect size using one single study (see Hedges, 1984). So I was very surprised to see it applied differently by the authors. Applying ML in the context of this paper should be the same as p-uniform and p-curve, using exactly the same conditional probability principle. So, the only difference between the three methods is the method of optimization. That is the only difference.

You develop a set-based ML approach, which needs to assume a distribution of true effect size. As said before, I leave it up to you whether you still want to include this method. For now, I have a slight preference to include the set-based approach because it (i) provides a nice reference to your set-based approach, called z-curve, and (ii) using this comparison you can “test” how robust the set-based ML approach is against a violation of the assumption of the distribution of true effect size.

Moreover, I strongly recommend showing how their estimates differ for certain studies, and including this in a table. This allows you to explain the logic of the methods very well. Here is a suggestion. I would provide the estimates of the four methods (…) for p-values .04, .025, .01, .001, and perhaps .0001. This will be extremely insightful. For small p-values, the three methods’ estimates will be similar to the traditional estimate. For p-values > .025, the estimate will be negative; for p = .025 the estimate will be (close to) 0. Then, you can also use these same studies and p-values to calculate the power of a replication study (R-index).
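A minimal sketch in R of this single-study logic (my own illustration, assuming a z-test selected for one-sided significance at .025; the function names are placeholders) would be:

z_crit <- qnorm(.975)                              # selection criterion (one-sided .025)
cond_p <- function(d, z_obs)                       # conditional p-value given significance
  pnorm(z_obs - d, lower.tail = FALSE) / pnorm(z_crit - d, lower.tail = FALSE)
est_ncp <- function(z_obs)                         # ncp at which the conditional p-value is .5
  uniroot(function(d) cond_p(d, z_obs) - 0.5, c(-10, 10))$root
p_two <- c(.04, .025, .01, .001)                   # observed two-sided p-values
z_obs <- qnorm(1 - p_two / 2)
ncp   <- sapply(z_obs, est_ncp)
power <- pnorm(z_crit - ncp, lower.tail = FALSE) + pnorm(-z_crit - ncp)  # two-sided replication power
round(data.frame(p = p_two, ncp = ncp, power = power), 3)
# Replacing 0.5 by exp(-1) gives the ln(p) variant discussed above.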

I would exclude Figure 1, and the corresponding text. It is not (or no longer) necessary.

For the set-based ML approach, if you still include it, please explain how you get to the true value distribution (g(theta)).

5a. The MA set, and test statistics

Many different effect sizes and test statistics exist. Many of them can be transformed to ONE underlying parameter, with a sensible interpretation and certain statistical properties. For instance, the chi2, t, and F(1,df) can all be transformed to d or r, and their SE can be derived. In the RPP project and by Johnson et al (2016) this is called the MA set. Other test statistics, such as F(>1, df), cannot be converted to the same metric, and no SE is defined on that metric. Therefore, the statistics F(>1,df) were excluded from the meta-analyses in the RPP (see the supplementary materials of the RPP) and by Johnson et al (2016) and also Morey and Lakens (2016), who also re-analyzed the data of the RPP.

Fortunately, in your application you do not estimate effect size but only estimate power of a test, which only requires estimating the ncp and not effect size. So, in principle you can include the F(>1,df) statistics in your analyses, which is a definite advantage. Although I can see you can incorporate it for the ML, p-curve, p-uniform approach, I do not directly see how these F(>1,df) statistics can be used for the two set-based methods (ML and z-curve); in the set-based methods, you put all statistics on one dimension (z) using the p-values. How do you defend this?

5b. Z-curve

Some details are not clear to me, yet. How many components (called r in your text) are selected, and why? Your text states: “First, select a ncp parameter m. Then generate Z from a normal distribution with mean m.” I do not understand, since the normal distribution does not have an ncp. Is it that you nonparametrically model the distribution of observed Z, with different components?

Why do you use kernel density estimation? What is its added value? Why make it more imprecise by having this step in between? Please explain.

Except for these details, the procedure and logic of z-curve are clear.

6. Simulations (I): test statistics

I have no reason, theoretical or empirical, to expect that the analyses would provide different results for Z, t, F(1,df), F(>1,df), and chi2. Therefore, I would omit all simulation results of all statistics except one, and not talk about results of these other statistics. For instance, in the simulations section I would state that results are provided for each of these statistics but present here only the results of t, and the others in supplementary info. When applying the methods to RPP, you apply them to all statistics simultaneously, which you could mention in the text (see also comment 4 above).

7. mean or median power (important)

One of my most important points is the assessment of replicability itself. Consider a set of studies for which replicability is calculated, for each study. So, in case of M studies, there are M replicability indices. Which statistics would be most interesting to report, i.e., are most informative? Note that the distribution of power is far from symmetrical, and actually may be bimodal with modes at 0.05 and 1. For that reason alone, I would include in any report of replicability in a field the proportion of R-indices equal to 0.05 (which amounts to the proportion of results with .025 < p < .05) and the proportion of R-indices equal to 1.00 (e.g., using two decimals, i.e. > .995). Moreover, because power values of .8 or more are recommended, I also could include the proportion of studies with power > .8.

We also would need a measure of central tendency. Because the distribution is not symmetric, and may be skewed, I recommend using the median rather than the mean. Another reason to use the median rather than the mean is that the mean does not provide usable information on whether methods are biased or not in the simulations. For instance, if the true effect size = 0, because of sampling error the observed power will be below .05 in exactly 50% of the cases (this is the case for p-uniform, since with probability .5 the p-value will exceed .025) and larger than .05 in the other 50% of the cases. Hence, the median will be exactly equal to .05, whereas the mean will exceed .05. Similarly, if the true effect size is large, the mean power will be too small (distribution skewed to the left). To conclude, I strongly recommend including the median in the results of the simulation.

In a report, such as for the RPP later on in the paper, I recommend including (i) p(R=.05), (ii) p(R >= .8), (iii) p(R >= .995), (iv) median(R), (v) sd(R), (vi) distribution R, (vii) mean R. You could also distinguish this for soc psy and cog psy.

8. simulations (II): selection of conditions

I believe it is unnatural to select conditions based on “mean true power” because we are most familiar with effect sizes and their distribution, and sample sizes and their distribution. I recommend describing these distributions, and then the implied power distribution (surely the median value as well, not, or not only, the mean).

9.  Omitted because it could reveal identity of reviewer

10. Presentation of results

I have comments on what you present and on how you present the results. First, what you present. For the ML and p-methods, I recommend presenting the distribution of R in each of the conditions (at least for fixed true effect size and fixed N, where results can be derived exactly relatively easily). For the set-based methods, if you focus on average R (which I do not recommend; I recommend median R), then present the RMSE. The median absolute error is minimized when you use the median. So, the average pairs with the RMSE, and the median pairs with the absolute error.

Now the presentation of results. Results of p-curve/p-uniform/ML are independent of the number of tests, but set-based methods (your ML variant) and z-curve are not.

Here the results I recommend presenting:

Fixed effect size, heterogeneity sample size

**For single-study methods, the probability distribution of R (figure), including mean(R), median(R), p(R=.05), p(R>= .995), sd(R). You could use simulation for approximating this distribution. Figures look like those in Figure 3, to the right.

**Median power, mean/sd as a function of K

**Bias for ML/p-curve/p-uniform amounts to the difference between median of distribution and the actual median, or the difference between the average of the distribution and the actual average. Note that this is different from set-based methods.

**For set-based methods, a table is needed (because of its dependence on k).

Results can be combined in one table (i.e., Tables 2-3, 5-6, etc.)

Significance tests comparing methods

I would exclude Table 4, Table 7, Table 10, Table 13. These significance tests do not make much sense. One method is better than another, or not – significance should not be relevant (for a very large number of iterations, a true difference will show up). You could simply describe in the text which method works best.

Heterogeneity in both sample size and effect size

You could provide similar results as for fixed effect size (but not for chi2, or other statistics). I would also use the same values of k as for the fixed effect case. For the fixed effect case you used 15, 25, 50, 100, 250. I can imagine using as values of k for both conditions k = 10, 30, 100, 400, 2,000 (or something).

Including the k = 10 case is important, because set-based methods will have more problems there, and because one paper or a meta-analysis or one author may have published just one or few statistically significant effect sizes. Note, however, that k=2,000 is only realistic when evaluating a large field.

Simulation of complex heterogeneity

Same results as for fixed effect size and heterogeneity in both sample size and effect size. Good to include a condition where the assumption of set-based ML is violated. I do not yet see why a correlation between N and ES may affect the results. Could you explain? For instance, for the ML/p-curve/p-uniform methods, all true effect sizes in combination with N result in a distribution of R for different studies; how this distribution is arrived at, is not relevant, so I do not yet see the importance of this correlation. That is, this correlation should only affect the results through the distribution of R. More reasoning should be provided, here.

Simulation of full heterogeneity

I am ambivalent about this section. If the test statistic should not matter, then what is the added value of this section? Other distributions of sample size may be incorporated in the previous section “complex heterogeneity”. Other distributions of true effect may also be incorporated in the previous section. Note that Johnson et al (2016) use the RPP data to estimate that 90% of effects in psychology estimate a true zero effect. You assume only 10%.

Conservative bootstrap

Why present only the results of z-curve? By changing the limits of the interval, the interpretation becomes a bit awkward; what kind of interval is it now? Most importantly, coverages of .9973 or .9958 are horrible (in my opinion, these coverages are just as bad as coverages of .20). I prefer results of 95% confidence intervals, and then show their coverages in the table. Your ‘conservative’ CIs are hard to interpret. Note also that this is a paper on the statistical properties of the methods, and one property is how well the methods perform w.r.t. the 95% CI.

By the way, examining 95% CI of the methods is very valuable.

11. RPP

In my opinion, this section should be expanded substantially. This is where you can finally test your methodology, using real data! What I would add is the following:

**Provide the distribution of R (including all statistics mentioned previously, i.e. p(R=0.05), p(R >= .8), p(R >= .995), median(R), mean(R), sd(R)), using single-study methods

**Provide the previously mentioned results for soc psy and cog psy separately

**Provide results of z-curve, and show your kernel density curve (strange that you never show this curve, if it is important in your algorithm).

What would be really great is if you predict the probability of replication success (power) using the effect size estimate from the original study (derived from a single study) and the N of the replication sample. You could make a graph with this power on the X-axis, and the result of the replication on the Y-axis. Strong evidence in favor of your method would be if your result predicts future replicability better than any other index (see RPP for what they tried). Logistic regression seems to be the most appropriate technique for this.

Using multiple logistic regression, you can also assess if other indices have an added value above your predictions.
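A sketch of what I mean (in R; the data frame and variable names are placeholders, with one row per original/replication pair):

# Does the power implied by the original study predict actual replication success?
m1 <- glm(success ~ power_hat, family = binomial, data = rpp)
summary(m1)
# Do other indices (e.g., the original p-value) add predictive value beyond power_hat?
m2 <- glm(success ~ power_hat + p_orig, family = binomial, data = rpp)
anova(m1, m2, test = "Chisq")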

To conclude, for now you provide too limited results to convince readers that your approach is very useful.

Minor comments

P4 top: “heated debates”. A few more sentences on this debate, including references to those debates, would be fair. I would like to mention/recommend the studies of Maxwell et al (2015) in American Psychologist, the comment on the OSF piece in Science, and its response, and the very recent piece of Valen E Johnson et al (2016).

P4, middle: consider starting a new paragraph at “Actual replication”. In the sentence after this one, you may add “or not”.

Another advantage of replication is that it may reveal heterogeneity (context dependence). Here, you may refer to the ManyLabs studies, which indeed reveal heterogeneity in about half of the replicated effects. Then, the next paragraph may start with “At the same time”. To conclude, this piece starting with “Actual replication” can be expanded a bit.

P4, bottom, “In contrast”: This and the preceding sentence are formulated as if sampling error does not exist. It is much too strong! Moreover, if the replication study had low power, sampling error is likely the reason for a statistically insignificant result. Here you can be more careful/precise. The last sentence of this paragraph is perfect.

P5, middle: consider adding more refs on estimates of power in psychology, e.g. Bakker and Wicherts’ 35% and that study on neuroscience with power estimates close to 20%. Last sentence of the same paragraph: assuming the same true effect and the same sample size.

P6, first paragraph around Rosenthal. Consider referring to the study of Johnson et al (2016), who used a Bayesian analysis to estimate how many non-significant studies remain unpublished.

P7, top: “studies have the same power (homogenous case)” / “(heterogenous case)”. This is awkward. Homogeneity and heterogeneity are generally reserved for variation in true effect size. Stick to that. Another problem here is that “heterogeneous” power can be created by “heterogeneity” in sample size and/or heterogeneity in effect size. These should be distinguished, because some methods can deal with heterogeneous power caused by heterogeneous N, but not with heterogeneous true effect size. So, here, I would simply delete the text between brackets.

P7, last sentence of first paragraph; I do not understand the sentence.

P10, “average power”. I did not understand this sentence.

P10, bottom: Why do you believe these methods to be most promising?

P11, 2nd par: Rephrase this sentence. Heterogeneity of effect size is not because of sampling variation. Later in this paragraph you also mix up heterogeneity with variation in power again. Of course, you could re-define heterogeneity, but I strongly recommend not doing so (in order not to confuse others); reserve heterogeneity to heterogeneity in true effect size.

P11, 3rd par, 1st sentence: I do not understand this sentence. But then again, this sentence may not be relevant (see major comments), because for applying p-uniform and p-curve heterogeneity of effect size is not relevant.

P11 bottom: maximum likelihood method. This sentence is not specific enough. But then again, this sentence may not be relevant (see major comments).

P12: Statistics without capital.

P12: “random sampling distribution”: delete “random”. By the way, I liked this section on Notation and statistical background.

Section “Two populations of power”. I believe this section is unnecessarily long, with a lot of text. Consider shortening. The spinning wheel analogy is ok.

P16, “close to the first” You mean second?

P16, last paragraph, 1st sentence: English?

Principle 2: The effect on what? Delete last sentence in the principle.

P17, bottom: include the average power after selection in your example.

p-curve/p-uniform: modify, as explained in one of the major comments.

P20, last sentence: Modify the sentence – the ML approach has excellent properties asymptotically, but not when the sample size is small. Now it states that it generally yields more precise estimates.

P25, last sentence of 4. Consider deleting this sentence (does not add anything useful).

P32: “We believe that a negative correlation between” some part of sentence is missing.

P38, penultimate sentence: explain what you mean by “decreasing the lower limit by .02” and “increasing the upper limit by .02”.

How replicable is psychology? A comparison of four methods of estimating replicability on the basis of test statistics in original studies

Manuscript under review, copyright belongs to Jerry Brunner and Ulrich Schimmack


Jerry Brunner and Ulrich Schimmack
University of Toronto @ Mississauga

Abstract
In the past five years, the replicability of original findings published in psychology journals has been questioned. We show that replicability can be estimated by computing the average power of studies. We then present four methods that can be used to estimate average power for a set of studies that were selected for significance: p-curve, p-uniform, maximum likelihood, and z-curve. We present the results of large-scale simulation studies with both homogeneous and heterogeneous effect sizes. All methods work well with homogeneous effect sizes, but only maximum likelihood and z-curve produce accurate estimates with heterogeneous effect sizes. All methods overestimate replicability using the Open Science Collaborative reproducibility project and we discuss possible reasons for this. Based on the simulation studies, we recommend z-curve as a valid method to estimate replicability. We also validated a conservative bootstrap confidence interval that makes it possible to use z-curve with small sets of studies.

Keywords: Power estimation, Post-hoc power analysis, Publication bias, Maximum likelihood, P-curve, P-uniform, Z-curve, Effect size, Replicability, Simulation.

Link to manuscript:  http://www.utstat.utoronto.ca/~brunner/zcurve2016/HowReplicable.pdf

Link to website with technical supplement:
http://www.utstat.utoronto.ca/~brunner/zcurve2016/
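For readers who prefer a concrete illustration of the estimand before diving into the manuscript, the following small simulation in R (my own sketch, not the z-curve algorithm itself; the distribution of noncentrality parameters is an arbitrary choice) shows that the average power of studies selected for significance equals the expected success rate of exact replications of those studies:

set.seed(123)
k     <- 1e5
ncp   <- abs(rnorm(k, mean = 1))            # hypothetical heterogeneous noncentrality parameters
z     <- rnorm(k, mean = ncp)               # observed z-statistics of the original studies
sig   <- z > qnorm(.975)                    # selection for significance (one-sided .025)
power <- pnorm(qnorm(.975) - ncp, lower.tail = FALSE)
mean(power[sig])                            # average power after selection: the estimand
z_rep <- rnorm(sum(sig), mean = ncp[sig])   # exact replications of the significant studies
mean(z_rep > qnorm(.975))                   # replication success rate, approximately equal to the estimand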

 

 

A Critical Review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”

In this review of Schwarz and Strack’s (1999) “Reports of Subjective Well-Being: Judgmental Processes and Their Methodological Implications”, I present verbatim quotes from their chapter and explain why these statements are misleading or false, and how the authors distort the actual evidence by selectively citing research that supports their claims, while hiding evidence that contradicts their claims. I show that the empirical evidence for the claims made by Schwarz and Strack is weak and biased.

Unfortunately, this chapter has had a strong influence on Daniel Kahneman’s attitude towards life-satisfaction judgments, and his fame as a Nobel laureate has led many people to believe that life-satisfaction judgments are highly sensitive to the context in which these questions are asked and practically useless for the measurement of well-being. This has led to claims that wealth is not a predictor of well-being, but only a predictor of invalid life-satisfaction judgments (Kahneman et al., 2006), or that the effects of wealth on well-being are limited to low incomes. None of these claims are valid because they rely on the unsupported assumption that life-satisfaction judgments are invalid measures of well-being.

The original quotes are highlighted in bold followed by my comments.

Much of what we know about individuals’ subjective well-being (SWB) is based on self-reports of happiness and life satisfaction.

True. The reason is that sociologists developed brief, single-item measures of well-being that could be included easily in large surveys such as the World Value Survey, the German Socio-Economic Panel, or the US General Social Survey.  As a result, there is a wealth of information about life-satisfaction judgments that transcends scientific disciplines. The main contribution of social psychologists to this research program that examines how social factors influence human well-being has been to dismiss the results based on claims that the measure of well-being is invalid.

As Angus Campbell (1981) noted, the “use of these measures is based on the assumption that all the countless experiences people go through from day to day add to . . . global feelings of well-being, that these feelings remain relatively constant over extended periods, and that people can describe them with candor and accuracy”

Half true. Like all self-report measures, the validity of life-satisfaction judgments depends on respondents’ ability and willingness to provide accurate information. However, it is not correct to suggest that life-satisfaction judgments assume that feelings remain constant over extended periods of time or that respondents have to rely on feelings to answer questions about their satisfaction with life. There is a long tradition in the well-being literature of distinguishing cognitive measures of well-being like Cantril’s ladder and affective measures that focus on affective experiences in the recent past like Bradburn’s affect balance scale. The key assumption underlying life-satisfaction judgments is that respondents have chronically accessible information about their lives or can accurately estimate the frequency of positive and negative feelings. It is not necessary that the feelings are stable.

These assumptions have increasingly been drawn into question, however, as the empirical work has progressed.

It is not clear which assumptions have been drawn into question. Are people unwilling to report their well-being, are they unable to do so, or are feelings not as stable as they are assumed to be? Moreover, the statement ignores a large literature that has demonstrated the validity of well-being measures going back to the 1930s (see Diener et al., 2009; Schneider & Schimmack, 2009, for a meta-analysis).

First, the relationship between individuals’ experiences and objective conditions of life and their subjective sense of well-being is often weak and sometimes counter-intuitive.  Most objective life circumstances account for less than 5 percent of the variance in measures of SWB, and the combination of the circumstances in a dozen domains of life does not account for more than 10 percent (Andrews and Whithey 1976; Kammann, 1982; for a review, see Argyle, this volume).

 

First, it is not clear what “weak” means. How strong should the correlation between objective conditions of life and subjective well-being be? For example, should marital status be a strong predictor of happiness? Maybe it matters more whether people are happily married or unhappily married than whether they are married or single. Second, there is no explanation for the claim that these relationships are counter-intuitive. Employment, wealth, and marriage are positively related to well-being, as most people would expect. The only finding in the literature that may be considered counter-intuitive is that having children does not notably increase well-being and sometimes decreases well-being. However, this does not mean well-being measures are false; it may mean that people’s intuitions about the effects of life-events on well-being are wrong. If intuitions were always correct, we would not need scientific studies of the determinants of well-being.

 

Second, measures of SWB have low test-retest reliabilities, usually hovering around .40, and not exceeding .60 when the same question is asked twice during the same one-hour interview (Andrews and Whithey 1976; Glatzer 1984). 

 

This argument ignores that responses to a single self-report item often have a large amount of random measurement error, unless participants can recall their previous answer. The typical reliability of a single-item self-report measure is about r = .6 +/- .2. There is nothing unique about the results reported here for well-being measures. Moreover, the authors blatantly ignore evidence that scales with multiple items like Diener’s Satisfaction with Life Scale have retest correlations over r = .8 over a one-month period (see Schimmack & Oishi, 2005, for a meta-analysis). Thus, this statement is misleading and factually incorrect.

 

Moreover, these measures are extremely sensitive to contextual influences.

 

This claim is inconsistent with the high retest correlation over periods of one month. Moreover, survey researchers have conducted numerous studies in which they examined the influence of the survey context on well-being measures and a meta-analysis of these studies shows only a small effect of previous items on these judgments and the pattern of results is not consistent across studies (see Schimmack & Oishi, 2005 for a meta-analysis).

 

Thus, minor events, such as finding a dime (Schwarz 1987) or the outcome of soccer games (Schwarz et al. 1987), may profoundly affect reported satisfaction with one’s life as a whole.

 

As I will show, the chapter makes many statements about what may happen. For example, finding a dime may profoundly affect a well-being report or it may not have any effect on these judgments. These statements are correct because well-being reports can be made in many different ways. The real question is how these judgments are made when well-being measures are used to measure well-being. Experimental studies that manipulate the situation cannot answer this question because they purposefully create the situation to demonstrate that respondents may use mood (when mood is manipulated) or may use information that is temporarily accessible, when relevant information is made salient and temporarily accessible. The processes underlying judgments in these experiments may reveal influences on life-satisfaction judgments in a real survey context, or they may reveal processes that do not occur under normal circumstances.

 

Most important, however, the reports are a function of the research instrument and are strongly influenced by the content of preceding questions, the nature of the response alternatives, and other “technical” aspects of questionnaire design (Schwarz and Strack 1991a, 1991b).

 

We can get different answers to different questions.  The item “So far, I have gotten everything I wanted in life” may be answered differently than the item “I feel good about my life, these days.”  If so, it is important to examine which of these items is a better measure of well-being.  It does not imply that all well-being items are flawed.  The same logic applies to the response format.  If some response formats produce different results than others, it is important to determine which response formats are better for the measurement of well-being.  Last, but not least, the claim that well-being reports are “strongly influenced by the content of preceding questions” is blatantly false.  A meta-analysis shows that strong effects were only observed in two studies by Strack, but that other studies find much weaker or no effects (see Schimmack & Oishi, 2005, for a meta-analysis).

 

Such findings are difficult to reconcile with the assumption that subjective social indicators directly reflect stable inner states of well-being (Campbell 1981) or that the reports are based on careful assessments of one’s objective conditions in light of one’s aspirations (Glatzer and Zapf 1984). Instead, the findings suggest that reports of SWB are better conceptualized as the result of a judgment process that is highly context-dependent.

 

Indeed. A selective and biased list of evidence is inconsistent with the hypothesis that well-being reports are valid measures of well-being, but this only shows that the authors misrepresent the evidence, not that well-being reports lack validity. Their validity was carefully examined in Andrews and Withey’s (1976) book, which the authors cite without mentioning the evidence presented in the book for the usefulness of well-being reports.

 

[A PREVIEW]

 

Not surprisingly, individuals may draw on a wide variety of information when asked to assess the subjective quality of their lives.

 

Indeed. This means that it is impossible to generalize from an artificial context created in an experiment to the normal conditions of a well-being survey because respondents may use different information in the experiment than in the naturalistic context. The experiment may lead respondents to use information that they normally would not use.

 

[USING INFORMATION ABOUT ONE’S OWN LIFE: INTRAINDIVIDUAL COMPARISONS]

 

Comparison-based evaluative judgments require a mental representation of the object of judgment, commonly called a target, as well as a mental representation of a relevant standard to which the target can be compared.

 

True. In fact, Cantril’s ladder explicitly asks respondents to compare their actual life to the best possible life they could have and the worst possible life they could have.  We can think about these possible lives as imaginary intrapersonal comparisons.

 

When asked, “Taking all things together, how would you say things are these days?” respondents are ideally assumed to review the myriad of relevant aspects of their lives and to integrate them into a mental representation of their life as a whole.”

 

True, this is the assumption underlying the use of well-being reports as measures of well-being.

 

In reality, however, individuals rarely retrieve all information that may be relevant to a judgment

 

This is also true. It is impossible to retrieve ALL of the relevant information. But it is possible that respondents retrieve most of the relevant information or enough relevant information to make these judgments valid. We do not require 100% validity for measures to be useful.

 

Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).

 

This is also plausible. The question is what would be the criterion for sufficient certainty for well-being judgments and whether this level of certainty is reached without retrieval of relevant information. For example, if I have to report how satisfied I am with my life overall and I am thinking first about my marriage would I stop there or would I think that my overall life is more than my marriage and also think about my work?  Depending on the answer to this question, well-being judgments may be more or less valid.

 

Hence, the judgment is based on the information that is most accessible at that point in time. In general, the accessibility of information depends on the recency and frequency of its use (for a review, see Higgins 1996).

 

This also makes sense. A sick person may think about their health. A person in a happy marriage may think about their loving wife, and a person with financial problems may think about their problems paying bills. Any life domain that is particularly salient in a person’s life is also likely to be salient when they are confronted with a life-satisfaction question. However, we still do not know which information people will use and how much information they will use before they consider their judgment sufficiently accurate to provide an answer. Would they use just one salient, temporarily accessible piece of information or would they continue to look for more information?

 

Information that has just been used-for example, to answer a preceding question in the questionnaire-is particularly likely to come to mind later on, although only for a limited time.

 

Wait a second. Higgins emphasized that accessibility is driven by recency and frequency (!) of use. Individuals who are going through a divorce or cancer treatment have probably thought frequently about this aspect of their lives. A single question about their satisfaction with their recreational activities may not make them judge their lives based on their hobbies. Thus, it does not follow from Higgins’s work on accessibility that preceding items have a strong influence on well-being judgments.

 

This temporarily accessible information is the basis of most context effects in survey measurement and results in variability in the judgment when the same question is asked at different times (see Schwarz and Strack 1991b; Strack 1994a; Sudman, Bradburn, and Schwarz 1996, chs. 3 to 5; Tourangeau and Rasinski 1988)

 

Once more, the evidence for these temporary accessibility effects is weak, and it is not clear why well-being judgments would be highly stable over time if they were driven by making irrelevant information temporarily accessible. In fact, the evidence is more consistent with Higgins’ suggestion that frequency of use influences well-being judgments. Life domains that are salient to individuals are likely to influence life-satisfaction judgments because they are chronically accessible, even if other information is temporarily accessible or primed by preceding questions.

 

Other information, however, may come to mind because it is used frequently-for example, because it relates to the respondent’s current concerns (Klinger 1977) or life tasks (Cantor and Sanderson, this volume). Such chronically accessible information reflects important aspects of respondents’ lives and provides for some stability in judgments over time.

 

Indeed, but look at the wording. “This temporarily accessible information IS the basis of most context effects in survey measurement” vs. “Other information, however, MAY come to mind.”  The wording is not balanced and it does not match the evidence that most of the variation in well-being reports across individuals is stable over time and only a small proportion of the variance changes systematically over time. The wording is an example of how some scientists create the illusion of a balanced literature review while pushing their biased opinions.

 

As an example, consider experiments on question order. Strack, Martin, and Schwarz (1988) observed that dating frequency was unrelated to students’ life satisfaction when a general satisfaction question preceded a question about the respondent’s dating frequency, r = -.12. Yet reversing the question order increased the correlation to r = .66. Similarly, marital satisfaction correlated with general life satisfaction r = .32 when the general question preceded the marital one in another study (Schwarz, Strack, and Mai 1991). Yet reversing the question order again increased this correlation to r = .67.

 

The studies that are cited here are not representative. They show the strongest item-order effects, and the effects are much stronger than the meta-analytic average (Schimmack & Oishi, 2005). Both studies were conducted by Strack. Thus, these examples are at best considered examples of what might happen under very specific conditions that differ from other specific conditions where the effect was much smaller. Moreover, it is not clear why dating frequency should be a strong positive predictor of life-satisfaction. Why would my life be better when I have a lot of dates than the life of somebody who is in a steady relationship? And we would not expect a married respondent with lots of dates to be happy with their marriage. The difference between r = .32 and r = .66 is strong, but it was obtained with small samples, and it is common that small samples overestimate effect sizes. In fact, large survey studies show much weaker effects. In short, by focusing on these two examples, the authors create the illusion that strong effects of preceding items are common and that these studies are just an example of these effects. In reality, these are the only two studies with extremely and unusually strong effects that are not representative of the literature. The selective use of evidence is another example of unscientific practices that undermine a cumulative science.

 

Findings of this type indicate that preceding questions may bring information to mind that respondents would otherwise not consider.

 

Yes, it may happen, but we do not know under what specific circumstances it happens.  At present, the only predictor of these strong effects is that the studies were conducted by Fritz Strack. Nobody else has reported such strong effects.

 

If this information is included in the representation that the respondent forms of his or her life, the result is an assimilation effect, as reflected in increased correlations. Thus, we would draw very different inferences about the impact of dating frequency or marital satisfaction on overall SWB, depending on the order in which the questions are asked.

 

Now the authors extrapolate from extreme examples and discuss possible theoretical implications as if this were a consistent and replicable finding.  “We would draw different inferences.”  True. If this were a replicable finding and we asked about specific life domains first, we would end up with false inferences about the importance of dating and marriage for life-satisfaction. However, it is irrelevant what follows logically from a false assumption (if Daniel Kahneman had not won the Nobel Prize, it would be widely accepted that money buys some happiness). Second, it is possible to ask the global life-satisfaction question first without making information about specific aspects of life temporarily salient.  This simple procedure would ensure that well-being reports are more strongly influenced by chronically accessible information that reflects people’s life concerns.  After all, participants may draw on chronically accessible or temporarily accessible information, and if no relevant information was made temporarily accessible, respondents will use chronically accessible information.

 

Theoretically, the impact of a given piece of accessible information increases with its extremity and decreases with the amount and extremity of other information that is temporarily or chronically accessible at the time of judgment (see Schwarz and Bless 1992a). To test this assumption, Schwarz, Strack, and Mai (1991) asked respondents about their job satisfaction, leisure time satisfaction, and marital satisfaction prior to assessing their general life satisfaction, thus rendering a more varied set of information accessible. In this case, the correlation between marital satisfaction and life satisfaction increased from r = .32 (in the general-marital satisfaction order) to r = .46, yet this increase was less pronounced than the r = .67 observed when marital satisfaction was the only specific domain addressed.

 

This finding also suggests that strong effects of temporarily accessible information are highly context dependent. Just asking for satisfaction with several life domains reduces the item-order effect, and with the small samples in Schwarz et al. (1991), the difference between r = .32 and r = .46 is not statistically significant, meaning it could be a chance finding.  So, their own research suggests that temporarily accessible information may typically have a small effect on life-satisfaction, and this conclusion would be consistent with the evidence in the literature.
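As a rough check, the difference between two independent correlations can be tested with Fisher’s r-to-z transformation. The cell sizes below (about 50 respondents per question-order condition) are an assumption for illustration only; the exact ns from Schwarz, Strack, and Mai (1991) would be needed for a definitive answer.

```python
# Fisher r-to-z test for the difference between two independent correlations.
# Cell sizes (~50 per condition) are assumed purely for illustration.
import numpy as np
from scipy import stats

def compare_correlations(r1, n1, r2, n2):
    z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher transformation
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # SE of the difference
    z = (z2 - z1) / se
    p = 2 * stats.norm.sf(abs(z))                  # two-tailed p-value
    return z, p

# r = .32 (general question first) vs. r = .46 (several specific domains first)
z, p = compare_correlations(0.32, 50, 0.46, 50)
print(f"z = {z:.2f}, p = {p:.2f}")                 # roughly z = 0.8, p = .42
```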

 

In light of these findings, it is important to highlight some limits for the emergence of question-order effects. First, question-order effects of the type discussed here are to be expected only when answering a preceding question increases the temporary accessibility of information that is not chronically accessible anyway…  Hence, chronically accessible current concerns would limit the size of any emerging effect, and the more they do so, the more extreme the implications of these concerns are.

 

Here the authors acknowledge that there are theoretical reasons why item-order effects should typically not have a strong influence on well-being reports.  One reason is that some information, such as marital satisfaction, is likely to be used even if marriage is not made salient by a preceding question.  It is therefore not clear why marital satisfaction would produce a big increase from r = .32 to r = .67, as this would imply that numerous respondents do not consider their marriage when they make the judgment. It would also explain why other studies found much weaker item-order effects for marital satisfaction and higher correlations between marital satisfaction and life-satisfaction than r = .32.  However, it is interesting that this important theoretical point is offered only as a qualification after presenting evidence from two studies that did show strong item-order effects. If the argument had been presented first, the question would arise why these studies did produce strong item-order effects, and it would be evident that it is impossible to generalize from these specific studies to well-being reports in general.

 

[CONVERSATIONAL NORMS]

 

“Complicating things further, information rendered accessible by a preceding question may not always be used.”

 

How is this complicating things further?  If there are ways to communicate to respondents that they should not be influenced by previous items (e.g., “Now on to another topic” or “take a moment to think about the most important aspects of your life”) and this makes context effects disappear, why don’t we just use the proper conversational norms to avoid these undesirable effects? Some surveys actually do this, and we would therefore expect them to elicit valid reports of well-being that are not based on responses to previous questions in the survey.

 

In the above studies (Strack et al. 1988; Schwarz et al. 1991), the conversational norm of nonredundancy was evoked by a joint lead-in that informed respondents that they would now be asked two questions pertaining to their well-being. Following this lead-in, they first answered the specific question (about dating frequency or marital satisfaction) and subsequently reported their general life satisfaction. In this case, the previously observed correlations of r = .66 between dating frequency and life satisfaction, or of r = .67 between marital satisfaction and life satisfaction, dropped to r = -.15 and .18, respectively. Thus, the same question order resulted in dramatically different correlations, depending on the elicitation of the conversational norm of nonredundancy.

 

The only evidence for these effects comes from a couple of studies by the authors.  Even if these results hold, they suggest that it should be possible to use conversational norms to get the same results for both item orders if the norms suggest that participants should use all relevant chronically accessible information.  However, the authors did not conduct such a study. One reason may be that the prediction would be that there is no effect, and researchers are only interested in using manipulations that show effects so that they can reject the null hypothesis. Another explanation could be that Schwarz and Strack’s program of research on well-being reports was built on the heuristics-and-biases program in social psychology, which is only interested in showing biases and ignores evidence for accuracy (Funder, 1987). The only results that are deemed relevant and worthy of publishing are experiments that successfully created a bias in judgments. The problem with this approach is that it cannot reveal that these judgments are also accurate and can be used as valid measures of well-being.

 

[SUMMARY]

 

Judgments are based on the subset of potentially applicable information that is chronically or temporarily accessible at the time.

 

Yes, it is not clear what else the judgments could be based on.

 

Accessible information, however, may not be used when its repeated use would violate conversational norms of nonredundancy.

 

Interestingly, this statement would imply that participants are not influenced by subtle information (priming). The information has to be consciously accessible to determine whether it is relevant, and only accessible information that is considered relevant is assumed to influence judgments.  This also implies that making information accessible that is not considered relevant will not have an influence on well-being reports. For example, asking people about their satisfaction with the weather or the performance of a local sports team does not lead to a strong influence of this information on life-satisfaction judgments because most people do not consider this information relevant (Schimmack et al., 2002). Once more, it is not clear how well-being reports can be highly context dependent if information is carefully screened for relevance and responses are only made when sufficient relevant information has been retrieved.

 

[MENTAL CONSTRUALS OF ONE’S LIFE AND A RELEVANT STANDARD: WHAT IS, WAS, WILL BE, AND MIGHT HAVE BEEN]

 

Suppose that an extremely positive (or negative) life event comes to mind. If this event is included in the temporary representation of the target “my life now,” it results in a more positive (negative) assessment of SWB, reflecting an assimilation effect, as observed in an increased correlation in the studies discussed earlier. However, the same event may also be used in constructing a standard of comparison, resulting in a contrast effect: compared to an extremely positive (negative) event, one’s life in general may seem relatively bland (or pretty benign). These opposite influences of the same event are sometimes referred to as endowment (assimilation) and contrast effects (Tversky and Griffin 1991).

 

This is certainly a possibility, but it is not necessarily limited to temporarily accessible information.  A period in an individual’s life may be evaluated relative to other periods in that person’s life.  In this way, subjective well-being is subjective. Objectively identical lives can be evaluated differently because past experiences created different ideals or comparison standards (see Cantril’s early work on human concerns).  This may happen for chronically accessible information just as much as for temporarily accessible information, and it does not imply that well-being reports are invalid; it just shows that they are subjective.

 

Strack, Schwarz, and Gschneidinger (1985, Experiment 1) asked respondents to report either three positive or three negative recent life events, thus rendering these events temporarily accessible.  As shown in the top panel of Table 1, these respondents reported higher current life satisfaction after they recalled three positive rather than negative recent events. Other respondents, however, had to recall events that happened at least five years before. These respondents reported higher current life satisfaction after recalling negative rather than positive past events.

 

This finding shows that contrast effects can occur.  However, it is important to note that these context effects were created by the experimental manipulation: participants were asked to recall events from five years ago.  In the naturalistic scenario, where respondents are simply asked to report “how is your life these days,” they are unlikely to suddenly recall events from five years ago.   Similarly, if you were asked about your happiness with your last vacation, you would be unlikely to recall earlier vacations and contrast your most recent vacation with them.  Indeed, Suh et al. (1996) showed that life-satisfaction judgments are influenced by recent events and that older events do not have an effect. They found no evidence for contrast effects when participants were not asked to recall events from the distant past.  So, this research shows what can happen in a specific context in which participants were asked to recall extremely negative or positive events from their past, but without prompting by an experimenter this context would hardly ever occur.  Thus, this study has no ecological or external validity for the question of how participants actually make life-satisfaction judgments.

 

These experimental results are consistent with correlational data (Elder 1974) indicating that U.S. senior citizens, the “children of the Great Depression,” are more likely to report high subjective well-being the more they suffered under adverse economic conditions when they were adolescents. 

 

This finding again does not mean that elderly US Americans who suffered more during the Great Depression were actively thinking about the Great Depression when they answered questions about their well-being. It is more likely that they may have lower aspirations and expectations from life (see Easterlin). This means that we can interpret this result in many ways. One explanation would be that well-being judgments are subjective and that cultural and historic events can shape individuals’ evaluation standards of their lives.

 

[SUMMARY]

 

In combination, the reviewed research illustrates that the same life event may affect judgments of SWB in opposite directions, depending on its use in the construction of the target “my life now” and of a relevant standard of comparison.

 

Again, the word “may” makes this statement true. Many things may happen, but that tells us very little about what actually happens when respondents report on their well-being.  How past negative events can become positive events (a divorce was terrible, but it feels like a blessing after being happily remarried) and positive events can become negative events (e.g., the dream of getting tenure comes true, but doing research for life turns out to be less fulfilling than one anticipated) is an interesting topic for well-being research, but none of these evaluative reversals undermine the usefulness of well-being measures. In fact, well-being measures are needed to reveal that subjective evaluations have changed and that past evaluations may have carry-over effects on future evaluations.

 

It therefore comes as no surprise that the relationship between life events and judgments of SWB is typically weak. Today’s disaster can become tomorrow’s standard, making it impossible to predict SWB without a consideration of the mental processes that determine the use of accessible information.

 

Actually, the relationship between life events and well-being is not weak.  Lottery winners are happier and accident victims are unhappier.  And cross-cultural research shows that people do not simply get used to terrible life circumstances.  Starving is painful; it does not become a normal standard for well-being reports on day 2 or 3.  Most of the time, past events simply lose importance and are replaced by new events, and well-being measures are meant to cover a certain life period rather than an individual’s whole life from birth to death.  And because subjective evaluations are not just objective reports of life events, they depend on mental processes. The problem is that a research program that relies on experimental manipulations does not tell us about the mental processes that underlie life-satisfaction judgments when participants are not manipulated.

 

[WHAT MIGHT HAVE BEEN: COUNTERFACTUALS]

 

Counterfactual thinking can influence affect and subjective well-being in several ways (see Roese 1997; Roese and Olson 1995b).

 

Yes, it can, it may, and it might, but the real question is whether it does influence well-being reports and if so, how it influences these reports.

 

For example, winners of Olympic bronze medals reported being more satisfied than silver medalists (Medvec, Madey, and Gilovich 1995), presumably because for winners of bronze medals, it is easier to imagine having won no medal at all (a “downward counterfactual”), while for winners of silver medals, it is easier to imagine having won the gold medal (an “upward counterfactual”).

 

This is not an accurate summary of the article, which contained three studies.  Study 1 used ratings of video clips of Olympic medalists immediately after the event (23 silver & 18 bronze medalists).  The study showed a strong effect that bronze medalists were happier than silver medalists, F(1,72) = 18.98.  The authors also noted that in some events the silver medal means that an athlete lost a finals match, whereas in other events they just placed second in a field of 8 or more athletes.  An analysis that excluded final matches showed weaker evidence for the effect, F(1,58) = 6.70.  Most important, this study did not include subjective reports of satisfaction as claimed in the review article. Study 2 examined interviews of 13 silver and 9 bronze medalists.  Participants in Study 2 rated interviews of silver-medal athletes as containing more counterfactual statements (e.g., “I almost…”), t(20) = 2.37, p < .03.  Importantly, no results regarding satisfaction are reported. Study 3 actually recruited athletes for a study and had a larger sample size (N = 115). Participants were interviewed by the experimenters after they won a silver or bronze medal at an athletic competition (not the Olympics).   The description of the procedure is presented verbatim here.

 

Procedure. The athletes were approached individually following their events and asked to rate their thoughts about their performance on the same 10-point scale used in Study 2. Specifically, they were asked to rate the extent to which they were concerned with thoughts of “At least I . . .” (1) versus “I almost” (10). Special effort was made to ensure that the athletes understood the scale before making their ratings. This was accomplished by mentioning how athletes might have different thoughts following an athletic competition, ranging from “I almost did better” to “at least I did this well.”

 

What is most puzzling about this study is why the experimenters seemingly did not ask questions about emotions or satisfaction with performance.  It would have taken only a couple of questions to obtain reports that speak to the question of the article, namely whether winning a silver medal is subjectively better than winning a bronze medal.  Alas, these questions are missing. The only result from Study 3 is that, “as predicted, silver medalists’ thoughts following the competition were more focused on ‘I almost’ than were bronze medalists’.  Silver medalists described their thoughts with a mean rating of 6.8 (SD = 2.2), whereas bronze medalists assigned their thoughts an average rating of 5.7 (SD = 2.7), t(113) = 2.4, p < .02.”
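For readers who want to gauge how large this difference in counterfactual thoughts is, the standardized mean difference can be reconstructed from the reported means and standard deviations. The split into groups of 58 and 57 is my assumption; the article reports only the total N of 115.

```python
# Reconstruct Cohen's d and the t statistic from the reported summary statistics
# (M = 6.8, SD = 2.2 vs. M = 5.7, SD = 2.7). Roughly equal group sizes are assumed.
import numpy as np
from scipy import stats

m1, sd1, n1 = 6.8, 2.2, 58   # silver medalists (group split assumed)
m2, sd2, n2 = 5.7, 2.7, 57   # bronze medalists (group split assumed)

sd_pooled = np.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
d = (m1 - m2) / sd_pooled
t = d / np.sqrt(1 / n1 + 1 / n2)
p = 2 * stats.t.sf(abs(t), n1 + n2 - 2)

print(f"d = {d:.2f}, t({n1 + n2 - 2}) = {t:.2f}, p = {p:.3f}")
# roughly d = 0.45, t(113) = 2.4, p = .02 -- consistent with the reported values,
# a moderate difference in counterfactual thoughts, not a measure of satisfaction
```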

 

In sum, there is no evidence in this study that winning an Olympic silver medal or any other silver medal for that matter makes athletes less happy than winning a bronze medal. The misrepresentation of the original study by Schwarz and Strack is another example of unscientific practices that can lead to the fabrication of false facts that are difficult to correct and can have a lasting negative effect on the creation of a cumulative science.

 

In summary, judgments of SWB can be profoundly influenced by mental constructions of what might have been.

 

This statement is blatantly false. The cited study on medal winners does not justify this claim, and there is no scientific basis for the claim that these effects are profound.

 

In combination, the discussion in the preceding sections suggests that nearly any aspect of one’s life can be used in constructing representations of one’s “life now” or a relevant standard, resulting in many counterintuitive findings.

 

A collection of selective findings that were obtained using different experimental procedures does not mean that well-being reports obtained under naturalistic conditions produce many counterintuitive findings, nor is there any evidence that they do.  This statement lacks any empirical foundation and is inconsistent with other findings in the well-being literature.

 

Common sense suggests that misery that lasts for years is worse than misery that lasts only for a few days.

 

Indeed. Extended periods of severe depression can drive some people to attempt suicide. A week with the flu does not. Consistent with this common-sense observation, well-being reports of depressed people are much lower than those of other people, once more showing that well-being reports often produce results that are consistent with intuitions.

 

Recent research suggests, however, that people may largely neglect the duration of the episode, focusing instead on two discrete data points, namely, its most intense hedonic moment (“peak”) and its ending (Fredrickson and Kahneman 1993; Varey and Kahneman 1992). Hence, episodes whose worst (or best) moments and endings are of comparable intensity are evaluated as equally (un)pleasant, independent of their duration (for a more detailed discussion, see Kahneman, this volume).

 

Yes, but this research focuses on brief episodes with a single emotional event.  It is interesting that the duration of such episodes seems to matter very little, but life is a complex series of events and episodes. Having sex for 20 minutes or 30 minutes may not matter, but having sex regularly, at least once a week, does seem to matter for couples’ well-being.  As Diener et al. (1985) noted, it is the frequency, not the intensity (or duration), of positive and negative events in people’s lives that matters.

 

Although the data are restricted to episodes of short duration, it is tempting to speculate about the possible impact of duration neglect on the evaluation of more extended episodes.

 

Yes, interesting, but this statement clearly indicates that the research on duration neglect is not directly relevant for well-being reports.

 

Moreover, retrospective evaluations should crucially depend on the hedonic value experienced at the end of the respective episode.

 

This is a prediction, not a fact. I have actually examined this question and found that the frequency of positive and negative events during a day has a stronger influence on satisfaction judgments about that day than how respondents felt at the end of the day when they reported daily satisfaction.

 

[SUMMARY]

 

As our selective review illustrates, judgments of SWB are not a direct function of one’s objective conditions of life and the hedonic value of one’s experiences.

 

First, it is great that the authors acknowledge here that their review is selective.  Second, we do not need a review to know that subjective well-being is not a direct function of objective life conditions. The whole point of subjective well-being reports is to allow respondents to evaluate these events from their own subjective point of view.  And finally, at no point has this selective review shown that these reports do not depend on the hedonic value of one’s experiences. In fact, measures of hedonic experiences are strong predictors of life-satisfaction judgments (Schimmack et al., 2002; Lucas et al., 1996; Zou et al., 2012).

 

Rather they crucially depend on the information that is accessible at the time of judgment and how this information is used in constructing mental representations of the to-be-evaluated episode and a relevant standard.

 

This factual statement cannot be supported by a selective review of the literature. You cannot say, “My selective review of creationist literature shows that evolution theory is wrong.”  You can say that a selective review of creationist literature would suggest that evolution theory is wrong, but you cannot say that it is wrong. To make scientific statements about what is (highly probably) true and what is (highly probably) false, you need to conduct a review of the evidence that is not selective and not biased.

 

As a result of these construal processes, judgments of SWB are highly malleable and difficult to predict on the basis of objective conditions. 

 

This is not correct.  Evaluations do not directly depend on objective conditions; this is not a feature of well-being reports but a feature of evaluations.  At the same time, the construal processes that relate objective events to subjective well-being are systematic, predictable, and depend on chronically accessible and stable information.  Well-being reports are highly correlated with objective characteristics of nations; bereavement, unemployment, and divorce have negative effects on well-being; and winning the lottery, marriage, and remarriage have positive effects on well-being.  Schwarz and Strack are fabricating facts. This is not considered fraud, because only manipulating and fabricating data is considered scientific fraud, but this does not mean that fabricated facts are less harmful than fabricated data.  Science can only provide a better understanding if it is based on empirically verified and replicable facts. Simply stating “judgments of SWB are difficult to predict” without providing any evidence for this claim is unscientific.

 

[USING INFORMATION ABOUT OTHERS: SOCIAL COMPARISONS]

 

The causal impact of comparison processes has been well supported in laboratory experiments that exposed respondents to relevant comparison standards…For example, Strack and his colleagues (1990) observed that the mere presence of a handicapped confederate was sufficient to increase reported SWB under self-administered questionnaire conditions, presumably because the confederate served as a salient standard of comparison….As this discussion indicates, the impact of social comparison processes on SWB is more complex than early research suggested. As far as judgments of global SWB are concerned, we can expect that exposure to someone who is less well off will usually result in more positive-and to someone who is better off in more negative assessments of one’s own life.  However, information about the other’s situation will not always be used as a comparison standard.

The whole section about social comparison does not really address the question of the influence of social comparison on well-being reports.  Only a single study with a small sample is used to provide evidence that respondents may engage in social comparison processes when they report their well-being.  The danger of this occurring in a naturalistic context is rather slim.  Even in face-to-face interviews, the respondent is likely to have answered several questions about themselves, and it seems far-fetched that they would suddenly think about the interviewer as a relevant comparison standard, especially if the interviewer does not have a salient characteristic like a disability that may be considered relevant. Once more the authors generalize from one very specific laboratory experiment to the naturalistic context in which SWB reports are normally made, without considering the possibility that the experimental results are highly context-sensitive and do not reveal how respondents normally judge their lives.

[Standards Provided by the Social Environment]

In combination, these examples draw attention to the possibility that salient comparison standards in one’s immediate environment, as well as socially shared norms, may constrain the impact of fortuitous temporary influences. At present, the interplay of chronically and temporarily accessible standards on judgments of SWB has received little attention. The complexities that are likely to result from this interplay provide a promising avenue for future research.

Here the authors acknowledge that their program of research is limited and fails to address how respondents use chronically accessible information. They suggest that this is a promising avenue for future research, but they fail to acknowledge why they have not conducted studies that start to address this question. The reason is that their research program, with its experimental manipulations of the situation, does not allow them to study the use of chronically accessible information.  The use of information that by definition comes to mind spontaneously, independent of researchers’ experimental manipulations, is a blind spot of the experimental approach.

[Interindividual Standards Implied by the Research Instrument]

Finally, we extend our look at the influences of the research instrument by addressing a frequently overlooked source of temporarily accessible comparison information…As numerous studies have indicated (for a review, see Schwarz 1996), respondents assume that the list of response alternatives reflects the researcher’s knowledge of the distribution of the behavior: they assume that the “average” or “usual” behavioral frequency is represented by values in the middle range of the scale, and that the extremes of the scale correspond to the extremes of the distribution. Accordingly, they use the range of the response alternatives as a frame of reference in estimating their own behavioral frequency, resulting in different estimates of their own behavioral frequency, as shown in table 4.2. More important for our present purposes, they further extract comparison information from their low location on the scale…Similar findings have been obtained with regard to the frequency of physical symptoms and health satisfaction (Schwarz and Scheuring 1992), the frequency of sexual behaviors and marital satisfaction (Schwarz and Scheuring 1988), and various consumer behaviors (Menon, Raghubir, and Schwarz 1995).

One study is in German and not available.  I examined the study by Schwarz and Scheuring (1988) in the European Journal of Social Psychology.   Study 1 had four conditions with n = 12 or 13 per cell (N = 51).  The response format varied frequencies so that having sex or masturbating once a week was either a high- or low-frequency occurrence.  Subsequently, participants reported their relationship satisfaction. The relationship satisfaction ratings were analyzed with an ANOVA:  “Analysis of variance indicates a marginally reliable interaction of both experimental variables, F(1,43) = 2.95, p < 0.10, and no main effects.”  The result is not significant by conventional standards, and the degrees of freedom show that some participants were excluded from this analysis without further mention of this fact.  Study 2 manipulated the response format for frequency of sex and masturbation within subjects. That is, all subjects were asked to rate frequencies of both behaviors in four different combinations. There were n = 16 per cell, N = 64. No ANOVA is reported, presumably because it was not significant. However, a PLANNED contrast between the high sex/low masturbation and the low sex/high masturbation group showed a just-significant result, t(58) = 2.17, p = .034. Again, the degrees of freedom do not match the sample size. In conclusion, the evidence that subtle manipulations of response formats can lead to social comparison processes that influence well-being reports is not conclusive. Replication studies with larger samples would be needed to show that these effects are replicable and to determine how strong they are.
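The p-values can be recomputed from the reported test statistics and degrees of freedom (which, as noted, do not match the stated sample sizes); a minimal check:

```python
# Recompute p-values from the test statistics reported for Schwarz & Scheuring (1988).
from scipy import stats

# Study 1: interaction F(1, 43) = 2.95, described as "marginally reliable"
p_f = stats.f.sf(2.95, 1, 43)
print(f"F(1,43) = 2.95  ->  p = {p_f:.3f}")   # about .09, not significant at .05

# Study 2: planned contrast t(58) = 2.17
p_t = 2 * stats.t.sf(2.17, 58)
print(f"t(58) = 2.17    ->  p = {p_t:.3f}")   # about .034, just below .05
```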

In combination, they illustrate that response alternatives convey highly salient comparison standards that may profoundly affect subsequent evaluative judgments.

Once more, the word “may” makes the statement true in the trivial sense that many things may happen. However, there is no evidence that these effects actually have profound effects on well-being reports, and the existing studies show statistically weak evidence and provide no information about the magnitude of these effects.

Researchers are therefore well advised to assess information about respondents’ behaviors or objective conditions in an open-response format, thus avoiding the introduction of comparison information that respondents would not draw on in the absence of the research instrument.

There is no evidence that this would improve the validity of frequency reports and research on sexual frequency shows similar results with open and closed measures of sexual frequency (Muise et al., 2016).

[SUMMARY]

In summary, the use of interindividual comparison information follows the principle of cognitive accessibility that we have highlighted in our discussion of intraindividual comparisons. Individuals often draw on the comparison information that is rendered temporarily accessible by the research instrument or the social context in which they form the judgment, although chronically accessible standards may attenuate the impact of temporarily accessible information.

The statement that people often rely on interpersonal comparison standards is not justified by the research.  By design, experiments that manipulate one type of information and make it salient cannot determine how often participants use this type of information when it is not made salient.

[THE IMPACT OF MOOD STATES]

In the preceding sections, we considered how respondents use information about their own lives or the lives of others in comparison-based evaluation strategies. However, judgments of well-being are a function not only of what one thinks about but also of how one feels at the time of judgment.

Earlier, the authors stated that respondents are likely to use the minimum of information that is deemed sufficient: “Instead, they truncate the search process as soon as enough information has come to mind to form a judgment with sufficient subjective certainty (Bodenhausen and Wyer 1987).”  Now we are supposed to believe that they use intrapersonal and interpersonal information that is temporarily and chronically accessible, plus their feelings.  That is a lot of information, and it is not clear how all of it is combined into a single judgment. A more parsimonious explanation for the host of findings is that each experiment carefully created a context that nudged respondents toward the information the experimenters wanted them to use, thereby confirming the hypothesis that they use this information. The problem is that this only shows that a particular source of information may be used in one particular context. It does not mean that all of these sources of information are used and need to be integrated into a single judgment under naturalistic conditions. The program of research simply fails to address the question of which information respondents actually use when they are asked to judge their well-being in a normal context.

A wide range of experimental data confirms this intuition. Finding a dime on a copy machine (Schwarz 1987), spending time in a pleasant rather than an unpleasant room (Schwarz et al. 1987, Experiment 2), or watching the German soccer team win rather than lose a championship game (Schwarz et al. 1987, Experiment 1) all resulted in increased reports of happiness and satisfaction with one’s life as a whole…Experimental evidence supports this assumption. For example, Schwarz and Clore (1983, Experiment 2) called respondents on sunny or rainy days and assessed reports of SWB in telephone interviews. As expected, respondents reported being in a better mood, and being happier and more satisfied with their life as a whole, on sunny rather than on rainy days. Not so, however, when respondents’ attention was subtly drawn to the weather as a plausible cause of their current feelings.

The problem is that all of the cited studies were conducted by Schwarz and that other studies that produced different results are not mentioned.  The famous weather study has recently been called into question.  Weather is also not an ideal manipulation for studying mood effects on life-satisfaction judgments because weather effects on mood are not very strong either.  Respondents in sunny California do not report higher life-satisfaction than respondents in Ohio (Schkade & Kahneman), and several large-scale studies have now failed to replicate the famous weather effect on well-being reports (Lucas & Lawless, 2013; Schmiedeberg, 2014).

On theoretical grounds, we may assume that people are more likely to use the simplifying strategy of consulting their affective state the more burdensome it would be to form a judgment on the basis of comparison information.

Here it is not clear why it would be burdensome to make global life-satisfaction judgments. The preceding sections suggested that respondents have access to large amounts of chronically and temporarily accessible information that they apparently used in the previous studies. Suddenly, it is claimed that retrieving relevant information is too hard and mood is used instead. It is not clear why respondents would consider their current mood sufficient to evaluate their lives, especially if inconsistent accessible information also comes to mind.

Note in this regard that evaluations of general life satisfaction pose an extremely complex task that requires a large number of comparisons along many dimensions with ill-defined criteria and the subsequent integration of the results of these comparisons into one composite judgment. Evaluations of specific life domains, on the other hand, are often less complex.

If evaluations of specific life domains are less complex and global questions are just an average of specific domains, it is not clear why it would be so difficult to evaluate satisfaction in a few important life domains (health, family, work) and integrate this information.  The hypothesis that mood is only used as a heuristic for global well-being reports also suggests that it would be possible to avoid the use of this heuristic by asking participants to report satisfaction with specific life domains. As these questions are supposed to be easier to answer, participants would not use mood. Moreover, preceding items are less likely to make information accessible that is relevant for a specific life domain.  For example, a dating question is irrelevant for academic or health satisfaction.  Thus, participants are most likely to draw on chronically accessible information that is relevant for answering a question about satisfaction with a specific domain. It follows that averages of domain satisfaction judgments would be more valid than global judgments if participants were relying on mood to make global judgments. For example, finding a dime would make people judge their lives more positively, but not their health, social relationships, and income.  Thus, many of the alleged problems with global well-being reports could be avoided by asking for domain-specific reports and then aggregating them (Andrews & Withey, 1976; Zou et al., 2013).

If judgments of general well-being are based on respondents’ affective state, whereas judgments of domain satisfaction are based on comparison processes, it is conceivable that the same event may influence evaluations of one’s life as a whole and evaluations of specific domains in opposite directions. For example, an extremely positive event in domain X may induce good mood, resulting in reports of increased global SWB. However, the same event may also increase the standard of comparison used in evaluating domain X, resulting in judgments of decreased satisfaction with this particular domain. Again, experimental evidence supports this conjecture. In one study (Schwarz et al. 1987, Experiment 2), students were tested in either a pleasant or an unpleasant room, namely, a friendly office or a small, dirty laboratory that was overheated and noisy, with flickering lights and a bad smell. As expected, participants reported lower general life satisfaction in the unpleasant room than in the pleasant room, in line with the moods induced by the experimental rooms. In contrast, they reported higher housing satisfaction in the unpleasant than in the pleasant room, consistent with the assumption that the rooms served as salient standards of comparison.

The evidence here is a study with 22 female students assigned to two conditions (n = 12 and 10 per condition).  The 2 x 2 ANOVA with room (pleasant vs. unpleasant) and satisfaction judgment (life vs. housing) produced a significant interaction of measure and room, F(1,20) = 7.25, p = .014.  The effect for life-satisfaction was significant, F(1,20) = 8.02, p = .010 (reported as p < .005), and not significant for housing satisfaction, F(1,20) = 1.97, p = .18 (reported as p < .09 one-tailed).
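The discrepancy between the reported and recomputed p-values is easy to verify from the F statistics and their degrees of freedom; a minimal check:

```python
# Verify the recomputed p-values for Schwarz et al. (1987, Experiment 2), N = 22.
from scipy import stats

tests = {
    "interaction (measure x room)": (7.25, 1, 20),
    "life satisfaction":            (8.02, 1, 20),   # reported as p < .005
    "housing satisfaction":         (1.97, 1, 20),   # reported as p < .09 one-tailed
}

for label, (F, df1, df2) in tests.items():
    p = stats.f.sf(F, df1, df2)
    print(f"{label}: F({df1},{df2}) = {F}  ->  p = {p:.3f}")
# life satisfaction: p is about .010, not < .005;
# housing satisfaction: p is about .18 two-tailed (about .09 one-tailed)
```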

This weak evidence from a single study with a very small sample is used to conclude that life-satisfaction judgments and domain satisfaction judgments may diverge.  However, numerous studies have shown high correlations between average domain satisfaction judgments and global life-satisfaction judgments (Andrews & Withey, 1976; Schimmack & Oishi, 2005; Zou et al., 2013).  This finding could not occur if respondents used mood for life-satisfaction judgments and other information for domain satisfaction judgments.  Yet readers are not informed about this finding, which undermines Schwarz and Strack’s model of well-being reports and casts doubt on the claim that the same information has opposite effects on global life-satisfaction judgments and domain-specific judgments. This may happen in highly artificial laboratory conditions, but it does not happen often in normal survey contexts.

The Relative Salience of Mood and Competing Information

If recalling a happy or sad life event elicits a happy or sad mood at the time of recall, however, respondents are likely to rely on their feelings rather than on recalled content as a source of information. This overriding impact of current feelings is likely to result in mood-congruent reports of SWB, independent of the mental construal variables discussed earlier. The best evidence for this assumption comes from experiments that manipulated the emotional involvement that subjects experienced while thinking about past life events.

This section introduces a qualification of the earlier claim that recall of events in the remote past leads to a contrast effect.  Here the claim is that recalling a positive event from the remote past (a happy time with a deceased spouse) will not lead to a contrast effect (intensifying the dissatisfaction of a bereaved person) if the recall of the event triggers an actual emotional experience (my life these days is good because I feel good when I think about the good times in the past).  The problem with this theory is that it is inconsistent with the earlier claim that people will discount their current feelings if they think they are irrelevant. If respondents do not use mood to judge their lives when they attribute it to the weather, why would they use feelings that are triggered by recall of an emotional event from their past?  Why would a widower evaluate his current life as a widower more favorably when he is recalling the good times with his wife?

Even if this were a reliable finding, it would be practically irrelevant for actual ratings of life-satisfaction because respondents are unlikely to recall specific events in sufficient detail to elicit strong emotional reactions.  The studies that demonstrated the effect instructed participants to do so, but under normal circumstances participants make judgments very quickly, often without recalling detailed, specific emotional episodes.  In fact, even the studies that showed these effects provided only weak evidence that recall of emotional events had notable effects on mood (Strack et al., 1985).

[REPORTING THE JUDGMENT]

Self-presentation and social desirability concerns may arise at the reporting stage, and respondents may edit their private judgment before they communicate it.

True. All subjective ratings are susceptible to reporting styles. This is why it is important to corroborate self-ratings of well-being with other evidence such as informant ratings of well-being.  However, the problem of reporting biases would be irrelevant, if the judgment without these biases is already valid. A large literature on reporting biases in general shows that these biases account for a relatively small amount of the total variance in ratings. Thus, the key question remains whether the remaining variance provides meaningful information about respondents’ subjective evaluations of their lives or whether this variance reflects highly unreliable and context-dependent information that has no relationship to individuals’ subjective well-being.

[A JUDGMENT MODEL OF SUBJECTIVE WELL-BEING]

Figure 4.2 summarizes the processes reviewed in this chapter. If respondents are asked to report their happiness and satisfaction with their “life as a whole,” they are likely to base their judgment on their current affective state; doing so greatly simplifies the judgmental task.

As noted before, this would imply that global well-being reports are highly unstable and strongly correlated with measures of current mood, but the empirical evidence does not support these predictions.  Current mood has a small effect on global well-being reports (Eid & Diener, 2004) and they are highly stable (Schimmack & Oishi, 2005) and predicted by personality traits even when these traits are measured a decade before the well-being reports (Costa & McCrae, 1980).

If the informational value of their affective state is discredited, or if their affective state is not pronounced and other information is more salient, they are likely to use a comparison strategy. This is also the strategy that is likely to be used for evaluations of less complex specific life domains.

Schwarz and Strack’s model would allow for weak mood effects. We only have to make the plausible assumption that respondents often have other information to judge their lives and that they find this information more relevant than their current feelings.  Therefore, this first stage of the judgment model is consistent with evidence that well-being judgments are only weakly correlated with mood and highly stable over time.

When using a comparison strategy, individuals draw on the information that is chronically or temporarily most accessible at that point in time. 

Apparently the term “comparison strategy” is now used to refer to the retrieval of any information rather than an active comparison that takes place during the judgment process.  Moreover, it is suddenly considered equally plausible that participants draw on chronically accessible information or on temporarily accessible information.  While the authors did not review evidence that would support the use of chronically accessible information, their model clearly allows for it.

Whether information that comes to mind is used in constructing a representation of the target  “my life now” or a representation of a relevant standard depends on the variables that govern the use of information in mental construal (Schwarz and Bless 1992a; Strack 1992). 

This passage suggests that participants have to go through the process of evaluating their life each time they are asked to make a well-being report. They have to construct what their life is like, determine what they want from life, and make a comparison. However, it is also possible that they can draw on previous evaluations of life domains (e.g., I hate my job, I am healthy, I love my wife). As life-satisfaction judgments are made rather quickly, within a few seconds, it seems more plausible that some pre-established evaluations are retrieved than that complex comparison processes take place at the time of judgment.

If the accessibility of information is due to temporary influences, such as preceding questions in a questionnaire, the obtained judgment is unstable over time and a different judgment will be obtained in a different context.

This statement makes it obvious that retest correlations provide direct evidence on the use of temporarily accessible information.  Importantly, low retest stability could be caused by several factors (e.g. random responding).  So, we cannot verify that participants rely on temporarily accessible information when retest correlations are low. However, we can use high retest stability to falsify the hypothesis that respondents rely heavily on temporarily accessible information because the theory makes the opposite prediction.  It is therefore highly relevant that retest correlations show high temporal consistency in global well-being reports.  Based on this solid empirical evidence we can infer that responses are not heavily influenced by temporarily accessible information (Schimmack & Oishi, 2005).

On the other hand, if the accessibility of information reflects chronic influences such as current concerns or life tasks, or stable characteristics of the social environment, the judgment is likely to be less context dependent.

This implies that high retest correlations are consistent with the use of chronically accessible information, but high retest correlations do not prove that participants use chronically accessible information. It is also possible that stable variance is due to reporting styles. Thus, other information is needed to test the use of chronically accessible information. For example, agreement in well-being reports by several raters (self, spouse, parent, etc.) cannot be attributed to response styles and shows that different raters rely on the same chronically accessible information to provide well-being reports (Schneider & Schimmack, 2012).

The size of context-dependent assimilation effects increases with the amount and extremity of the temporarily accessible information that is included in the representation of the target. 

This part of the model would explain why experiments and naturalistic studies often produce different results. Experiments make temporarily accessible information extremely salient, which may lead participants to use it. In contrast, such extremely salient information is typically absent in naturalistic studies, which explains why chronically accessible information is used. The results are only inconsistent if results from experiments with extreme manipulations are generalized to normal contexts without these extreme conditions.

[METHODOLOGICAL IMPLICATIONS]

Our review emphasizes that reports of well-being are subject to a number of transient influences. 

This is correct. The review emphasized evidence from the authors’ experimental research that showed potential threats to the validity of well-being judgments. The review did not examine how serious these threats are for the validity of well-being judgments.

Although the information that respondents draw on reflects the reality in which they live, which aspects of this reality they consider and how they use these aspects in forming a judgment is profoundly influenced by features of the research instrument.

This statement is blatantly false.  The reviewed evidence suggests that the testing situation (a confederate, a room) or an experimental manipulation (recall positive or negative events) can influence well-being reports. There was very little evidence that the research instrument influenced well-being reports and there was no evidence that these effects are profound.

[Implications for Survey Research]

The reviewed findings have profound methodological implications.

This is wrong. The main implication is that researchers have to consider a variety of potential threats to the validity of well-being judgments. All of these threats can be reduced and many survey studies do take care to avoid some of these potential problems.

First, the obtained reports of SWB are subject to pronounced question-order effects because the content of preceding questions influences the temporary accessibility of relevant information.

As noted earlier, this was only true in two studies by the authors. Other studies do not replicate this finding.

Moreover, questionnaire design variables, like the presence or absence of a joint lead-in to related questions, determine how respondents use the information that comes to mind. As a result, mean reported well-being may differ widely, as seen in many of the reviewed examples.

The dramatic shifts in means are limited to experimental studies that manipulated lead-ins to demonstrate these effects. National representative surveys show very similar means year after year.

Moreover, the correlation between an objective condition of life (such as dating frequency) and reported SWB can run anywhere from r = -.1 to r = .6, depending on the order in which the same questions are asked (Strack et al. 1988), suggesting dramatically different substantive conclusions.

Moreover?  This statement just repeats the first false claim that question order has profound effects on life-satisfaction judgments.

Second, the impact of information that is rendered accessible by preceding questions is attenuated the more the information is chronically accessible (see Schwarz and Bless 1992a).

So, how can we see pronounced item-order effects for marital satisfaction if marital satisfaction is a highly salient and chronically accessible aspect of married people’s lives? This conclusion directly undermines the previous claim that item order has profound effects.

Third, the stability of reports of SWB over time (that is, their test-retest reliability) depends on the stability of the context in which they are assessed. The resulting stability or change is meaningful when it reflects the information that respondents spontaneously consider because the same, or different, concerns are on their mind at different points in time. 

There is no support for this claim. If participants draw on chronically accessible information, which the authors’ model allows, the judgments do not depend on the stability of the context because chronically accessible information is by definition context-independent.

Fourth, in contrast to influences of the research instrument, influences of respondents’ mood at the time of judgment are less likely to result in systematic bias. The fortuitous events that affect one respondent’s mood are unlikely to affect the mood of many others.

This is true, but it would still undermine the validity of the judgments.  If participants rely on their current mood, variation in their responses will be unreliable, and unreliable measures are by definition invalid. Moreover, the average mood of participants during the time of a survey is also not a valid measure of average well-being. So, even though mood effects may not be systematic, they would undermine the validity of well-being reports. Fortunately, there is no evidence that mood has a strong influence on these judgments, while there is evidence that participants draw on chronically accessible information from important life domains (Schimmack & Oishi, 2005).

Hence, mood effects are likely to introduce random variation.

Yes, this is a correct prediction, but evidence contradicts this prediction, and the correct conclusion is that mood does not introduce a lot of random variation in well-being reports because it is not heavily used by respondents to evaluate their lives or specific aspects of their lives.

Fifth, as our review indicates, there is no reason to expect strong relationships between the objective conditions of life and subjective assessments of well-being under most circumstances.

There are many reasons not to expect strong correlations between life-events and well-being reports. One reason is that a single event is only a small part of a whole life and that few life events have such dramatic effects on life-satisfaction that they make any notable contribution to life-satisfaction judgments.  Another reason is that well-being is subjective and the same life event can be evaluated differently by different individuals. For example, the publication of this review in a top journal in psychology would have different effects on my well-being and on the well-being of Schwarz and Strack.

Specifically, strong positive relationships between a given objective aspect of life and judgments of SWB are likely to emerge when most respondents include the relevant aspect in the representation that they form of their life and do not draw on many other aspects. This is most likely to be the case when (a) the target category is wide (“my life as a whole”) rather than narrow (a more limited episode, for example); (b) the relevant aspect is highly accessible; and (c) other information that may be included in the representation of the target is relatively less accessible. These conditions were satisfied, for example, in the Strack, Martin, and Schwarz (1988) dating frequency study, in which a question about dating frequency rendered this information highly accessible, resulting in a correlation of r = .66 with evaluations of the respondent’s life as a whole. Yet, as this example illustrates, we would not like to take the emerging correlation seriously when it reflects only the impact of the research instrument, as indicated by the fact that the correlation was r = -.1 if the question order was reversed.

The unrepresentative extreme result from Strack’s study is used again as evidence, when other studies do not show the effect (Schimmack & Oishi, 2005).

Finally, it is worth noting that the context effects reviewed in this chapter limit the comparability of results obtained in different studies. Unfortunately, this comparability is a key prerequisite for many applied uses of subjective social indicators, in particular their use in monitoring the subjective side of social change over time (for examples see Campbell 1981; Glatzer and Zapf 1984).

This claim is incorrect. The experimental demonstrations of context effects under artificial conditions that were designed to manipulate judgment processes do not have direct implications for the way participants actually judge their well-being. The authors' own model allows for chronically accessible information to have a strong influence on these judgments under less extreme and less artificial conditions, and the model makes predictions that are disconfirmed by evidence of high stability and low correlations with mood.

Which Measures Are We to Use?

By now, most readers have probably concluded that there is little to be learned from self-reports of global well-being.

If so, the authors have succeeded with their biased presentation of the evidence in convincing readers that these reports are highly susceptible to a host of context effects that make the outcome of the judgment process essentially unpredictable. Readers would be surprised to learn that the well-being reports of twins who never met are positively correlated (Lykken & Tellegen, 1996).

Although these reports do reflect subjectively meaningful assessments, what is being assessed, and how, seems too context dependent to provide reliable information about a population’s well-being, let alone information that can guide public policy (but see Argyle, this volume, for a more optimistic take).

The claim that well-being reports are too context dependent to provide reliable information about a population’s well-being is false for several reasons.  First, the authors did not show that well-being reports are context dependent. They showed that with very extreme manipulations in highly contrived and unrealistic contexts, judgments moved around statistically significantly in some studies.  They did not show that these shifts are large, as that would require larger samples to estimate effect sizes. They did not show that these effects have a notable influence on well-being reports in actual surveys of populations’ well-being.  Moreover, the authors themselves pointed out that some of these effects (e.g., mood effects) would only add random noise, which would lower the reliability of individuals’ well-being reports but, when aggregated across respondents, would not alter the mean of a sample. And last, but not least, the authors blatantly ignore evidence (reviewed in this volume by Diener and colleagues) that nationally representative samples show highly reliable differences between nations, differences that are correlated with objective life circumstances such as nations’ wealth.

In short, Schwarz and Strack’s claims are not scientifically founded and merely express the authors’ pessimistic take on the validity of well-being reports.  This pessimistic view is a direct consequence of a myopic focus on laboratory experiments that were designed to invalidate well-being reports and ignoring evidence from actual well-being surveys that are more suitable to examine the reliability and validity of well-being reports when well-being reports are provided under naturalistic conditions.

As an alternative approach, several researchers have returned to Bentham’s (1789/1948) notion of happiness as the balance of pleasure over pain (for examples, see Kahneman, this volume; Parducci 1995).

This statement ignores the important contribution of Diener (1984), who argued that the concept of well-being may consist of life evaluations as well as the balance of pleasure over pain, or Positive Affect and Negative Affect, as these constructs are called in contemporary psychology. As a result of Diener’s (1984) conception of well-being as a construct with three components, researchers have routinely measured global life-evaluations along with measures of positive and negative affect. A key finding is that these measures are highly correlated, although not identical (Lucas et al., 1996; Zou et al., 2013).  Schwarz and Strack ignore this evidence, presumably because it would undermine their view that global life-satisfaction judgments are highly context sensitive and that measures of positive and negative affect could produce notably different results.

END OF REVIEW: CONCLUSIONS

In conclusion, Schwarz and Strack’s (1999) chapter is a prototypical example of several bad scientific practices.  First, the authors conduct a selective review of the literature that focuses on one specific paradigm and ignores evidence from other approaches.  Second, the review focuses strongly on original studies conducted by the authors themselves and ignores studies by other researchers that produced different results. Third, the original studies are often obtained with small samples and there are no independent replications by other researchers, but the results are discussed as if they are generalizable.  Fourth, life-satisfaction judgments are influenced by a host of factors and any study that focuses on one possible predictor of these judgments is likely to account for only a small amount of the variance. Yet, the literature review does not take effect sizes into account and the theoretical model overemphasizes the causes that were studied and ignores causes that were not studied.  Fifth, the experimental method has the advantage of isolating single causes, but it has the disadvantage that results cannot be generalized to ecologically valid contexts in which well-being reports are normally obtained. Nevertheless, the authors generalize from artificial experiments to the typical survey context without examining whether their predictions are confirmed.  Finally, the authors make broad and profound claims that do not logically follow from their literature review. They suggest that decades of research with global well-being reports can be dismissed because the measures are unreliable, but these claims are inconsistent with a mountain of evidence that shows the validity of these measures that the authors willfully ignore (Diener et al., 2009).

Unfortunately, the claims in this chapter were used by Nobel Laureate Daniel Kahneman as arguments to push for an alternative conception and measurement of well-being.  In combination, the unscientific review of the literature and the political influence of a Nobel Prize have had a negative influence on well-being science.  The biggest damage to the field has been the illusion that the processes underlying global well-being reports are well-understood. In fact, we know very little about how respondents make these judgments and how accurate these judgments are.  The chapter lists a number of possible threats to the validity of well-being reports, but it is not clear how much these threats actually undermine the validity of well-being reports and what can be done to reduce biases in these measures to improve their validity.  A program that relies exclusively on experimental manipulations that create biases in well-being reports is unable to answer these questions because well-being judgments can be made in numerous ways and results that are obtained in artificial laboratory contexts may or may not generalize to the context that is most relevant, namely when well-being reports are used to measure well-being.

What is needed is a real scientific program of research that examines accuracy and biases in well-being reports and creates well-being measures that maximize accuracy and minimize biases. This is what all other sciences do when they develop measures of theoretical important constructs. It is time for well-being researchers to act like a normal science. To do so, research on well-being reports needs a fresh start and needs an objective and scientific review of the empirical evidence regarding the validity of well-being measures.

Dr. R responds to Finkel, Eastwick, & Reis (FER)’s article “Replicability and Other Features of a High-Quality Science: Toward a Balanced and Empirical Approach”

My response is organized as a commentary on key sections of the article. The sections of the article are direct quotations to give readers quick and easy access to FER’s arguments and conclusions, followed by my comments.  The quotations are printed in bold.

Here, we extend FER2015’s analysis to suggest that much of the discussion of best research practices since 2011 has focused on a single feature of high-quality science—replicability—with insufficient sensitivity to the implications of recommended practices for other features, like discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.

I see replicability as being equivalent to the concept of reliability in psychological measurement.  Reliability is necessary for validity: a measure needs to be reliable to produce valid results, and this includes internal validity and external validity. Valid results, in turn, are needed to create a solid body of research that provides the basis for a cumulative science.

Take life-satisfaction judgments as an example. In a review article, Schwarz and Strack (1999) claimed that life-satisfaction judgments are unreliable, extremely sensitive to context, and that responses can change dramatically as a function of characteristics of the survey questions.  Do we think a measure with low reliability can be used to study well-being and to build a cumulative science of well-being? No. It seems self-evident that reliable measures are better than unreliable measures.

The reason why some measures are not reliable is that scores on the measure are influenced by factors that are difficult or too expensive to control.  As a result, these factors have an undesirable effect on responses. The effect is not systematic, or it is too difficult to study the systematic effects, and therefore results will change randomly when the same measure is used again and again.  We can assess the influence of these random factors by administering the same measurement procedure again and again and seeing how much scores change (in the absence of real change).

The same logic applies to replicability.  Replicability means that we get the same result if we repeat a study again and again.  Just like scores on a psychological measure can change, the results of even exact replication studies will not be the same. The reason is the same: random factors that are outside the control of the experimenter influence the results that are obtained in a single study.  Hence, we cannot expect that exact replication studies will always produce the same results.  For example, the gender ratio in a psychology class will not be the same year after year, even if there is no real change in the gender ratio of psychology students over time.

So what does it even mean for a result to be replicable; that is, for a replication study to produce the same result as the original study?  It depends on the interpretation of the results of an original study.  A professor interested in the gender composition of psychology could compute the gender ratio for each year. The exact number would vary from year to year.  However, the researcher could also compute a 95% confidence interval around these numbers.  This interval specifies the amount of variability that is expected by chance.  We may then say that a study is replicable if subsequent studies produce results that are compatible with the 95% confidence interval of the original study.  In contrast, low replicability would mean that results vary dramatically from study to study.  For example, in one year the gender ratio is 70% female (+/- 10%, 95% CI), in the next year it is 25% female (again +/- 10%), and the following year it is 99% (+/- 10%).  In this case, the gender ratio jumps around dramatically, the result from one study cannot be used to predict gender ratios in other years, and the data provide no solid empirical foundation for theories of the effect of gender on interest in psychology.
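
To make this replicability criterion concrete, here is a minimal R sketch. The class numbers are hypothetical (not taken from an actual course): it computes the 95% confidence interval for one year's gender ratio and checks whether the next year's estimate falls inside it.

# hypothetical example: 70 women among 100 students in year 1
year1 <- binom.test(70, 100)
year1$conf.int                  # exact 95% CI, roughly .60 to .79
# under the criterion above, year 2 "replicates" year 1 if its estimate
# falls inside year 1's confidence interval
p_year2 <- 65 / 100
p_year2 > year1$conf.int[1] & p_year2 < year1$conf.int[2]   # TRUE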

Using this criterion of replicability, many results in psychology are highly replicable.  The problem is that, using this criterion, many results in psychology are also not very informative because effect sizes tend to be relatively small compared to the width of confidence intervals (Cohen, 1994).  With a standardized effect size of d = .4 and a typical confidence interval width of about 1 in d units (se = .25), the typical finding in psychology is that the effect size ranges from d = -.1 to d = .9. This means the typical result is consistent with a small effect in the opposite direction from the one in the sample (chocolate eating leads to weight gain, even if my study shows that chocolate eating leads to weight loss) and with very large effects in the same direction (chocolate eating is a highly effective way of losing weight). Most important, the result is also consistent with the null-hypothesis (chocolate eating has no effect on weight, which in this case would be a sensational and important finding that would make Willy Wonka very happy).  I hope this example makes the point that it is not very informative to conduct studies of small effect sizes with wide confidence intervals because we do not learn much from these studies. Mostly, we are not more informed about a research question after looking at the data than we were before we looked at the data.
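
The interval in this example is just the point estimate plus or minus about two standard errors; in R, using the d = .4 and se = .25 values from the text:

d  <- .4     # standardized effect size
se <- .25    # standard error of d (a CI width of about 1 in d units)
round(d + c(-1, 1) * qnorm(.975) * se, 2)   # 95% CI: -0.09 to 0.89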

Not surprisingly, psychology journals do not publish findings like d = .2 +/- .8.  The typical criterion for reporting a newsworthy result is that the confidence interval falls into one of two regions: the region of effect sizes less than zero or the region of effect sizes greater than zero.  If the 95% CI falls in one of these two regions, it is possible to say that there is only a maximum error rate of 5% when we infer from a confidence interval in the positive region that the actual effect size is positive, and from a confidence interval in the negative region that the actual effect size is negative.  In other words, it wasn’t just random factors that produced a positive effect in a sample when the actual effect size is 0 or negative, and it wasn’t just random factors that produced a negative effect when the actual effect size is 0 or positive.  To examine whether the results of a study provide sufficient information to claim that an effect is real and not just due to random factors, researchers compute p-values and check whether the p-value is less than 5%.

If the original study reported a significant result to make inferences about the direction of an effect, and replicability is defined as obtaining the same result, replicability means that we obtain a significant result again in the replication study.  The famous statistician Sir Ronald Fisher made replicability a criterion for a good study: “A properly designed experiment rarely fails to give … significance” (Fisher, 1926, p. 504).

What are the implications of replication studies that do not replicate a significant result?  These studies are often called failed replication studies, but this term is unfortunate because the study was not a failure.  We might want to call these studies unsuccessful replication studies, although I am not sure this term is much better.  The problem with unsuccessful replication studies is that there are a host of reasons why a replication study might fail.  This means that additional research is needed to uncover why the original study and the replication study produced different results. In contrast, if a series of studies produces significant results, it is highly likely that the result is a real finding and can be used as an empirical foundation for theories.  For example, the gender ratio in my PSY230 course is always significantly different from the 50/50 split that we might expect if both genders were equally interested in psychology. This shows that my study, which recorded the gender of students and compared the ratio of men and women against a fixed probability of 50%, meets at least one criterion of a properly designed experiment, namely that it rarely fails to reject the null-hypothesis.
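
The test behind this classroom example is a simple binomial test against a fixed probability of 50%. The numbers below are hypothetical, not my actual enrollment data:

# hypothetical class: 150 women among 200 students
binom.test(150, 200, p = .5)   # p-value far below .05, rejecting a 50/50 split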

In short, it is hard to argue with the proposition that replicability is an important criterion for a good study.  If study results cannot be replicated, it is not clear whether a phenomenon exists, and if it is not clear whether a phenomenon exists, it is impossible to make theoretical predictions about other phenomena based on this phenomenon.  For example, we cannot predict gender differences in professions that require a psychology degree if we do not have replicable evidence that there is a gender difference in psychology students.

The present analysis extends FER2015’s “error balance” logic to emphasize tradeoffs among features of a high-quality science (among scientific desiderata). When seeking to optimize the quality of our science, scholars must consider not only how a given research practice influences replicability, but also how it influences other desirable features.

A small group of social relationship researchers (Finkel, Eastwick, & Reis; henceforth FER) are concerned about the recent shift in psychology from a scientific discipline that ignored replicability entirely to a field that actually cares about the replicability of results published in original research articles.  Although methodologists have criticized psychology for a long time, it was only after Bem (2011) published extraordinarily unbelievable results that psychologists finally started to wonder how replicable published results actually are.  In response to this new focus on replicability, several projects have conducted replication studies with shocking results. In FER’s research area, replicability is estimated to be as low as 25%. That is, three-quarters of published results are not replicable and require further research efforts to examine why original studies and replication studies produced inconsistent results.  In a large-scale replication study, one of the authors’ original findings failed to replicate, and the replication studies cast doubt on theoretical assumptions about the determinants of forgiveness in close relationships.

FER ask “Should Scientists Consistently Prioritize Replicability Above Other Core Features?”

As FER are substantive researchers with little background in research methodology, it may be understandable that they do not mention important contributions by methodologists like Jacob Cohen.  Cohen’s answer is clear.  Less is more, except for sample size.  This statement makes it clear that replicability is necessary for a good study.  According to Cohen a study design can be perfect in many ways (e.g., elaborate experimental manipulation of real-world events with highly valid outcome measures), but if the sample size is small (e.g., N = 3), the study simply cannot produce results that can be used as an empirical foundation for theories.  If a study cannot reject the null-hypothesis with some degree of confidence, it is impossible to say whether there is a real effect or whether the result was just caused by random factors.

Unfazed by their lack of knowledge about research methodology, FER take a different view.

In our view, the field’s discussion of best research practices should revolve around how we prioritize the various features of a high-quality science and how those priorities may shift across our discipline’s many subfields and research contexts.

Similarly, requiring very large sample sizes increases replicability by reducing false-positive rates and increases cumulativeness by reducing false-negative rates, but it also reduces the number of studies that can be run with the available resources, so conceptual replications and real-world extensions may remain unconducted.

So, who is right? Should researchers follow Cohen’s advice and conduct a small number of studies with large samples, or is it better to conduct a large number of studies with small samples? If resources are limited and a researcher can collect data from 500 participants in one year, should the researcher conduct one study with N = 500, five studies with N = 100, or 25 studies with N = 20?  FER suggest that we have a trade-off between replicability and discoveries.
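
One rough way to compare these three options is to compute the power of each design for an assumed true effect and multiply it by the number of studies that the 500 participants allow. This is only a sketch; the effect size of d = .4 is borrowed from the simulations later in this post and is my assumption, not FER's.

# expected number of significant results from a budget of 500 participants,
# assuming two-group studies and a true effect of d = .4
designs <- c(one_large = 500, five_medium = 100, many_small = 20)   # total N per study
power   <- sapply(designs, function(N)
             power.t.test(n = N / 2, delta = .4, sig.level = .05)$power)
round(power, 2)                    # about .99, .51, and .13 per study
round(power * 500 / designs, 1)    # about 1, 2.5, and 3.3 significant results in total

Under these assumptions, the 25-small-studies strategy produces the most significant results when the effect is real, but, as discussed below, each individual small study is unlikely to replicate, and false positives are not yet taken into account here.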

Also, large sample size norms and requirements may limit the feasibility of certain sorts of research, thereby reducing discovery.

This is true, if we consider true and false discoveries as discoveries (FER do not make a distinction).  Bem (2011) discovered that human minds can time travel. This was a fascinating discovery, yet it was a false discovery. Bem (2001) himself advocated the view that all discoveries are valuable, even false discoveries (Let’s err on the side of discovery.).  Maybe FER learned about research methods from Bem’s chapter.  Most scientists and lay people, however, value true discoveries over false discoveries.  Many people would feel cheated if the Moon landing was actually faked, for example, and if billions spent on cancer drugs are not helping to fight cancer (it really was just eating garlic).  So, the real question is whether many studies with small samples produce more true discoveries than a single study with a large sample.

This question was examined by LeBel, Campbell, and Loving (2015), who concluded largely in favor of Cohen’s recommendation that a slow approach with fewer studies and high replicability is advantageous for a cumulative science.

For example, LCL2016’s Table 3 shows that the N-per-true discovery decreases from N=1,742 when the original research is statistically powered at 25% to N=917 when the original research is statistically powered at 95%.

FER criticize that LCL focused on the efficient use of resources for replication studies and ignored the efficient use of resources for original research.  As many researchers are often doing more than one study on a particular research question, the distinction between original researchers and replication researchers is artificial. Ultimately, researchers may conduct a number of studies. The studies can be totally new, conceptual replications of previous studies, or exact replications of previous studies. A significant result will always be used to claim a discovery. When a non-significant result contradicts a previous significant result, additional research is needed to examine whether the original result was a false discovery or whether the replication result was a false negative.

FER observe that “original researchers will be more efficient (smaller N-per-true discovery) when they prioritize lower-powered studies. That is, when assuming that an original researcher wishes to spend her resources efficiently to unearth many true effects, plans never to replicate her own work, and is insensitive to the resources required to replicate her studies, she should run many weakly powered studies.”

FER may have discovered why some researchers, including themselves, pursue a strategy of conducting many studies with relatively low power.  It produces many discoveries that can be published.  It also produces many non-significant results that do not lead to a discovery. But the absolute number of true discoveries is still likely to be greater than the single true discovery of a researcher who conducted only one study.  The problem is that these researchers are also likely to make more false discoveries than the researcher who conducts only one study.  They just make more discoveries, true discoveries and false discoveries, and replication studies are needed to examine whether the results are true discoveries or false discoveries.  When other researchers conduct replication studies and fail to replicate an effect, further resources are needed to examine why the replication study produced a non-significant result. However, this is not a problem for discoverers who are only in the business of testing new and original hypotheses and reporting those that produced a significant result, leaving it to other researchers to examine which of these discoveries are true or false.  These researchers were rewarded handsomely in the years before Bem (2011) because nobody wanted to be in the business of conducting replication studies. As a result, all discoveries produced by original researchers were treated as if they would replicate, and researchers with a high number of discoveries were treated as researchers with more true discoveries. There just was no distinction between true and false discoveries, and it made sense to err on the side of discovery.

Given the conflicting efficiency goals between original researchers and replicators, whose goals shall we prioritize?

This is a bizarre question.  The goal of science is to uncover the truth and to create theories that rest on a body of replicable, empirical findings.  Apparently, this is not the goal of original researchers.  Their goal is to make as many discoveries as possible and to leave it to replicators to test which of these discoveries are replicable.  This division is not very appealing, and few scientists want to be the maid of original scientists and clean up their mess when they do cooking experiments in the kitchen.  Original researchers should routinely replicate their own results, and when they do so with small studies, they suddenly face the same problem as replicators: they end up with non-significant results and now have to conduct further studies to uncover the reasons for these discrepancies.  FER seem to agree.

We must prioritize the field’s efficiency goals rather than either the replicator’s or the original researcher’s in isolation. The solid line in Figure 2 illustrates N-per-true-discovery from the perspective of the field—when the original researcher’s 5,000 participants are added to the pool of participants used by the replicator. This line forms a U-shaped pattern, suggesting that the field will be more efficient (smaller N-per true-discovery) when original researchers prioritize moderately powered studies.

This conclusion is already implied in Cohen’s power calculations.  The reason is that studies with very low power have a low chance of getting a significant result. As a result, resources are wasted on these studies and it would have been better not to conduct them, especially when we take into account that each study requires a new ethics approval, training of personnel, data analysis time, etc.  All of these costs multiply with the number of studies that are conducted to get a significant result.  At the other extreme, power shows diminishing returns as sample size increases. This means that once power has reached a certain level, it requires more and more resources to increase power even further. Moreover, 80% power means that 8 out of 10 studies are significant and 90% power means that 9 out of 10 studies are significant. The extra costs of increasing power to 90% may not warrant the increase in the success rate from 8 to 9 out of 10 studies.  For this reason, Cohen did not really suggest that infinite sample sizes are optimal. Instead, he suggested that researchers should aim for 80% power. That is, 4 out of 5 studies that examine a real effect show a significant result.
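
The diminishing returns can be checked directly with base R's power.t.test; the effect size of d = .4 is again an assumption for illustration.

# per-group sample size for a two-sample t-test with d = .4
power.t.test(delta = .4, power = .80)$n   # about 99 per group
power.t.test(delta = .4, power = .90)$n   # about 132 per group, i.e., roughly a third more participants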

However, FER’s simulations come to a different conclusion.  Their Figure suggests that studies with 30% power are just as good as studies with 70% power and could be even better than studies with 80% power.

For example, if a hypothesis is 75% likely to be true, which might be the case if the finding had a strong theoretical foundation, the most efficient use of field-wide N appears to favor power of ~25% for d=.41 and ~40% for d=.80.

The problem with taking these results seriously is that the criterion N per true discovery does not take into account the costs of a type-I error.  Conducting many studies with small samples and low power can produce a larger number of significant results than a smaller number of studies with large samples, simply due to the larger number of studies. However, it also implies a higher rate of false positives.  Thus, it is important to take the seriousness of a type-I error or a type-II error into account.

So, let’s use a scenario where original results need to be replicated. In fact, many journals require at least two, if not more, significant results to provide evidence for an effect.  The researcher who conducts many studies with low power has a problem because the probability of obtaining two significant results in a row is only the square of the power.  Even if a single significant result is reported, other researchers need to replicate this finding, and many of these replication studies will fail until eventually a replication study with a significant result corroborates the original finding.
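
The power-squared point is easy to verify:

c(high_power = .80^2, low_power = .20^2)   # chance of two significant results in a row: .64 vs. .04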

In a simulation with d = .4 and an equal proportion of null-hypotheses and real effects, a researcher with 80% power (N = 200, d = .4, alpha = .05, two-tailed) needs about 900 participants for every true discovery.  A researcher with 20% power (N = 40, d = .4, alpha = .05, two-tailed) needs about 1,800 participants for every true discovery.

When the rate of true null-results decreases, the number of true discoveries increases and it is easier to make true discoveries.  Nevertheless, the advantage of high powered studies remains. It takes about half as many participants to make a true discovery with high powered studies as with low powered studies (N = 665 vs. 1,157).
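
The scenario described above can be reconstructed with a few lines of R. This is my sketch of the setup (two-sample t-tests, one original study plus one replication attempt whenever the original study is significant, and a "true discovery" defined as a true effect that is significant twice in a row); it is not the exact code behind the reported numbers, but it produces values in the same ballpark.

n_per_true_discovery <- function(N, d = .4, prop_true = .5, alpha = .05) {
  power <- power.t.test(n = N / 2, delta = d, sig.level = alpha)$power
  # expected participants per hypothesis: the original study plus a
  # replication that is only run when the original study is significant
  exp_N <- prop_true * (N + power * N) + (1 - prop_true) * (N + alpha * N)
  # expected true discoveries per hypothesis: a true effect that is significant twice
  exp_true <- prop_true * power^2
  exp_N / exp_true
}
round(n_per_true_discovery(N = 200))                    # close to the ~900 above
round(n_per_true_discovery(N = 40))                     # on the order of the ~1,800 above
round(n_per_true_discovery(N = 200, prop_true = .75))   # close to the 665 above
round(n_per_true_discovery(N = 40,  prop_true = .75))   # close to the 1,157 above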

The reason for the discrepancy between my results and FER’s results is that they do not take replicability into account. This is ironic because their title suggests that they are going to write about replicability, when they actually ignore that results from small studies with low power have low replicability. That is, if we only try to get a result once, it can be more efficient to do so with small, underpowered studies because random sampling error will often dramatically inflate effect sizes and produce a significant result. However, this inflation is not replicable, and replication studies are likely to produce non-significant results and cast doubt on the original finding.  In other words, FER ignore the key characteristic of replicability, namely that replication studies of the same effect should produce significant results again.  Thus, FER’s argument is fundamentally flawed because it ignores the very concept of replicability. Low powered studies are less replicable, and original studies that are not replicable make it impossible to create a cumulative science.

The problems of underpowered studies increase exponentially in a research environment that rewards the publication of discoveries, whether they are true or false, and provides no incentives for researchers to publish non-significant results, even if these non-significant results challenge the significant results of an original article.  Rather than treating these unsuccessful replications as a warning sign that the original results might have been false positives, the non-significant result is treated as evidence that the replication study must have been flawed; after all, the original study found the effect.  Moreover, the replication study might just have low power and the effect exists.  As a result, false positive results can poison theory development because theories have to explain findings that are actually false positives, and researchers continue to conduct unsuccessful replication studies because they are unaware that other researchers have already failed to replicate an original false positive result.  These problems have been discussed at length in recent years, but FER blissfully ignore these arguments and discussions.

Since 2011, psychological science has witnessed major changes in its standard operating procedures—changes that hold great promise for bolstering the replicability of our science. We have come a long way, we hope, from the era in which editors routinely encouraged authors to jettison studies or variables with ambiguous results, the file drawer received only passing consideration, and p<.05 was the statistical holy of holies. We remain, as in FER2015, enthusiastic about such changes. Our goal is to work alongside other meta-scientists to generate an empirically grounded, tradeoff-based framework for improving the overall quality of our science.

That sounds good, but it is not clear what FER bring to the table.

We must focus greater attention on establishing which features are most important in a given research context, the extent to which a given research practice influences the alignment of a collective knowledge base with each of the relevant features, and, all things considered, which research practices are optimal in light of the various tradeoffs involved. Such an approach will certainly prioritize replicability, but it will also prioritize other features of a high-quality science, including discovery, internal validity, external validity, construct validity, consequentiality, and cumulativeness.

What is lacking here is a demonstration that it is possible to prioritize internal validity, external validity, consequentiality, and cumulativeness without replicability. How do we build on results that emerge only in one out of two, three, or five studies, let alone 1 out of 10 studies?  FER create the illusion that we can make more true discoveries by conducting many small studies with low power.  This is true in the limited sense of needing fewer participants for an initial discovery. But their own criterion of cumulativeness implies that we are not interested in a single finding that may or may not replicate. To build on original findings, others should be able to redo a study and get a significant result again.  This is what Fisher had in mind and what Neyman and Pearson formalized into power analysis.

FER also overlook a much simpler solution to balance the rate of original discovery and replicability.  Namely, researchers can increase the type-I error rate from the conventional 5% criterion to 20% (or more).  As the type-I error rate increases, power increases.  At the same time, readers are properly warned that the results are only suggestive, definitely require further research, and cannot be treated as evidence that needs to be incorporated in a theory.  At the same time, researchers with large samples do not have to waste their resources on rejecting H0 with alpha = .05 and 99.9% power. They can use their resources to make more definitive statements about their data and reject H0 with a p-value that corresponds to 5 standard deviations of a standard normal distribution (the 5 sigma rule in particle physics).
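
Both points are easy to check in R (again assuming a two-group design with d = .4 for illustration):

# relaxing alpha from .05 to .20 raises the power of the same small study
power.t.test(n = 20, delta = .4, sig.level = .05)$power   # about .23
power.t.test(n = 20, delta = .4, sig.level = .20)$power   # about .49
# the two-sided p-value corresponding to a 5-sigma result
2 * pnorm(-5)                                             # about 6e-7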

No matter what the solution to the replicability crisis in psychology is, the solution cannot be a continuation of the old practice to conduct numerous statistical tests on a small sample and then report only the results that are statistically significant at p < .05.  It is unfortunate that FER’s article can be easily misunderstood as suggesting that using small samples and testing for significance with p < .05 and low power can be a viable research strategy in some specific context.  I think they failed to make their case and to demonstrate in which research context this strategy benefits psychology.

The decline effect in social psychology: Evidence and possible explanations

The decline effect predicts that effects become weaker over time.  It has been proposed as a viable explanation for the replication crisis (Lehrer, 2010).  However, evidence for the decline effect has been elusive (Schooler, 2011).  One major problem, at least in psychology, is that researchers rarely conduct exact replication studies of original studies.  However, in recent years, psychologists have started to conduct Registered Replication Reports (RRRs), in which an original study is replicated by several labs as closely as possible.  This makes it possible to examine the decline effect.  The decline effect predicts that original studies have larger effect sizes than replication studies.

One problem is that these studies often have small samples and large sampling error.  This makes it difficult to interpret observed effect sizes. One solution to this problem is to focus on the rank of the original effect size relative to the effect sizes in the replication studies.  According to the decline effect, effect sizes in original studies should be higher than effect sizes in replication studies.  In the most extreme case, the original study would have the largest effect size.  If there were 20 studies with the same population effect size, the probability that the original study reports the largest observed effect is only 1/20 = .05.

Method

I ordered the effect sizes of the original study and the replication studies in decreasing order and recorded the rank of the original study.  In R, with d as the vector of effect sizes and the original study in position 1, this rank is given by: rank_original <- which(order(d, decreasing = TRUE) == 1)  # 1 = position of the original study in d

Results

The results are shown in Table 1. For 5 out of 6 RRRs, the original study reported the largest effect size.  In all of these RRRs, all of the replication studies failed to replicate a significant effect.  Only the second verbal overshadowing RRR produced conclusive evidence for an effect. Yet, the effect size reported in the original study was still the third largest out of 24 studies.  These results provide strong support for the decline effect.

To examine whether this pattern of results could have occurred by chance, I computed the probability of this outcome under the null-hypothesis that all studies have the same population effect size.  The chance that the original study has the largest effect size is 1/n, with n = number of studies.  These probabilities are very low.  For the verbal overshadowing RRR2, the probability that the original study is among the three largest effect sizes is .12 (1 – (23*22*21)/(24*23*22) = 3/24).  A meta-analysis of the six probabilities with Stouffer’s method provides strong evidence against the null-hypothesis, z = 3.8, p < .0001.

Table 1

Verbal Overshadowing RRR1: 1 out of 33, p = .03
Verbal Overshadowing RRR2: 3 out of 24, p = .12
Ego Depletion: 1 out of 24, p = .04
Imperfect Action: 1 out of 13, p = .08
Commitment/Forgiveness: 1 out of 17, p = .06
Facial Feedback: 1 out of 18, p = .06
Combined: 1 out of 14,122, p = .00007
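
The Stouffer combination reported above can be reproduced from the p-values in Table 1 with a few lines of R:

p <- c(.03, .12, .04, .08, .06, .06)      # p-values for the rank of the original study (Table 1)
z <- qnorm(1 - p)                         # convert each p-value to a z-score
z_combined <- sum(z) / sqrt(length(z))    # Stouffer's method
z_combined                                # about 3.8
pnorm(z_combined, lower.tail = FALSE)     # about .00007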

Discussion

A test of the decline effect with the data from all Registered Replication Reports provides strong evidence for the hypothesis that effect sizes of original studies are larger than those of replication studies and that effects decrease over time.

One explanation for the decline effect is that effects enter collective consciousness and then disappear.  Take ego-depletion: initially, performing a difficult task led to a reduction in effort on a second task, but collective consciousness about this effect means that participants are now aware of it and compensate for it by working harder.  This theory is consistent with the fact that the decline effect is pervasive in social psychology, but not in other sciences. For example, the effect of eating cheesecake on weight gain has unfortunately not decreased, as the obesity epidemic shows.  Also, computers are getting faster, not slower. Thus, not all cause-effect relationships decline over time.

It is only for cause-effect relationships involving mental processes that collective consciousness can moderate the strength of the relationship.  Thus, the collective consciousness hypothesis suggests that the replication crisis in psychology is not really a replication crisis; the original effects were real phenomena.  The original studies did make a real discovery, but ironically the discovery made the effect disappear.

Limitations

This study has a number of limitations, and there are alternative explanations for the finding that seminal articles report stronger effect sizes.  One possibility is regression to the mean (Fiedler). Regression to the mean implies that an observed effect size in a small sample will not replicate with the same effect size; the next study is more likely to produce a result that is closer to the mean.  The problem with this hypothesis is that it does not explain why the mean of the replication studies is often very close to zero.  Thus, it fails to explain the mysterious disappearance of effects and the elusive nature of findings in social psychology that make the decline effect so interesting.

Another possible explanation is publication bias. Maybe researchers are simply publishing results that are consistent with their theories and they do not publish disconfirming evidence (Sterling, 1959).  However, this explanation does not explain the fact that at the time of the original studies other studies reported successful results.  In fact, many of the RRR studies were taken from articles that reported several successful studies.  The failure to replicate the effect occurred only several years later when there was sufficient time for collective consciousness to make the effect disappear.

Finally, Schooler (personal communication, 2012) proposed an interesting theory.  Astrophysicists have calculated that it is very likely that other intelligent life evolved in other parts of the universe way before human evolution.  Like humans now, these intelligent life forms were getting increasingly bored with their limited reality and started building artificially simulated virtual worlds to entertain themselves.  At some point, agents in these games were given the illusion of self-consciousness: they believe that they are real agents with their own goals, feelings, and thoughts.  According to this theory, we are not real agents, but virtual agents in a computer game of a much more intelligent life form. Although the simulation software works very well, there are some bugs and glitches that make the simulation behave in strange ways. Often the simulated agents do not notice this, but clever experiments by parapsychologists (Bem, 2011) can sometimes reveal these inconsistencies.  Many of the discoveries in social psychology are also caused by these glitches.  The effects can be observed for some time, but then a software update makes them disappear.  This theory would also explain why original results disappeared in replication studies.

Future Research

It is difficult to distinguish empirically between the collective consciousness hypothesis and the simulated-world hypothesis.  However, the two theories make different predictions about findings that do not enter collective consciousness.  A researcher could conduct a study, but not analyze the data, and replicate the study 10 years later. Only then would the results of the two studies be analyzed. The collective consciousness hypothesis predicts that there will be no decline effect.  The simulated-world hypothesis predicts that the decline effect will emerge.  Of course, a single original study is most likely to show no effect because it is very difficult to find original effects that are subject to the decline effect.  Thus, it requires many studies that will not show any effect, but when original studies do show an effect, it will be very interesting to see whether they replicate. If they do not replicate, it provides evidence for the simulated-world hypothesis that we are just simulated agents in a computer game of a life-form much more intelligent than we think we are.  So, I propose that social psychologists plan a series of carefully designed time-lagged replication studies to answer the most fundamental question of humanity: Do we really exist because we think we do, or is it all a big illusion?

Fritz Strack’s self-serving biases in his personal account of the failure to replicate his most famous study.

[please hold pencil (pen does not work) like this while reading this blog post]

In “Sad Face: Another classic finding in psychology—that you can smile your way to happiness—just blew up. Is it time to panic yet?” by Daniel Engber, Fritz Strack gets to tell his version of the importance of his original study and of what it means that a recent attempt with 17 independent replication studies failed to replicate his original results.   In this blog post, I provide my commentary on Fritz Strack’s story to reveal inconsistencies, omissions of important facts, and false arguments that are used to discount the results of the replication studies.

PART I:  Prior to the Replication of Strack et al. (1988)

In 2011, many psychologists lost confidence in social psychology as a science.  One social psychologist had fabricated data at midnight in his kitchen.  Another presented incredible results that people can foresee random events in the future.  And finally, a researcher failed to replicate a famous study where subtle reminders of elderly people made students walk more slowly.  A New Yorker article captured the mood of the time.  It wasn’t clear which findings one should believe and which would replicate under close scrutiny.  In response, psychologists created a new initiative to replicate original findings across many independent labs.  A first study produced encouraging results.  Many classic findings in psychology (like the anchoring effect) replicated, sometimes even with stronger effect sizes than in the original study.  However, some studies didn’t replicate.  In particular, results from a small group of social psychologists who had built their careers around the idea that small manipulations can have strong effects on participants’ behavior without participants’ awareness (such as the elderly priming study) did not replicate well.   The question was which results from this group of social psychologists who study unconscious or implicit processes would replicate.

Quote “The experts were reluctant to step forward. In recent months their field had fallen into scandal and uncertainty: An influential scholar had been outed as a fraud; certain bedrock studies—even so-called “instant classics”—had seemed to shrivel under scrutiny. But the rigidity of the replication process felt a bit like bullying. After all, their work on social priming was delicate by definition: It relied on lab manipulations that had been precisely calibrated to elicit tiny changes in behavior. Even slight adjustments to their setups, or small mistakes made by those with less experience, could set the data all askew. So let’s say another lab—or several other labs—tried and failed to copy their experiments. What would that really prove? Would it lead anyone to change their minds about the science?”

The small group of social psychologists felt under attack.  They had published hundreds of articles and become famous for demonstrating the influence of unconscious processes that by definition were ignored by people when they tried to understand their own behaviors because they operated in secrecy, undetected by conscious introspection.  What if all of their amazing discoveries were not real?  Of course, the researchers were aware that not all studies worked. After all, they often encountered failures to find these effects in their own labs.  It often required several attempts to get the right conditions to produce results that could be published.  If a group of researchers would just go into the lab and do the study once, how would we know that they did everything right? Given ample evidence of failure in their own labs, nobody from this group wanted to step forward and replicate their own study or subject their study to a one-shot test.

Quote “Then on March 21, Fritz Strack, the psychologist in Wurzburg, sent a message to the guys. “Don’t get me wrong,” he wrote, “but I am not a particularly religious person and I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’ ” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.” So if the skeptics wanted something to examine—a test case to stand in for all of social-psych research—then let them try his work.”

Fritz Strack was not afraid of failure.  He volunteered his most famous study for a replication project.

Quote “ In 1988, Strack had shown that movements of the face lead to movements of the mind. He’d proved that emotion doesn’t only go from the inside out, as Malcolm Gladwell once described it, but from the outside in.”

It is not exactly clear why Strack picked his 1988 study for replication.  The article included two studies. The first study produced a result that is called marginally significant.  That is, it did not meet the standard criterion of evidence, a p-value less than .05 (two-tailed).  But the p-value was very close to .05 and less than .10 (or .05 one-tailed).   This finding alone would not justify great confidence in the replicability of the original finding.  Moreover, a small study with so much noise makes it impossible to estimate the true effect size. The observed effect size in the study was large, but this could have been due to luck (sampling error).  In a replication study, the effect size could be a lot smaller, which would make it difficult to get a significant result in a replication study.

The key finding of this study was that manipulating participants’ facial muscles appeared to influence their feelings of amusement in response to funny cartoons without participants’ awareness that their facial muscles contributed to the intensity of the experience.  This finding made sense in the context of a long tradition of theories that assumed feedback from facial muscles plays an important role in the experience of emotions. 

Strack seemed to be confident that his results would replicate because many other articles also reported results that seemed to support the facial feedback hypothesis.  His study became famous because it used an elaborate cover story to ensure that the effect occurred without participants’ awareness.

Quote: “In lab experiments, facial feedback seemed to have a real effect…But Strack realized that all this prior research shared a fundamental problem: The subjects either knew or could have guessed the point of the experiments. When a psychologist tells you to smile, you sort of know how you’re expected to feel.”

Strack was not the first to do so. 

Quote: “In the 1960s, James Laird, then a graduate student at the University of Rochester, had concocted an elaborate ruse: He told a group of students that he wanted to record the activity of their facial muscles under various conditions, and then he hooked silver cup electrodes to the corners of their mouths, the edges of their jaws, and the space between their eyebrows. The wires from the electrodes plugged into a set of fancy but nonfunctional gizmos… Subjects who had put their faces in frowns gave the cartoons an average rating of 4.4; those who put their faces in smiles judged the same set of cartoons as being funnier—the average jumped to 5.5.”

A change of 1.1 points on a rating scale is a huge effect, and consistent results across different studies would suggest that the effect can be easily replicated.   The point of Strack’s study was not to demonstrate the effect, but to improve the cover story that made it difficult for participants to guess the real purpose of the study.

“Laird’s subterfuge wasn’t perfect, though. For all his careful posturing, it wasn’t hard for the students to figure out what he was up to. Almost one-fifth of them said they’d figured out that the movements of their facial muscles were related to their emotions. Strack and Martin knew they’d have to be more crafty. At one point on the drive to Mardi Gras, Strack mused that maybe they could use thermometers. He stuck his finger in his mouth to demonstrate.  Martin, who was driving, saw Strack’s lips form into a frown in the rearview mirror. That would be the first condition. Martin had an idea for the second one: They could ask the subjects to hold thermometers—or better, pens—between their teeth. This would be the stroke of genius that produced a classic finding in psychology.”

So in a way, Strack et al.’s study was a conceptual replication study of Laird’s study that used a different manipulation of facial muscles. And the replication study was successful.

“The results matched up with those from Laird’s experiment. The students who were frowning, with their pens balanced on their lips, rated the cartoons at 4.3 on average. The ones who were smiling, with their pens between their teeth, rated them at 5.1. What’s more, not a single subject in the study noticed that her face had been manipulated. If her frown or smile changed her judgment of the cartoons, she’d been totally unaware.”

However, even though the effect size is still large, a difference of .8 rating points, the effect was only marginally significant.  A second study by Strack et al. also produced only a marginally significant result. Thus, we may start to wonder why the researchers were not able to produce stronger evidence for the effect, evidence that would be significant at the conventional criterion that is required for claiming a discovery, p < .05 (two-tailed).   And why did this study become a classic without stronger evidence that the effect is real and that the effect is really as large as the reported effect sizes in these studies?  The effect size may not matter for basic research studies that merely want to demonstrate that the effect exists, but it is important for applications in the real world. If an effect is large under strictly controlled laboratory conditions, the effect is going to be much smaller in real world situations where many of the factors that are controlled in the laboratory also influence emotional experiences.  This might also explain why people normally do not notice the contribution of their facial expressions to their experiences.  Relative to their mood, the funniness of a joke, the presence of others, and a dozen more contextual factors that influence our emotional experiences, feedback from facial muscles may make a very small contribution to emotional experiences.  Strack seems to agree.

Quote “It was theoretically trivial,” says Strack, but his procedure was both clever and revealing, and it seemed to show, once and for all, that facial feedback worked directly on the brain, without the intervention of the conscious mind. Soon he was fielding calls from journalists asking if the pen-in-mouth routine might be used to cure depression. He laughed them off. There are better, stronger interventions, he told them, if you want to make a person happy.”

Strack may have been confident that his study would replicate because other publications used his manipulation and also reported significant results.  And researchers even proposed that the effect is strong enough to have practical implications in the real world.  One study even suggested that controlling facial expressions can reduce prejudice.

Quote: “Strack and Martin’s method would eventually appear in a bewildering array of contexts—and be pushed into the realm of the practical. If facial expressions could influence a person’s mental state, could smiling make them better off, or even cure society’s ills? It seemed so. In 2006, researchers at the University of Chicago showed that you could make people less racist by inducing them to smile—with a pen between their teeth—while they looked at pictures of black faces.”

The result is so robust that replicating it is a piece of cake, a walk in the park, and works even in classroom demonstrations.

“Indeed, the basic finding of Strack’s research—that a facial expression can change your feelings even if you don’t know that you’re making it—has now been reproduced, at least conceptually, many, many times. (Martin likes to replicate it with the students in his intro to psychology class.)”

Finally, Strack may have been wrong when he laughed off questions about curing depression with controlling facial muscles.  Apparently, it is much harder to commit suicide if you put a pen in your mouth to make yourself smile.

Quote: “In recent years, it has even formed the basis for the treatment of mental illness. An idea that Strack himself had scoffed at in the 1980s now is taken very seriously: Several recent, randomized clinical trials found that injecting patients’ faces with Botox to make their “frown lines” go away also helped them to recover from depression.”

So, here you have it. If you ignore publication bias and treat the mountain of confirmatory evidence with a 100% success rate in journals as credible evidence, there is little doubt that the results would replicate. Of course, by the same standard of evidence there is no reason to doubt that other priming studies would replicate, which they did until a group of skeptical researchers tried to replicate the results and failed to do so. 

Quote: “Strack found himself with little doubt about the field. “The direct influence of facial expression on judgment has been demonstrated many, many times,” he told me. “I’m completely convinced.” That’s why he volunteered to help the skeptics in that email chain three years ago. “They wanted to replicate something, so I suggested my facial-feedback study,” he said. “I was confident that they would get results, so I didn’t know how interesting it would be, but OK, if they wanted to do that? It would be fine with me.”

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PART II:  THE REPLICATION STUDY

The replication project was planned by EJ Wagenmakers, who made his name as a critic of research practices in social psychology in response to Bem’s (2011) incredible demonstration of feelings that predict random future events.  Wagenmakers believes that many published results are not credible because the studies failed to test theoretical predictions. Social psychologists would run many studies and publish the results when they found a significant result with p < .05 (at least one-tailed).  When the results were not significant, the study was considered a failure and was not reported.  This practice makes it difficult to predict which results are real and will replicate and which results are not real and will not replicate.  Wagenmakers estimated that the facial feedback study had a 30% chance to replicate.

Quote: “Personally, I felt that this one actually had a good chance to work,” he said. How good a chance? I gave it a 30-percent shot.” [Come again.  A good chance is 30%?]

A 30% probability may be justified because a replication project by the Open Science Collaboration found that only 25% of social psychological results were successfully replicated.  However, that project used only slightly larger samples than the original studies.  In the replication of the facial feedback hypothesis, 17 labs were going to replicate the original study with larger samples and a combined total of nearly 2,000 participants.  The increase in sample size increases the chance of producing a significant result even if the effect size of the original study was vastly inflated.  If a result is not significant with 2,000 participants, it becomes possible to say that the effect may not exist at all, or that the effect size is so small as to be practically meaningless and certainly irrelevant for the treatment of depression.  Thus, the prediction of only a 30% chance of success implies that Wagenmakers was very skeptical about the original results and expected a drastic reduction in the effect size.
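
To make this concrete, here is a minimal power sketch in Python (using a normal approximation for a two-sample comparison). The effect sizes and group sizes below are my own illustrative assumptions, not the actual values from the original study or the replication project; the point is simply that a pooled sample of about 2,000 participants has high power even for fairly small effects, so a null result in a sample of that size implies a tiny effect at best.

    # Rough power of a two-sample comparison (normal approximation, alpha = .05 two-tailed).
    # Effect sizes (d) and per-group sample sizes are illustrative assumptions only.
    from scipy.stats import norm

    def power_two_sample(d, n_per_group, alpha=0.05):
        ncp = d * (n_per_group / 2) ** 0.5      # noncentrality of the test statistic
        z_crit = norm.ppf(1 - alpha / 2)
        return 1 - norm.cdf(z_crit - ncp)

    for d in (0.5, 0.2, 0.1):                   # inflated, modest, and tiny true effects
        for n in (45, 1000):                    # roughly original-sized vs. pooled-sized groups
            print(f"d={d}, n per group={n}, power={power_two_sample(d, n):.2f}")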

Quote: “In a sense, he was being optimistic. Replication projects have had a way of turning into train wrecks. When researchers tried to replicate 100 psychology experiments from 2008, they interpreted just 39 of the attempts as successful. In the last few years, Perspectives on Psychological Science has been publishing “Registered Replication Reports,” the gold standard for this type of work, in which lots of different researchers try to re-create a single study so the data from their labs can be combined and analyzed in aggregate. Of the first four of these to be completed, three ended up in failure.”

There were good reasons to be skeptical.  First, the facial feedback theory is controversial. There are two camps in psychology. One camp assumes that emotions are generated in the brain in direct response to cognitive appraisals of the environment. The other argues that emotional experiences are based on bodily feedback.  The controversy goes back to James versus Cannon and led to the famous Lazarus-Zajonc debate in the 1980s, at the beginning of modern emotion research.  There is also the problem that it is statistically improbable that Strack et al. (1988) would obtain marginally significant results twice in a row in two independent studies.  Sampling error makes p-values move around, and the chance of getting a p-value between .05 and .10 twice in a row is slim. This suggests that the evidence was at least partially obtained with the help of a healthy dose of sampling error and that a replication study would produce weaker effect sizes.
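
To see just how slim, here is a small sketch (my own illustration, assuming the test statistic is approximately normal and defining "marginally significant" as a two-tailed p between .05 and .10, i.e., a z value between about 1.65 and 1.96). Whatever the true effect, the chance of landing in that narrow band twice in a row never exceeds about 2 percent.

    # Probability that a roughly normal test statistic lands in the "marginal" band
    # (1.645 < z < 1.960, i.e., .05 < two-tailed p < .10) once, and twice in a row,
    # for a few assumed values of the true noncentrality (illustrative choices only).
    from scipy.stats import norm

    for ncp in (0.0, 1.0, 1.8, 2.5):
        p_band = norm.cdf(1.960 - ncp) - norm.cdf(1.645 - ncp)
        print(f"ncp={ncp}: once={p_band:.3f}, twice in a row={p_band**2:.4f}")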

Quote: The work on facial feedback, though, had never been a target for the doubters; no one ever tried to take it down. Remember, Strack’s original study had confirmed (and then extended) a very old idea. His pen-in-mouth procedure worked in other labs.

Strack also had some reasons why the replication project might not produce straight replications of his findings; he claims that the original study did not produce a huge effect.

Quote: “He acknowledged that the evidence from the paper wasn’t overwhelming—the effect he’d gotten wasn’t huge. Still, the main idea had withstood a quarter-century of research, and it hadn’t been disputed in a major, public way. “I am sure some colleagues from the cognitive sciences will manage to come up with a few nonreplications,” he predicted. But he thought the main result would hold.”

But that is wrong.  The study did produce a surprisingly huge effect.  It just didn’t produce strong evidence that this effect was caused by facial feedback rather than by problems with the random assignment of participants to conditions.  His sample sizes were so small that the large effect amounted to only a bit more than 1.5 times its standard error, which is just enough to claim a discovery with p < .05 one-tailed, but short of the roughly 2 standard errors needed to claim a discovery with p < .05 two-tailed.   So, the reported effect size was huge, but the strength of evidence was not.  Taking the reported effect size at face value, one would predict that only about every other replication attempt would produce a significant result and the rest would fail.  So even if the true effect size were as large as the effect size reported by Strack et al., only about half of the 17 laboratories would be able to claim a successful replication on their own.  As sample sizes were a bit larger in the replication studies, the percentage would be a bit higher, but clearly nobody should expect all labs to individually produce at least marginally significant results.  In fact, it is statistically improbable that Strack obtained two (marginally) significant results in his two reported studies.
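
A back-of-the-envelope calculation makes the "every other study" prediction explicit. If we take the observed strength of evidence (a test statistic of roughly 1.7 standard errors, an assumed value based on the marginal p-values) as the best guess of the true noncentrality, the expected power of an exact, same-sized replication is only around 40 to 50 percent.

    # Expected replication power if the observed test statistic (assumed here to be
    # about z = 1.7, consistent with marginal significance) equals the true noncentrality.
    from scipy.stats import norm

    observed_z = 1.7
    power_one_tailed = 1 - norm.cdf(1.645 - observed_z)   # p < .05 one-tailed criterion
    power_two_tailed = 1 - norm.cdf(1.960 - observed_z)   # p < .05 two-tailed criterion
    print(f"one-tailed: {power_one_tailed:.2f}, two-tailed: {power_two_tailed:.2f}")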

After several years of planning, collecting data, and analyzing the data, the results were reported.  Not a single lab had produced a significant result. More importantly, even a combined analysis of data from close to 2,000 participants showed no effect.  The effect size was close to zero.   In other words, there was no evidence that facial feedback had any influence on ratings of amusement in response to cartoons.  This is what researchers call an epic fail.  This was not a case of a single small replication merely producing a weaker, non-significant effect size estimate; the effect just doesn’t appear to be there at all, although even with 2,000 participants it is not possible to say that the effect is exactly zero.  The results leave open the possibility that a very small effect exists, but an even larger sample would be needed to test this hypothesis. At the same time, the results are not inconsistent with the original results, because the original study had so much noise that the population effect size could have been close to zero.
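
How small is "very small"? A rough calculation (my own, assuming roughly 1,000 participants per condition and a standardized mean difference as the effect size metric) shows that a 95% confidence interval around a pooled estimate of zero still extends to about plus or minus 0.09, so true effects below that magnitude cannot be ruled out.

    # Approximate half-width of a 95% confidence interval for Cohen's d
    # with about 1,000 participants per condition (illustrative numbers only).
    from scipy.stats import norm

    n_per_group = 1000
    se_d = (2 / n_per_group) ** 0.5               # approximate standard error of d near d = 0
    half_width = norm.ppf(0.975) * se_d
    print(f"95% CI half-width for d: +/- {half_width:.3f}")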

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PART III: RESPONSE TO THE REPLICATION FAILURE

We might think that Strack was devastated by the failure to replicate the most famous result of his research career.  However, he seems rather unmoved by these results.

Quote: Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.

This is a bit odd, because earlier Strack assured us that he is not religious and trusts the scientific method: “I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’ ” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.”   So here we have two original studies with weak evidence for an effect and 17 studies with no evidence for the effect. If we combine the information from all 19 studies, we have no evidence for an effect. To believe in an effect even though 19 studies fail to provide scientific evidence for it seems a bit religious, although I would make a distinction between truly religious individuals, who know that they believe in something, and wanna-be scientists, who believe that they know something.  How does Strack justify his belief in an effect that just failed to replicate?  He refers to an article (a take-down of the replication project) that he wrote himself and that, according to his own account, shows fundamental problems with the idea that failed replication studies provide meaningful information.  Apparently, only original studies provide meaningful information, and when replication studies fail to replicate the results of original studies, there must be a problem with the replication studies.

Quote: “Two years ago, while the replication of his work was underway, Strack wrote a takedown of the skeptics’ project with the social psychologist Wolfgang Stroebe. Their piece, called “The Alleged Crisis and the Illusion of Exact Replication,” argued that efforts like the RRR reflect an “epistemological misunderstanding,”

Accordingly, Bem (2011) did successfully demonstrate that humans (at least extraverted humans) can predict random events in the future and that learning after an exam can retroactively improve performance on the completed exam.  The fact that replication studies failed to reproduce these results merely reflects the epistemological misunderstanding that we can learn anything from skeptics’ replication studies.  So what exactly is the problem with replication studies?

Quote: “Since it’s impossible to make a perfect copy of an old experiment. People change, times change, and cultures change, they said. No social psychologist ever steps in the same river twice. Even if a study could be reproduced, they added, a negative result wouldn’t be that interesting, because it wouldn’t explain why the replication didn’t work.”

We cannot reproduce exactly the same conditions of the original experiment.  But why is that important?  The same paradigm was allegedly used to reduce prejudice and cure depression, in studies that were wildly different from the original ones.  It worked even then. So why did it not work when the original study was replicated as closely as possible?  And why, in 2016, would we care about a study that worked (marginally) for 92 undergraduate students at the University of Illinois in the 1980s?  We don’t.  For humans in 2016, the results of a study conducted in 2015 are more relevant. Maybe it worked back then, maybe it didn’t. We will never know, but we do now know that it typically did not work in 2015.  Maybe it will work again in 2017. Who knows. But we cannot claim that the facial feedback theory has enjoyed good empirical support ever since Darwin came up with it.

But Strack goes further.  When he looks at the results of the replication studies, he does not see what the authors of the replication studies see. 

Quote: “So when Strack looks at the recent data he sees not a total failure but a set of mixed results.”

All 17 studies find no effect, and all studies are consistent with the hypothesis that there is no effect; every 95% confidence interval includes 0, which is also true for Strack’s original two studies.  How can somebody see mixed results in this consistent pattern of results?

Quote:  Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed?

He simply divides the studies post hoc into those that produced a positive result and those that produced a negative one. There is no justification for this, because none of the individual studies differ significantly from each other and the overall test shows no heterogeneity; that is, the results are consistent with the hypothesis that the true population effect size is 0 and that all of the variability in effects across studies is just the random noise that is expected from studies with modest sample sizes.
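
For readers who want to check this kind of claim themselves, the standard tool is Cochran's Q test (and the I² statistic) for heterogeneity across studies. The sketch below uses made-up per-lab effect sizes and standard errors purely for illustration; the actual lab-level estimates are reported in the Registered Replication Report.

    # Cochran's Q test for heterogeneity across labs; the effect sizes (d) and
    # standard errors below are invented for illustration, not the RRR data.
    import numpy as np
    from scipy.stats import chi2

    d  = np.array([0.10, -0.08, 0.05, -0.12, 0.02, -0.03, 0.07, -0.05, 0.01])
    se = np.array([0.15,  0.14, 0.16,  0.15, 0.14,  0.15, 0.16,  0.14, 0.15])

    w = 1 / se ** 2                               # inverse-variance weights
    d_pooled = np.sum(w * d) / np.sum(w)          # fixed-effect pooled estimate
    Q = np.sum(w * (d - d_pooled) ** 2)           # Cochran's Q
    df = len(d) - 1
    p_heterogeneity = chi2.sf(Q, df)
    I2 = max(0.0, (Q - df) / Q) * 100             # percent of variance beyond sampling error
    print(f"pooled d={d_pooled:.3f}, Q={Q:.2f}, p={p_heterogeneity:.3f}, I2={I2:.1f}%")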

Quote: “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. How could he turn his back on all that evidence?”

And with this final quote, Strack leaves the realm of scientific discourse and proper interpretation of empirical facts.  He is willing to disregard the results of a scientific test of the facial feedback hypothesis that he initially agreed to.  It is now clear why he agreed to it: he never considered it a real test of his theory. No matter what the results were, he would maintain his belief in his couple of marginally significant results, which are themselves statistically improbable.  Social psychologists have, of course, studied how humans respond to negative information that challenges their self-esteem and world views.  Unlike facial feedback, those results are robust and not surprising.  Humans are prone to dismiss inconvenient evidence and to construct sometimes ridiculous arguments in order to prop up cherished false beliefs.   As such, Strack’s response to the failure of his most famous article is a successful demonstration that some findings in social psychology are replicable; it just so happens that Strack’s study is not one of them.

Strack comes up with several objections to the replication studies that show his ignorance about the whole project.  For example, he claims that many participants may have guessed the purpose of the study because it is now a textbook finding.  However, the researchers who conducted the replication studies made sure that the study was run before the finding was covered in class, and some universities do not cover it at all. Moreover, just as in Laird’s studies, participants who guessed the purpose were excluded.  Far more participants were excluded because they did not hold the pen properly. Of course, this should strengthen the effect, because the manipulation should not work when the wrong facial muscles are activated.

Strack even claims that the whole project lacked a research question.

Quote: “Strack had one more concern: “What I really find very deplorable is that this entire replication thing doesn’t have a research question.” It does “not have a specific hypothesis, so it’s very difficult to draw any conclusions,” he told me.”

This makes no sense. Participants were randomly allocated to two conditions and a dependent variable was measured.  The hypothesis was that holding the pen in a way that elicits a smile leads to higher ratings of amusement than holding the pen in a way that produces a frown.  The empirical question was whether this manipulation would have an effect, and this was assessed with a standard test of statistical significance.  The answer was that there was no evidence for the effect.   The research question was the same as in the original study. If this is not a research question, then the original study also had no research question.

And finally, Strack makes the unscientific claim that it simply cannot be true that the reported studies all got it wrong.

Quote: The RRR provides no coherent argument, he said, against the vast array of research, conducted over several decades, that supports his original conclusion. “You cannot say these [earlier] studies are all p-hacked,” Strack continued, referring to the battery of ways in which scientists can nudge statistics so they work out in their favor. “You have to look at them and argue why they did not get it right.”

Scientific journals select studies that produced significant results. As a result, all prior studies were published because they produced a significant (or at least marginally significant) result.  Given this selection for significance, there is no error control.  The number of successful replications in the published literature tells us nothing about the truth of a finding.  We do not have to claim that all studies were p-hacked. We can simply say that all published studies were selected for significance, which is true and well known.  As a result, we do not know which results will replicate until we have conducted replication studies that do not select for significance. This is what the RRR did. It provides the first unbiased, real empirical test of the facial feedback hypothesis, and it failed. That is science. Ignoring it is not.
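
A small simulation illustrates the point (the study count and sample sizes are arbitrary choices of mine). Even when the true effect is exactly zero, a literature that only publishes significant results in the predicted direction will show a 100% success rate in print.

    # Simulated publication bias: only significant results in the predicted direction
    # get "published", even though the true effect is exactly zero.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    n_studies, n_per_group = 200, 30              # arbitrary illustrative choices
    published = 0
    for _ in range(n_studies):
        smile = rng.normal(0, 1, n_per_group)     # no true difference between conditions
        frown = rng.normal(0, 1, n_per_group)
        t, p = ttest_ind(smile, frown)
        if p < .05 and t > 0:                     # only positive, significant studies are written up
            published += 1

    print(f"{published} of {n_studies} studies 'published'; success rate in print: 100%")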

Daniel Engber’s closer inspection of the original article reveals further problems.

Quote: For the second version, Strack added a new twist. Now the students would have to answer two questions instead of one: First, how funny was the cartoon, and second, how amused did it make them feel? This was meant to help them separate their objective judgments of the cartoons’ humor from their emotional reactions. When the students answered the first question—“how funny is it?,” the same one that was used for Study 1—it looked as though the effect had disappeared. Now the frowners gave the higher ratings, by 0.17 points. If the facial feedback worked, it was only on the second question, “how amused do you feel?” There, the smilers scored a full point higher. (For the RRR, Wagenmakers and the others paired this latter question with the setup from the first experiment.) In effect, Strack had turned up evidence that directly contradicted the earlier result: Using the same pen-in-mouth routine, and asking the same question of the students, he’d arrived at the opposite answer. Wasn’t that a failed replication, or something like it?”

Strack dismisses this concern as well, but Daniel Engber is not convinced.

Quote:  “Strack didn’t think so. The paper that he wrote with Martin called it a success: “Study 1’s findings … were replicated in Study 2.”… That made sense, sort of. But with the benefit of hindsight—or one could say, its bias—Study 2 looks like a warning sign. This foundational study in psychology contained at least some hairline cracks. It hinted at its own instability. Why didn’t someone notice?

And nobody else should be convinced.  Fritz Strack is a prototypical example of a small group of social psychologists who have ruined social psychology by playing a game of publishing results that were consistent with theories of strong and powerful effects of stimuli on people’s behavior outside their awareness.  These results were attention-grabbing, just as annual returns of 20% would be eye-catching.  Many people invested in these claims on the basis of flimsy evidence that does not even withstand scrutiny by a science journalist.  And to be clear, only a few of them went as far as to fabricate data. But many others fabricated facts by publishing only the studies that supported their claims while hiding the evidence from studies that failed to show the effect.  Now we see what happens when these claims are subjected to real empirical tests that can succeed or fail: many of them fail.  For future generations, it is not important why they did what they did or how they feel about it now. What is important is that we realize that many results in textbooks are not based on solid evidence, and that social psychology needs to change the way it conducts research if it wants to become a real science that builds on empirically verifiable facts.  Strack’s response to the RRR is what it is: a defensive reaction to evidence that his famous article was based on a false positive result.