All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

‘Before you know it’ by John A. Bargh: A quantitative book review

November 28, Open Draft/Preprint (Version 1.0)
[Please provide comments and suggestions]

In this blog post I present a quantitative review of John A. Bargh’s book “Before you know it: The unconscious reasons we do what we do.”  A quantitative book review is different from a traditional book review.  The goal of a quantitative review is to examine the strength of the scientific evidence that is provided to support ideas in the book.  Readers of a popular science book written by an eminent scientist expect that these ideas are based on solid scientific evidence.  However, the strength of scientific evidence in psychology, especially social psychology, has been questioned.  I use statistical methods to examine how strong the evidence actually is.

One problem in psychological publishing is a bias in favor of studies that support theories, so-called publication bias.  The reason for publication bias is that scientific journals can publish only a fraction of the results that scientists produce.  This leads to heavy competition among scientists to produce publishable results, and journals like to publish statistically significant results; that is, studies that provide evidence for an effect (e.g., “eating green jelly beans cures cancer” rather than “eating red jelly beans does not cure cancer”).  Statisticians have pointed out that publication bias undermines the meaning of statistical significance, just like counting only hits would undermine the meaning of batting averages. Everybody would have an incredible batting average of 1.00.

For a long time it was assumed that publication bias is just a minor problem. Maybe researchers conducted 10 studies and reported only the 8 significant results while not reporting the remaining two studies that did not produce a significant result.  However, in the past five years it has become apparent that publication bias, at least in some areas of the social sciences, is much more severe, and that there are more unpublished studies with non-significant results than published studies with significant results.

In 2012, Daniel Kahneman raised doubts about the credibility of priming research in an open email letter addressed to John A. Bargh, the author of “Before you know it.”  Daniel Kahneman is a big name in psychology; he won a Nobel Prize for economics in 2002.  He also wrote a popular book that features John Bargh’s priming research (see review of Chapter 4).  Kahneman wrote “As all of you know, of course, questions have been raised about the robustness of priming results…. your field is now the poster child for doubts about the integrity of psychological research.”

Kahneman is not an outright critic of priming research. In fact, he was concerned about the future of priming research and made some suggestions for how Bargh and colleagues could alleviate doubts about the replicability of priming results.  He wrote:

“To deal effectively with the doubts you should acknowledge their existence and confront them straight on, because a posture of defiant denial is self-defeating. Specifically, I believe that you should have an association, with a board that might include prominent social psychologists from other fields. The first mission of the board would be to organize an effort to examine the replicability of priming results.”

However, prominent priming researchers have been reluctant to replicate their old studies.  At the same time, other scientists have conducted replication studies and failed to replicate classic findings. One example is Ap Dijksterhuis’s claim that showing words related to intelligence before taking a test can increase test performance.  Shanks and colleagues tried to replicate this finding in 9 studies and came up empty in all 9 studies. More recently, a team of over 100 scientists conducted 24 replication studies of Dijksterhuis’s professor priming study.  Only 1 study successfully replicated the original finding, but with a 5% error rate, 1 out of 20 studies is expected to produce a statistically significant result by chance alone.  This result validates Shanks’ failures to replicate and strongly suggests that the original result was a statistical fluke (i.e., a false positive result).

Proponents of priming research like Dijksterhuis “argue that social-priming results are hard to replicate because the slightest change in conditions can affect the outcome” (Abbott, 2013, Nature News).  Many psychologists consider this response inadequate.  The hallmark of a good theory is that it predicts the outcome of a good experiment.  If the outcome depends on unknown factors and replication attempts fail more often than not, a scientific theory lacks empirical support.  For example, Kahneman wrote in an email that the apparent “refusal to engage in a legitimate scientific conversation … invites the interpretation that the believers are afraid of the outcome” (Abbott, 2013, Nature News).

It is virtually impossible to check on all original findings by conducting extensive and expensive replication studies.  Moreover, proponents of priming research can always find problems with actual replication studies to dismiss replication failures.  Fortunately, there is another way to examine the replicability of priming research. This alternative approach, z-curve, uses a statistical model to estimate replicability based on the results reported in the original studies.  Most important, this approach examines how replicable and credible the original findings were based on the results reported in the original articles.  Therefore, original researchers cannot use inadequate methods or slight variations in contextual factors to dismiss replication failures. Z-curve can reveal that the original evidence was not as strong as dozens of published studies may suggest because it takes into account that published studies were selected to provide evidence for priming effects.

My colleagues and I used z-curve to estimate the average replicability of priming studies that were cited in Kahneman’s chapter on priming research.  We found that the average probability of a successful replication was only 14%. Given the small number of studies (k = 31), this estimate is not very precise. It could be higher, but it could also be even lower. This estimate would imply that for each published significant result, there are  9 unpublished non-significant results that were omitted due to publication bias. Given these results, the published significant results provide only weak empirical support for theoretical claims about priming effects.  In a response to our blog post, Kahneman agreed (“What the blog gets absolutely right is that I placed too much faith in underpowered studies”).

Our analysis of Kahneman’s chapter on priming provided a blueprint for this quantitative book review of Bargh’s book “Before you know it.”  I first checked the notes for sources and then linked the sources to the corresponding references in the reference section.  If the reference was an original research article, I downloaded the article and looked for the most critical statistical test of a study. If an article contained multiple studies, I chose one test from each study.  I found 168 usable original articles that reported a total of 400 studies.  I then converted all test statistics into absolute z-scores and analyzed them with z-curve to estimate replicability (see Excel file for coding of studies).
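As an illustration of this coding step, the sketch below converts a single test statistic into an absolute z-score via its two-sided p-value. The helper name and the example t-test are my own, not taken from the coding file; the actual coding was done in the spreadsheet linked above.

```python
# Minimal sketch of converting a test statistic into an absolute z-score.
from scipy import stats

def to_absolute_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    return stats.norm.isf(p_two_sided / 2)

# Example: t(38) = 2.50 from a two-group study with 20 participants per cell
p = 2 * stats.t.sf(2.50, df=38)       # two-sided p-value of the t-test
print(round(to_absolute_z(p), 2))     # ~2.39, slightly below 2.50 because t has heavier tails
```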

Figure 1 shows the distribution of absolute z-scores.  90% of the test statistics were statistically significant (z > 1.96) and 99% were at least marginally significant (z > 1.65), meaning they passed a less stringent statistical criterion for claiming a success.  This is not surprising because supporting evidence requires statistical significance. The more important question is how many studies would produce a statistically significant result again if all 400 studies were replicated exactly.  The estimated success rate in Figure 1 is less than half (41%). Although there is some uncertainty around this estimate, the upper limit of the 95% confidence interval just reaches 50%, suggesting that the true value is below 50%.  There is no clear criterion for inadequate replicability, but Tversky and Kahneman (1971) suggested a minimum of 50%.  Professors are also used to giving students who score below 50% on a test an F.  I therefore decided to use the grading scheme at my university as a grading scheme for replicability scores.  By this standard, the overall grade for the replicability of the studies cited by Bargh to support the ideas in his book is an F.

 

[Figure 1]
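The proportions quoted above are simple threshold counts over the coded absolute z-scores; a minimal sketch (the short array below is a stand-in for the 400 coded values, not the real data):

```python
# Share of significant and marginally significant results among coded z-scores.
import numpy as np

z = np.array([2.3, 1.7, 4.1, 0.9, 2.8])   # placeholder values, not the actual coding data
print((z > 1.96).mean())                  # proportion statistically significant
print((z > 1.65).mean())                  # proportion at least marginally significant
```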

This being said, 41% replicability is a lot more than we would expect by chance alone, namely 5%.  Clearly some of the results mentioned in the book are replicable. The question is which findings are replicable and which ones are difficult to replicate or even false positive results.  The problem with 41% replicable results is that we do not know which results we can trust. Imagine you are interviewing 100 eyewitnesses and only 41 of them are reliable. Would you be able to identify a suspect?

It is also possible to analyze subsets of studies. Figure 2 shows the results of all experimental studies that randomly assigned participants to two or more conditions.  If a manipulation has an effect, it produces mean differences between the groups. Social psychologists like these studies because they allow for strong causal inferences and make it possible to disguise the purpose of a study.  Unfortunately, this design requires large samples to produce replicable results and social psychologists often used rather small samples in the past (the rule of thumb was 20 per group).  As Figure 2 shows, the replicability of these studies is lower than the replicability of all studies.  The average replicability is only 24%.  This means for every significant result there are at least three non-significant results that have not been reported due to the pervasive influence of publication bias.

[Figure 2]

If 24% doesn’t sound bad enough, it is important to realize that this estimate assumes that the original studies can be replicated exactly.  However, social psychologists have pointed out that even minor differences between studies can lead to replication failures.  Thus, the success rate of actual replication studies is likely to be even less than 24%.
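The file-drawer arithmetic behind the “three non-significant results for every significant result” statement follows directly from average power; a minimal sketch using the 24% estimate from Figure 2:

```python
# Expected number of unreported non-significant studies per published significant result,
# assuming the published studies were selected for significance and have 24% average power.
average_power = 0.24
file_drawer_ratio = (1 - average_power) / average_power
print(round(file_drawer_ratio, 1))   # ~3.2 non-significant results per significant result
```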

In conclusion, the statistical analysis of the evidence cited in Bargh’s book confirms concerns about the replicability of social psychological studies, especially experimental studies that compared mean differences between two groups in small samples. Readers of the book should be aware that the results reported in the book might not replicate in a new study under slightly different conditions and that numerous claims in the book are not supported by strong empirical evidence.

Replicability of Chapters

I also estimated the replicability separately for each of the 10 chapters to examine whether some chapters are based on stronger evidence than others. Table 1 shows the results. Seven chapters scored an F, two chapters scored a D, and one chapter earned a C-.   Although there is some variability across chapters, none of the chapters earned a high score, but some chapters may contain some studies with strong evidence.

Table 1. Chapter Report Card (estimated replicability in %, letter grade)

Chapter 1 28 F
Chapter 2 40 F
Chapter 3 13 F
Chapter 4 47 F
Chapter 5 50 D-
Chapter 6 57 D+
Chapter 7 24 F
Chapter 8 19 F
Chapter 9 31 F
Chapter 10 62 C-
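The mapping from replicability scores to letter grades can be reconstructed as follows; the cut-offs below are my assumption, inferred from the grades in Table 1 (e.g., 50 maps to D-, 57 to D+, 62 to C-), not an official specification of the university’s scheme.

```python
# Hypothetical reconstruction of the grading scheme behind the chapter report card.
# Cut-offs are assumptions inferred from Table 1, not an official grading table.
def replicability_grade(score):
    cutoffs = [(85, "A"), (80, "A-"), (77, "B+"), (73, "B"), (70, "B-"),
               (67, "C+"), (63, "C"), (60, "C-"), (57, "D+"), (53, "D"), (50, "D-")]
    for cutoff, grade in cutoffs:
        if score >= cutoff:
            return grade
    return "F"

for chapter, score in enumerate([28, 40, 13, 47, 50, 57, 24, 19, 31, 62], start=1):
    print(f"Chapter {chapter}: {score} {replicability_grade(score)}")   # reproduces Table 1
```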

Credible Findings in the Book

Unfortunately, it is difficult to determine the replicability of individual studies with high precision.  Nevertheless, studies with high z-scores are more replicable.  Particle physicists use a criterion value of z > 5 to minimize the risk that the result of a single study is a false positive.  I found that psychological studies with a z-score greater than 4 had an 80% chance of being replicated in actual replication studies.  Using this rule as a rough estimate of replicability, I was also able to identify credible claims in the book.  Highlighting these claims does not mean that the other claims are wrong. It simply means that they are not supported by strong evidence.
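To get a sense of how strict these thresholds are, the sketch below converts them into two-sided p-values (the 80% figure itself comes from the comparison with actual replication studies mentioned above, not from this calculation).

```python
# Two-sided p-values corresponding to the z-score thresholds discussed above.
from scipy import stats

for z in (1.96, 4.0, 5.0):
    p_two_sided = 2 * stats.norm.sf(z)   # survival function of the standard normal
    print(f"z = {z}: p = {p_two_sided:.2g}")
# z = 1.96: p ~ 0.05;  z = 4: p ~ 6.3e-05;  z = 5: p ~ 5.7e-07
```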

Chapter 1:    

According to Chapter 1, there seems “to be a connection between the strength of the unconscious physical safety motivation and a person’s political attitudes.”   The notes list a number of articles to support this claim.  The only conclusive evidence in these studies is that self-reported political attitudes (a measure of right-wing authoritarianism) are correlated with self-reported beliefs that the world is dangerous (Duckitt et al., JPSP, 2002, 2 studies, z = 5.42, 6.93).  A correlation between self-report measures is hardly evidence for unconscious physical safety motives.

Another claim is that “our biological mandate to reproduce can have surprising manifestations in today’s world.”   This claim is linked to a study that examined the influence of physical attractiveness on call-backs for a job interview.  In a large field experiment, researchers mailed resumes (N = 11,008) in response to real job ads and found that both men and women were more likely to be called for an interview if the application included a picture of a highly attractive applicant rather than a not so attractive applicant (Busetta et al., 2013, z = 19.53).  Although this is an interesting and important finding, it is not clear that the human resource offices’ preference for attractive applicants was driven by their “biological mandate to reproduce.”

Chapter 2: 

Chapter 2 introduces the idea that there is a fundamental connection between physical sensations and social relationships.  “… why today we still speak so easily of a warm friend, or a cold father. We always will. Because the connection between physical and social warmth, and between physical and social coldness, is hardwired into the human brain.”   Only one z-score surpassed the 4-sigma threshold.  This z-score comes from a brain imaging study that found increased sensorimotor activation in response to hand-washing products (soap) after participants had lied in a written email, but not after they had lied verbally (Schaefer et al., 2015, z = 4.65).  There are two problems with this supporting evidence.  First, z-scores in fMRI studies require a higher threshold than z-scores in other studies because brain imaging studies allow for multiple comparisons that increase the risk of a false positive result (Vul et al., 2009).  More important, even if this finding could be replicated, it does not provide support for the claim that these neurological connections are hard-wired into humans’ brains.

The second noteworthy claim in Chapter 2 is that infants “have a preference for their native language over other languages, even though they don’t yet understand a word.” This claim is not very controversial given ample evidence that humans prefer familiar over unfamiliar stimuli (Zajonc, 1968, also cited in the book).  However, it is not so easy to study infants’ preferences (after all, they are not able to tell us).  Developmental researchers use a visual attention task to infer preferences: if an infant looks longer at one of two stimuli, this indicates a preference for that stimulus. Kinzler et al. (PNAS, 2007) reported six studies. For five studies, z-scores ranged from 1.85 to 2.92, which is insufficient evidence to draw strong conclusions.  However, Study 6 provided convincing evidence (z = 4.61) that 5-year-old children in Boston preferred a native speaker to a child with a French accent. The effect was so strong that 8 children were sufficient to demonstrate it.  However, a study with 5-year-olds hardly provides evidence for infants’ preferences. In addition, the design of this study holds all other features constant. Thus, it is not clear how strong this effect is in the real world when many other factors can influence the choice of a friend.

Chapter 3

Chapter 3 introduces the concept of priming. “Primes are like reminders, whether we are aware of the reminding or not.”   It uses two examples to illustrate priming with and without awareness. One example implies that people can be aware of the primes that influenced their behavior: if you are in the airport, smell Cinnabon, and suddenly find yourself in front of the Cinnabon counter, you are likely to know that the smell made you think about Cinnabon and decide to eat one. The second example introduces the idea that primes can influence behavior without awareness: if you were cut off in traffic, you may respond with more hostility to a transgression by a co-worker without being aware that the earlier experience in traffic influenced your reaction.  The supporting references contain two noteworthy (z > 4) findings that show how priming can be used effectively in the form of reminders (Rogers & Milkman, 2016, Psychological Science, Study 2a, N = 920, z = 5.45, and Study 5, N = 305, z = 5.50). In Study 2a, online participants were presented with the following instruction:

“In this survey, you will have an opportunity to support a charitable organization called Gardens for Health that provides lasting agricultural solutions to address the problem of chronic childhood malnutrition. On the 12th page of this survey, please choose answer “A” for the last question on that page, no matter your opinion. The previous page is Page 1. You are now on Page 2. The next page is Page 3. The picture below will be on top of the NEXT button on the 12th page. This is intended to remind you to select answer “A” for the last question on that page. If you follow these directions, we will donate $0.30 to Gardens for Health.”

[Image: elephant reminder picture]

On pages 2-11 participants either saw distinct animals or other elephants.

[Image: animal pictures shown on pages 2-11]

Participants in the distinct animal condition were more likely to press the response that led to a donation than participants who saw a variety of elephants (z = 5.45).

Study 5 examined whether respondents would be willing to pay for a reminder.  They were offered 60 cents extra payment for responding with “E” to the last question.  They could either pay 3 cents to get an elephant reminder or not.  53% of participants were willing to pay for the reminder, which the authors compared to 0, z = 2 × 10^9.  This finding implies that participants are not only aware of the prime when they respond in the primed way, but are also aware of this link ahead of time and are willing to pay for it.

In short, Chapter 3 introduces the idea of unconscious or automatic priming, but the only solid evidence in the reference section supports the notion that we can also be consciously aware of priming effects and use them to our advantage.

Chapter 4

Chapter 4 introduces the concept of arousal transfer: the idea that arousal from a previous event can linger and influence how we react to another event.  The book reports in detail a famous experiment by Dutton and Aron (1974).

“In another famous demonstration of the same effect, men who had just crossed a rickety pedestrian bridge over a deep gorge were found to be more attracted to a woman they met while crossing that bridge. How do we know this? Because they were more likely to call that woman later on (she was one of the experimenters for the study and had given these men her number after they filled out a survey for her) than were those who met the same woman while crossing a much safer bridge. The men in this study reported that their decision to call the woman had nothing to do with their experience of crossing the scary bridge. But the experiment clearly showed they were wrong about that, because those in the scary-bridge group were more likely to call the woman than were those who had just crossed the safe bridge.”

First, it is important to correct the impression that men were asked about their reasons to call back.  The original article does not report any questions about motives.  This is the complete section in the results that mentions the call back.

“Female interviewer. In the experimental group, 18 of the 23 subjects who agreed to the interview accepted the interviewer’s phone number. In the control group, 16 out of 22 accepted (see Table 1). A second measure of sexual attraction was the number of subjects who called the interviewer. In the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ2 = 5.7, p < .02). Taken in conjunction with the sexual imagery data, this finding suggests that subjects in the experimental group were more attracted to the interviewer.”

A second concern is that the sample size was small and the evidence for the effect was not very strong: in the experimental group 9 out of 18 called, in the control group 2 out of 16 called (χ2 = 5.7, p < .02) [z = 2.4].
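As a sketch of how the bracketed z-score follows from the published chi-square value (for a test with 1 degree of freedom, |z| is simply the square root of χ2), the call-back counts can be entered into a 2 x 2 table; the small discrepancy with the published χ2 = 5.7 is expected because the original computation is not reported in detail.

```python
# Reconstructing the call-back analysis from the counts quoted above.
import numpy as np
from scipy import stats

calls = np.array([[9, 18 - 9],    # experimental group: 9 of 18 called
                  [2, 16 - 2]])   # control group: 2 of 16 called

chi2, p, dof, expected = stats.chi2_contingency(calls, correction=False)
print(round(chi2, 2), round(p, 3))   # ~5.44, p ~ .02 (article reports chi2 = 5.7, p < .02)
print(round(np.sqrt(5.7), 1))        # square root of the published value: ~2.4, the z-score
```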

Finally, the authors mention a possible confound in this field study.  It is possible that men who dared to cross the suspension bridge differ from men who crossed the safe bridge, and it has been shown that risk-taking men are more likely to engage in casual sex.  Study 3 addressed this problem with a less colorful, but more rigorous, experimental design.

Male students were led to believe that they were participants in a study on electric shock and learning.  An attractive female confederate (a student working with the experimenter but pretending to be a participant) was also present.  The study had four conditions: male participants were told that they would receive weak or strong shock, and they were told that the female confederate would receive weak or strong shock.  They were then asked to fill out a questionnaire before the study would start; in fact, the study ended after participants completed the questionnaire and they were told about the real purpose of the study.

The questionnaire contained two questions about the attractive female confederate. “How much would you like to kiss her?” and “How much would you like to ask her out on a date?”  Participants who were anticipating strong shock had much higher average ratings than those who anticipated weak shock, z = 4.46.

Although this is a strong finding, we also have a large literature on emotions and arousal that suggests frightening your date may not be the best way to get to second base (Reisenzein, 1983; Schimmack, 2005).  It is also not clear whether arousal transfer is a conscious or unconscious process. One study cited in the book found that exercise did not influence sexual arousal right away, presumably because participants attributed their increased heart rate to the exercise. This suggests that arousal transfer is not entirely an unconscious process.

Chapter 4 also brings up global warming.  An unusually warm winter day in Canada often makes people talk about global warming.  A series of studies examined the link between weather and beliefs about global warming more scientifically.  “What is fascinating (and sadly ironic) is how opinions regarding this issue fluctuate as a function of the very climate we’re arguing about. In general, what Weber and colleagues found was that when the current weather is hot, public opinion holds that global warming is occurring, and when the current weather is cold, public opinion is less concerned about global warming as a general threat. It is as if we use “local warming” as a proxy for “global warming.” Again, this shows how prone we are to believe that what we are experiencing right now in the present is how things always are, and always will be in the future. Our focus on the present dominates our judgments and reasoning, and we are unaware of the effects of our long-term and short-term past on what we are currently feeling and thinking.”

One of the four studies produced strong evidence (z = 7.05).  This study showed a correlation between respondents’ ratings of the current day’s temperature and their estimate of the percentage of above average warm days in the past year.  This result does not directly support the claim that we are more concerned about global warming on warm days for two reasons. First, response styles can produce spurious correlations between responses to similar questions on a questionnaire.  Second, it is not clear that participants attributed above average temperatures to global warming.

A third credible finding (z = 4.62) comes from another classic study (Ross & Sicoly, 1979, JPSP, Study 2a).  “You will have more memories of yourself doing something than of your spouse or housemate doing them because you are guaranteed to be there when you do the chores. This seems pretty obvious, but we all know how common those kinds of squabbles are, nonetheless. (“I am too the one who unloads the dishwasher! I remember doing it last week!”)”   In this study, 44 students participated in pairs. They were given separate pieces of information and exchanged information to come up with a joint answer to a set of questions.  Two days later, half of the participants were told that they had performed poorly, whereas the other half were told that they had performed well. In the success condition, participants were more likely to make self-attributions (i.e., take credit) than expected by chance.

Chapter 5

In Chapter 5, John Bargh tells us about work by his supervisor Robert Zajonc (1968).  “Bob was doing important work on the mere exposure effect, which is, basically, our tendency to like new things more, the more often we encounter them. In his studies, he repeatedly showed that we like them more just because they are shown to us more often, even if we don’t consciously remember seeing them.”  The classic 1968 article contains two studies with strong evidence (Study 2, z = 6.84; Study 3, z = 5.81).  Even though the sample sizes were small, this was not a problem because the studies presented many stimuli at different frequencies to all participants. This makes it easy to spot reliable patterns in the data.

[Figure: mere exposure results from Zajonc (1968)]

Chapter 5 also introduces the concept of affective priming.  Affective priming refers to the tendency to respond emotionally to a stimulus even if a task demands that we ignore it.  We simply cannot help feeling good or bad; we cannot turn our emotions off.  The experimental way to demonstrate this is to present an emotional stimulus quickly followed by a second emotional stimulus. Participants have to respond to the second stimulus and ignore the first stimulus.  It is easier to perform the task when the two stimuli have the same valence, suggesting that the valence of the first stimulus was processed even though participants had to ignore it.  Bargh et al. (1996, JESP) reported that this even happens when the task is simply to pronounce the second word (Study 1, z = 5.42; Study 2, z = 4.13; Study 3, z = 3.97).

The book does not inform readers that we have to distinguish two types of affective priming effects.  Affective priming is a robust finding when participants’ task is to report on the valence (is it good or bad) of the second stimulus following the prime.  However, this finding has been interpreted by some researchers as an interference effect, similar to the Stroop effect.  This explanation would not predict effects on a simple pronunciation task.  However, there are fewer studies with the pronunciation task, and some of these have failed to replicate Bargh et al.’s original findings, despite the strong evidence observed in their studies. First, Klauer and Musch (2001) failed to replicate Bargh et al.’s finding that affective priming influences pronunciation of target words in three studies with good statistical power. Second, De Houwer et al. (2001) were able to replicate it with degraded primes, but also failed to replicate the effect with the visible primes that were used by Bargh et al.  In conclusion, affective priming is a robust effect when participants have to report on the valence of the second stimulus, but this finding does not necessarily imply that primes unconsciously activate related content in memory.

Chapter 5 also reports some surprising associations between individuals’ names, or rather their initials, and the places they live, their professions, and their partners. These correlations are relatively small, but they are based on large datasets and very unlikely to be just statistical flukes (z-scores ranging from 4.65 to 49.44).  The causal process underlying these correlations is less clear.  One possible explanation is that we have unconscious preferences that influence our choices. However, experimental studies that tried to study this effect in the laboratory are less convincing.  Moreover, Hodson and Olson failed to find a similar effect across a variety of domains such as liking of animals (Alicia is not more likely to like ants than Samantha), foods, or leisure activities. They found a significant correlation for brand names (p = .007), but this finding requires replication.   More recently, Kooti, Magno, and Weber (2014) examined name effects on social media. They found significant effects for some brand comparisons (Sega vs. Nintendo), but not for others (Pepsi vs. Coke).  However, they found that twitter users were more likely to follow other twitter users with the same first name. Taken together, these results suggest that individuals’ names predict some choices, but it is not clear when or why this is the case.

The chapter ends with a not very convincing article (z = 2.39, z = 2.22) claiming that it is actually very easy to resist or override unwanted priming effects. According to this article, simply being told that somebody is a team member can make automatic prejudice go away.  If it were so easy to control unwanted feelings, it is not clear why racism is still a problem 50 years after the civil rights movement started.

In conclusion, Chapter 5 contains a mix of well-established findings with strong support (mere-exposure effects, affective priming) and several less supported ideas. One problem is that priming is sometimes presented as an unconscious process that is difficult to control, while at other times these effects seem to be easily controllable. The chapter does not illuminate under which conditions we should expect priming to influence our behavior in ways we do not notice or cannot control, and under which conditions we notice these effects and have the ability to control them.

Chapter 6

Chapter 6 deals with the thorny problem in psychological science that most theories make correct predictions sometimes. Even a broken clock tells the time right twice a day. The problem is to know in which context a theory makes correct predictions and when it does not.

“Entire books—bestsellers—have appeared in recent years that seem to give completely conflicting advice on this question: can we trust our intuitions (Blink, by Malcolm Gladwell), or not (Thinking, Fast and Slow, by Daniel Kahneman)? The answer lies in between. There are times when you can and should, and times when you can’t and shouldn’t [trust your gut].”

Bargh then proceeds to make 8 evidence-based recommendations for when it is advantageous to rely on intuition without effortful deliberation (gut feelings).

Rule #1: supplement your gut impulse with at least a little conscious reflection, if you have the time to do so.

Rule # 2: when you don’t have the time to think about it, don’t take big chances for small gains going on your gut alone.

Rule #3: when you are faced with a complex decision involving many factors, and especially when you don’t have objective measurements (reliable data) of those important factors, take your gut feelings seriously.

Rule #4: be careful what you wish for, because your current goals and needs will color what you want and like in the present.

Rule #5: when our initial gut reaction to a person of a different race or ethnic group is negative, we should stifle it.

Rule #6: we should not trust our appraisals of others based on their faces alone, or on photographs, before we’ve had any interaction with them.

Rule #7: (it may be the most important one of all): You can trust your gut about other people—but only after you have seen them in action.

Rule #8: it is perfectly fine for attraction to be one part of the romantic equation, but not so fine to let it be the only, or even the main, thing.

Unfortunately, the credible evidence in this chapter (z > 4) is only vaguely related to these rules and insufficient to claim that these rules are based on solid scientific evidence.

Morewedge and Norton (2009) provide strong evidence that people in different cultures (US z = 4.52, South Korea z = 7.18, India z = 6.78) believe that dreams provide meaningful information about themselves.   Study 3 used a hypothetical scenario to examine whether people would change their behavior in response to a dream.  Participants were more likely to say that they would change a flight after dreaming about a plane crash the night before the flight than after thinking about a plane crash the evening before, and dreams influenced behavior about as much as hearing about an actual plane crash (z = 10.13).   In a related article, Morewedge and colleagues (2014) asked participants to rate types of thoughts (e.g., dreams, problem solving, etc.) in terms of spontaneity or deliberation. A second rating asked about the extent to which the type of thought would generate self-insight or is merely a reflection of the current situation.  They found that spontaneous thoughts were considered to generate more self-insight (Study 1, z = 5.32; Study 2, z = 5.80).   In Study 5, they also found that more spontaneous recollection of a recent positive or negative experience with a romantic partner predicted hypothetical behavioral intention ratings (“To what extent might recalling the experience affect your likelihood of ending the relationship, if it came to mind when you tried to remember it”) (z = 4.06). These studies suggest that people find spontaneous, non-deliberate thoughts meaningful and that they are willing to use them in decision making.  The studies do not tell us under which circumstances listening to dreams and other spontaneous thoughts (gut feelings) is beneficial.

Inbar, Cone, and Gilovich (2010) created a set of 25 choice problems (e.g., choosing an entree, choosing a college).  They found that “the more a choice was seen as objectively evaluable, the more a rational approach was seen as the appropriate choice strategy” (Study 1a, z = 5.95).  In a related study, they found that “the more participants thought the decision encouraged sequential rather than holistic processing, the more they thought it should be based on rational analysis” (Study 1b, z = 5.02).   These studies provide some insight into people’s beliefs about optimal decision rules, but they do not tell us whether people’s beliefs are right or wrong, which would require examining people’s actual satisfaction with their choices.

Frederick (2005) examined personality differences in the processing of simple problems (e.g., A bat and a ball cost $1.10. The bat costs $1.00 more than the ball. How much does the ball cost?).  The quick answer is 10 cents, but the correct answer is 5 cents (if the ball costs x, then x + (x + $1.00) = $1.10, so x = $0.05).  In this case, the gut response is false.  A sample of over 3,000 participants answered several similar questions. Participants who performed above average were more willing to delay gratification (get $3,800 in a month rather than $3,400 now) than participants with below average performance (z > 5).  If we consider the bigger reward a better choice, these results imply that it is not good to rely on gut responses when it is possible to use deliberation to get the right answer.

Two studies by Wilson and Schooler (1991) are used to support the claim that we can overthink choices.

“In their first study, they had participants judge the quality of different brands of jam, then compared their ratings with those of experts. They found that the participants who were asked to spend time consciously analyzing the jam had preferences that differed further from those of the experts, compared to those who responded with just the “gut” of their taste buds.”  The evidence in this study with a small sample is not very strong and requires replication  (N = 49, z = 2.36).

“In Wilson and Schooler’s second study, they interviewed hundreds of college students about the quality of a class. Once again, those who were asked to think for a moment about their decisions were further from the experts’ judgments than were those who just went with their initial feelings.”

[Figure: results from Wilson & Schooler (1991), Study 2]

The description in the book does not match the actual study.  There were three conditions.  In the control condition, participants were asked to read the information about the courses carefully.  In the reasons condition, participants were asked to write down their reasons, and in the rate-all condition participants were asked to rate all pieces of information, no matter how important, in terms of their effect on their choices. The study showed that considering all pieces of information increased the likelihood of choosing a poorly rated course (a bad choice), but had a much smaller effect on ratings of highly rated courses (z = 4.14 for the interaction effect).  All conditions asked for some reflection, and it remains unclear how students would have responded if they had gone with their initial feelings, as described in the book.  Nevertheless, the study suggests that good choices require focusing on important factors and that paying attention to trivial factors can lead to suboptimal choices.  For example, real estate agents in hot markets use interior design to drive up prices even though the design is not part of the sale.

“We are born sensitive to violations of fair treatment and with the ability to detect those who are causing harm to others, and assign blame and responsibility to them. Recent research has shown that even children three to five years old are quite sensitive to fairness in social exchanges. They preferred to throw an extra prize (an eraser) away than to give more to one child than another—even when that extra prize could have gone to themselves.” This is not an accurate description of the studies.  Study 1 (z > 5) found that 6- to 8-year-old children preferred to give 2 erasers to one kid and 2 erasers to another kid and to throw the fifth eraser away to maintain equality (20 out of 20, p < .0001).  However, “the 3- to 5-year-olds showed no preference to throw a resource away (14 out of 24, p = .54)” (p. 386).  Subsequent studies used only 6- to 8-year-old children. Study 4 examined how children would respond if erasers were divided between themselves and another kid: 17 out of 20 (p = .003, z = 2.97) preferred to throw the eraser away rather than getting one more for themselves.  However, in a related article, Shaw and Olson (2012b) found that children preferred favoritism (getting more erasers) when receiving more erasers was introduced as winning a contest (Study 2, z = 4.65). These studies are quite interesting, but they do not support the claim that equality norms are inborn, nor do they help us to figure out when we should or should not listen to our gut or whether it is better for us to be equitable or selfish.

The last, but in my opinion most interesting and relevant, piece of evidence in Chapter 6 is a large (N = 16,624) survey study of relationship satisfaction (Cacioppo et al., 2013, PNAS, z = 6.58).   Respondents reported their relationship satisfaction and how they had met their partner.   Respondents who had met their partner online were slightly more satisfied than respondents who had met their partner offline.  There were also differences between different types of meeting offline.  Respondents who met their partner in a bar had one of the lowest average levels of satisfaction.  The study did not reveal why online dating is slightly more successful, but both forms of dating probably involve a combination of deliberation and “gut” reactions.

In conclusion, Chapter 6 provides some interesting insights into the way people make choices. However, the evidence does not provide a scientific foundation for recommendations about when it is better to follow your instinct and when it is better to rely on logical reasoning and deliberation.  Either the evidence of the reviewed studies is too weak or the studies do not use actual choice outcomes as the outcome variable. The comparison of online and offline dating is a notable exception.

Chapter 7

Chapter 7 uses an impressive field experiment to support the idea that “our mental representations of concepts such as politeness and rudeness, as well as innumerable other behaviors such as aggression and substance abuse, become activated by our direct perception of these forms of social behavior and emotion, and in this way are contagious.”   Keizer et al. (2008) conducted the study in an alley in Groningen, a city in the Netherlands.  In one condition, bikes were parked in front of a wall with graffiti, despite an anti-graffiti sign.  In the other condition, the wall was clean.  Researchers attached fliers to the bikes and recorded how many users would simply throw the fliers on the ground.  They recorded the behaviors of 77 bike riders in each condition. In the graffiti condition, 69% of riders littered. In the clean condition, only 33% of riders littered (z = 4.51).

[Image: Keizer et al. (2008), Study 1]

In Study 2, the researchers put up a fence in front of the entrance to a car park that required car owners to walk an extra 200m to get to their car, but they left a gap that allowed car owners to avoid the detour.  There was also a sign that forbade locking bikes to the fence.  In the control condition, no bikes were locked to the fence. In the experimental condition, the norm was violated and four bikes were locked to the fence.  41 car owners’ behaviors were observed in each condition.  In the experimental condition, 82% of car owners stepped through the gap. In the control condition, only 27% of car owners stepped through the gap (z = 5.27).
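As a rough check, the reported z-scores can be approximately reproduced from the percentages and group sizes quoted above with a simple two-proportion z-test; small discrepancies are expected because the exact counts and the test used in the original article are not reported here.

```python
# Approximate reconstruction of the reported z-scores from percentages and group sizes.
from math import sqrt

def two_proportion_z(p1, p2, n1, n2):
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)              # pooled proportion under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

print(round(two_proportion_z(0.69, 0.33, 77, 77), 2))  # Study 1: ~4.47 (reported z = 4.51)
print(round(two_proportion_z(0.82, 0.27, 41, 41), 2))  # Study 2: ~5.00 (reported z = 5.27)
```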

[Image: Keizer et al. (2008), Study 2]

It is unlikely that bike riders or car owners in these studies consciously processed the graffiti or the locked bikes.  Thus, these studies support the hypothesis that our environment can influence behavior in subtle ways without our awareness.  Moreover, these studies show these effects with real-world behavior.

Another noteworthy study in Chapter 7 examined happiness in social networks (Fowler & Christakis, 2008).   The authors used data from the Framingham Heart Study, a unique study in which most inhabitants of the town of Framingham participated.   Researchers collected many measures, including a measure of happiness. They also mapped social relationships among the participants.  Fowler and Christakis used sophisticated statistical methods to examine whether people who were connected in the social network (e.g., spouses, friends, neighbors) had similar levels of happiness. They did (z = 9.09).  I may be more likely to believe these findings because I have found the same pattern in my own research on married couples (Schimmack & Lucas, 2010).  Spouses are not only more similar to each other at one moment in time, they also change in the same direction over time.  However, the causal mechanism underlying this effect is more elusive.  Maybe happiness is contagious and can spread through social networks like a disease. However, it is also possible that related members of social networks are exposed to similar environments.  For example, spouses share a common household income, and money buys some happiness.  It is even less clear whether these effects occur outside of people’s awareness or not.

Chapter 7 ends with the positive message that a single person can change the world because his or her actions influence many people. “The effect of just one act, multiplies and spreads to influence many other people. A single drop becomes a wave.”  This rosy conclusion overlooks that the impact of one person decreases exponentially as it spreads over social networks. If you are kind to a neighbor, the neighbor may be slightly more likely to be kind to the pizza delivery man, but your effect on the pizza delivery man is already barely noticeable.  This may be a good thing when it comes to the spreading of negative behaviors.  Even if the friend of a friend is engaging in immoral behaviors, it does not mean that you are more likely to commit a crime. To really change society it is important to change social norms and increase individuals’ reliance on these norms even when situational influences tell them otherwise.   The more people have a strong norm not to litter, the less it matters whether there are graffiti on the wall or not.

Chapter 8

Chapter 8 examines dishonesty and suggests that dishonesty is a general human tendency. “When the goal of achievement and high performance is active, people are more likely to bend the rules in ways they’d normally consider dishonest and immoral, if doing so helps them attain their performance goal.”

Of course, not all people cheat in all situations even if they think they can get away with it.  So, the interesting scientific question is who will be dishonest in which context?

Mazar et al. (2008) examined situational effects on dishonesty.  In Study 2 (z = 4.33) students were given an opportunity to cheat in order to receive a higher reward. The study had three conditions: a control condition that did not allow students to cheat, a cheating condition, and a cheating condition with an honor pledge.  In the honor pledge condition, the test started with the sentence “I understand that this short survey falls under MIT’s [Yale’s] honor system.”   This manipulation eliminated cheating.  However, even in the cheating condition “participants cheated only 13.5% of the possible average magnitude.”  Thus, MIT/Yale students are rather honest or the incentive was too small to tempt them (an extra $2).  Study 3 found that students were more likely to cheat if they were rewarded with tokens rather than money, even though they could later exchange the tokens for money.  The authors suggest that cheating merely for tokens rather than real money made it seem less like “real” cheating (z = 6.72).

Serious immoral acts cannot be studied experimentally in a psychology laboratory.  Therefore, research on this topic has to rely on self-reports and correlations. Pryor (1987) developed a questionnaire to study “Sexual Harassment Proclivities in Men.”  The questionnaire asks men to imagine being in a position of power and to indicate whether they would take advantage of their power to obtain sexual favors if they knew they could get away with it.  To validate the scale, Pryor showed that it correlated with a scale that measures how much men buy into rape myths (r = .40, z = 4.47).   Self-reports on these measures have to be taken with a grain of salt, but the results suggest that some men are willing to admit that they would abuse power to gain sexual favors, at least in anonymous questionnaires.

Another noteworthy study found that even prisoners are not always dishonest. Cohn et al. (2015) used a gambling task to study dishonesty in 182 prisoners in a maximum security prison.  Participants were given the opportunity to flip 10 coins and to keep all coins that showed heads.  Importantly, the coin toss was not observed.  As it is possible, although unlikely, that all 10 coins show heads by chance, inmates could keep all coins and hide behind chance.  The randomness of the outcome makes it impossible to accuse a particular prisoner of dishonesty.  Nevertheless, the task makes it possible to measure dishonesty of the group (collective dishonesty) because the percentage of coin tosses reported as heads should be close to chance (50%). If it is significantly higher than chance, it shows that some prisoners were dishonest. On average, prisoners reported 60% heads, which reveals some dishonesty, but even convicted criminals were more likely to respond honestly than not (the percentage increased from 60% to 66% when they were primed with their criminal identity, z = 2.39).
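The collective-dishonesty logic can be sketched with a simple binomial test; the calculation below assumes that each of the 182 prisoners reported 10 coin tosses (roughly 1,820 tosses in total), which may differ slightly from the exact numbers in the paper.

```python
# Can 60% reported heads arise by chance if all coins are fair?
from scipy import stats

n_tosses = 182 * 10                       # 182 prisoners, 10 coins each (approximation)
reported_heads = round(0.60 * n_tosses)   # 60% heads reported on average
result = stats.binomtest(reported_heads, n_tosses, p=0.5, alternative="greater")
print(result.pvalue)                      # far below .05: some prisoners over-reported heads
```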

I see some parallels between the gambling task and the world of scientific publishing, at least in psychology.  The outcome of a study is partially determined by random factors. Even if a scientist does everything right, a study may produce a non-significant result due to random sampling error. The probability of obtaining a non-significant result when an effect exists is the type-II error probability; the complementary probability of obtaining a significant result is called statistical power.  Just like in the coin toss experiment, the observed percentage of significant results should match the expected percentage based on average power.  Numerous studies have shown that researchers report more significant results than the power of their studies justifies. As in the coin toss experiment, it is not possible to point the finger at a single outcome because chance might have been in a researcher’s favor, but in the long run the odds “cannot be always in your favor” (Hunger Games).  Psychologists disagree whether the excess of significant results in psychology journals should be attributed to dishonesty.  I think it should, and it fits Bargh’s observation that humans, and most scientists are humans, have a tendency to bend the rules when doing so helps them to reach their goal, especially when the goal is highly relevant (e.g., get a job, get a grant, get tenure). Sadly, the extent of over-reporting of significant results is considerably larger than the 10 to 15% over-reporting of heads in the prisoner study.
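The same binomial logic can be applied to the book’s evidence base; the sketch below (in the spirit of an excess-significance test) asks whether 90% significant results could plausibly arise if the 400 coded studies had the 41% average power estimated by z-curve. The numbers are taken from the analysis above, but the test itself is only an illustration of the argument, not part of the reported analysis.

```python
# Can 90% "hits" arise by chance if each of 400 studies has a 41% chance of success?
from scipy import stats

n_studies = 400                        # studies coded from the book
average_power = 0.41                   # z-curve estimate of average replicability
observed_hits = int(0.90 * n_studies)  # 90% of reported results were significant

result = stats.binomtest(observed_hits, n_studies, p=average_power, alternative="greater")
print(result.pvalue)                   # vanishingly small: the excess of hits points to selection
```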

Chapter 9

Chapter 9 introduces readers to Metcalfe’s work on insight problems (e.g., how to put 27 animals into 4 pens so that there is an odd number of animals in all four pens).  Participants had to predict quickly whether they would be able to solve the problem. They then got 5 minutes to actually solve the problem. Participants were not able to predict accurately which insight problems they would solve.  Metcalfe concluded that the solution for insight problems comes during a moment of sudden illumination that is not predictable.  Bargh adds “This is because the solver was working on the problem unconsciously, and when she reached a solution, it was delivered to her fully formed and ready for use.”  In contrast, people are able to predict memory performance on a recognition test, even when they were not able to recall the answer immediately.  This phenomenon is known as the tip-of-the-tongue effect (z = 5.02).  This phenomenon shows that we have access to our memory even before we can recall the final answer.  This phenomenon is similar to the feeling of familiarity that is created by mere exposure (Zajonc, 1968). We often know a face is familiar without being able to recall specific memories where we encountered it.

The only other noteworthy study in Chapter 9 was a study of sleep quality (Fichten et al., 2001).  “The researchers found that by far, the most common type of thought that kept them awake, nearly 50 percent of them, was about the future, the short-term events coming up in the next day or week. Their thoughts were about what they needed to get done the following day, or in the next few days.”   It is true that 48% thought about future short-term events, but only 1% described these thoughts as worries, and 57% of these thoughts were positive.  It is not clear, however, whether this category distinguished good and poor sleepers.  What distinguished good sleepers from poor sleepers, especially those with high distress, was the frequency of negative thoughts (z = 5.59).

Chapter 10

Chapter 10 examines whether it is possible to control automatic impulses. Ample research by personality psychologists suggests that controlling impulses is easier for some people than others.  The ability to exert self-control is often measured with self-report measures that predict objective life outcomes.

However, the book adds a twist to self-control. “The most effective self-control is not through willpower and exerting effort to stifle impulses and unwanted behaviors. It comes from effectively harnessing the unconscious powers of the mind to much more easily do the self-control for you.”

There is a large body of strong evidence that some individuals, those with high impulse control and conscientiousness, perform better academically or at work (Tangney et al., 2004; Study 1 z = 5.90, Galla & Duckworth, Studies 1, 4, & 6, Ns = 488, 7.62, 5.18).  However, correlations between personality measures and outcomes do not reveal the causal mechanism that leads to these positive outcomes.  Bargh suggests that individuals who score high on self-control measures are “the ones who do the good things less consciously, more automatically, and more habitually. And you can certainly do the same.”   This may be true, but empirical work demonstrating it is hard to find.  At the end of the chapter, Bargh cites a recent study by Milyavskaya and Inzlicht that suggested avoiding temptations is more important than being able to exert self-control in the face of temptation, willfully or unconsciously.

Conclusion

The book “Before you know it: The unconscious reasons we do what we do” is based on the author’s personal experiences, studies he has conducted, and studies he has read. The author is a scientist, and I have no doubt that he shares with his readers insights that he believes to be true.  However, this does not automatically make them true.  John Bargh is well aware that many psychologists are skeptical about some of the findings that are used in the book.  Famously, some of Bargh’s own studies have been difficult to replicate.  One response to concerns about replicability could have been new demonstrations that important unconscious priming effects can be replicated. In an interview, Tom Bartlett (January 2013) suggested this to John Bargh.

“So why not do an actual examination? Set up the same experiments again, with additional safeguards. It wouldn’t be terribly costly. No need for a grant to get undergraduates to unscramble sentences and stroll down a hallway. Bargh says he wouldn’t want to force his graduate students, already worried about their job prospects, to spend time on research that carries a stigma. Also, he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.”

Beliefs are subjective.  Readers of the book have their own beliefs and may find parts of the book interesting and may be willing to change some of their beliefs about human behavior.  Not that there is anything wrong with this, but readers should also be aware that it is reasonable to treat the ideas presented in this book with a healthy dose of skepticism.  In 2011, Daniel Kahneman wrote “disbelief is not an option. The results are not made up, nor are they statistical flukes. You have no choice but to accept that the major conclusions of these studies are true.”  Five years later, it is pretty clear that Kahneman is more skeptical about the state of priming research and the results of experiments with small samples in general.  Unfortunately, it is not clear which studies we can believe until replication studies distinguish real effects from statistical flukes. So, until we have better evidence, we are still free to believe what we want about the power of unconscious forces on our behavior.

 


(Preprint) Z-Curve: A Method for Estimating Replicability Based on Test Statistics in Original Studies (Schimmack & Brunner, 2017)

In this PDF document, Jerry Brunner and I would like to share our latest manuscript on z-curve, a method that estimates the average power of a set of studies selected for significance.  We call this estimate replicability because average power determines the success rate if the set of original studies were replicated exactly.
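The claim that average power equals the expected success rate of exact replications can be illustrated with a small simulation; the numbers below are illustrative and are not taken from the manuscript.

```python
# Among studies selected for significance, average true power ~ expected replication success.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies = 100_000
crit = stats.norm.isf(0.025)                      # 1.96, two-sided alpha = .05
true_power = rng.uniform(0.05, 0.95, n_studies)   # heterogeneous true power (illustrative)
ncp = crit - stats.norm.isf(true_power)           # noncentrality implied by each power level
z_obs = rng.normal(ncp, 1.0)                      # observed z-scores of the original studies

selected = z_obs > crit                           # selection for significance (publication bias)
avg_power_selected = true_power[selected].mean()  # the quantity z-curve tries to estimate

z_rep = rng.normal(ncp[selected], 1.0)            # exact replications of the selected studies
success_rate = (z_rep > crit).mean()              # sign reversals ignored for simplicity
print(round(avg_power_selected, 3), round(success_rate, 3))   # approximately equal
```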

We welcome all comments and criticism as we plan to submit this manuscript to a peer-reviewed journal by December 1.

Highlights

Comparison of P-curve and Z-Curve in Simulation studies

Estimate of average replicability in Cuddy et al.’s (2017) p-curve analysis of power posing with z-curve (30% for z-curve vs. 44% for p-curve).

Estimating average replicability in psychology based on over 500,000 significant test statistics.

Comparing automated extraction of test statistics and focal hypothesis tests using Motyl et al.’s (2016) replicability analysis of social psychology.

 

 

Preliminary 2017 Replicability Rankings of 104 Psychology Journals

The table shows the preliminary 2017 rankings of 104 psychology journals.  A description of the methodology and analyses by discipline and over time are reported below the table.

Rank   Journal 2017 2016 2015 2014 2013 2012 2011 2010
1 European Journal of Developmental Psychology 93 88 67 83 74 71 79 65
2 Journal of Nonverbal Behavior 93 72 66 74 81 73 64 70
3 Behavioral Neuroscience 86 67 71 70 69 71 68 73
4 Sex Roles 83 83 75 71 73 78 77 74
5 Epilepsy & Behavior 82 82 82 85 85 81 87 77
6 Journal of Anxiety Disorders 82 77 73 77 76 80 75 77
7 Attention, Perception and Psychophysics 81 71 73 77 78 80 75 73
8 Cognitive Development 81 73 82 73 69 73 67 65
9 Judgment and Decision Making 81 79 78 78 67 75 70 74
10 Psychology of Music 81 80 72 73 77 72 81 86
11 Animal Behavior 80 74 71 72 72 71 70 78
12 Early Human Development 80 92 86 83 79 70 64 81
13 Journal of Experimental Psychology – Learning, Memory & Cognition 80 80 79 80 77 77 71 81
14 Journal of Memory and Language 80 84 81 74 77 73 80 76
15 Memory and Cognition 80 75 79 76 77 78 76 76
16 Social Psychological and Personality Science 80 67 61 65 61 58 63 55
17 Journal of Positive Psychology 80 70 72 72 64 64 73 81
18 Archives of Sexual Behavior 79 79 81 80 83 79 78 87
19 Consciousness and Cognition 79 71 69 73 67 70 73 74
20 Journal of Applied Psychology 79 80 74 76 69 74 72 73
21 Journal of Experimental Psychology – Applied 79 67 68 75 68 74 74 72
22 Journal of Experimental Psychology – General 79 75 73 73 76 69 74 69
23 Journal of Experimental Psychology – Human Perception and Performance 79 78 76 77 76 78 78 75
24 Journal of Personality 79 75 72 68 72 75 73 82
25 JPSP-Attitudes & Social Cognition 79 57 75 69 50 62 61 61
26 Personality and Individual Differences 79 79 79 78 78 76 74 73
27 Social Development 79 78 66 75 73 72 73 75
28 Appetite 78 74 69 66 75 72 74 77
29 Cognitive Behavioral Therapy 78 82 76 65 72 82 71 62
30 Journal of Comparative Psychology 78 77 76 83 83 75 69 64
31 Journal of Consulting and Clinical Psychology 78 71 68 65 66 66 69 68
32 Neurobiology of Learning and Memory 78 72 75 72 71 70 75 73
33 Psychonomic Bulletin and Review 78 79 82 79 82 72 71 78
34 Acta Psychologica 78 75 73 78 76 75 77 75
35 Behavior Therapy 77 74 71 75 76 78 64 76
36 Journal of Affective Disorders 77 85 84 77 83 82 76 76
37 Journal of Child and Family Studies 77 76 69 71 76 71 76 77
38 Journal of Vocational Behavior 77 85 84 69 82 79 86 74
39 Motivation and Emotion 77 64 67 66 67 65 79 68
40 Psychology and Aging 77 79 78 80 74 78 78 74
41 Psychophysiology 77 77 70 69 68 70 80 78
42 British Journal of Social Psychology 76 65 66 62 64 60 72 63
43 Cognition 76 74 75 75 77 76 73 73
44 Cognitive Psychology 76 80 74 76 79 72 82 75
45 Developmental Psychology 76 77 77 75 71 68 70 70
46 Emotion 76 72 69 69 72 70 70 73
47 Frontiers in Behavioral Neuroscience 76 70 71 68 71 72 73 70
48 Frontiers in Psychology 76 75 73 73 72 72 70 82
49 Journal of Autism and Developmental Disorders 76 77 73 67 73 70 70 72
50 Journal of Social and Personal Relationships 76 82 60 63 69 67 79 83
51 Journal of Youth and Adolescence 76 88 81 82 79 76 79 74
52 Cognitive Therapy and Research 75 71 72 62 77 75 70 66
53 Depression & Anxiety 75 78 73 76 82 79 82 84
54 Journal of Child Psychology and Psychiatry and Allied Disciplines 75 63 66 66 72 76 58 66
55 Journal of Occupational and Organizational Psychology 75 85 84 71 77 77 74 67
56 Journal of Social Psychology 75 75 74 67 65 80 71 75
57 Political Psychology 75 81 75 72 75 74 51 70
58 Social Cognition 75 68 68 73 62 78 71 60
59 British Journal of Developmental Psychology 74 77 74 63 61 85 77 79
60 Evolution & Human Behavior 74 81 75 79 67 77 78 68
61 Journal of Research in Personality 74 77 82 80 79 73 74 71
62 Memory 74 79 66 83 73 71 76 78
63 Psychological Medicine 74 83 71 79 79 68 79 75
64 Psychopharmacology 74 75 73 73 71 73 73 71
65 Psychological Science 74 69 70 64 65 64 62 63
66 Behavioural Brain Research 73 69 75 69 71 72 73 74
67 Behaviour Research and Therapy 73 74 76 77 74 77 68 71
68 Journal of Cross-Cultural Psychology 73 75 80 78 78 71 76 76
69 Journal of Experimental Child Psychology 73 73 78 74 74 72 72 76
70 Personality and Social Psychology Bulletin 73 71 65 65 61 61 62 61
71 Social Psychology 73 75 72 74 69 64 75 74
72 Developmental Science 72 68 68 66 71 68 68 66
73 Journal of Cognition and Development 72 78 68 64 69 62 66 70
74 Law and Human Behavior 72 76 76 61 76 76 84 72
75 Perception 72 78 79 74 78 85 94 91
76 Journal of Applied Social Psychology 71 81 69 72 71 80 74 75
77 Journal of Experimental Social Psychology 71 68 63 61 58 56 58 57
78 Annals of Behavioral Medicine 70 70 62 71 71 77 75 71
79 Frontiers in Human Neuroscience 70 74 73 74 75 75 75 72
80 Health Psychology 70 63 68 69 68 63 70 72
81 Journal of Abnormal Child Psychology 70 74 70 74 78 78 68 78
82 Journal of Counseling Psychology 70 69 74 75 76 78 67 80
83 Journal of Educational Psychology 70 74 73 76 76 78 78 84
84 Journal of Family Psychology 70 68 75 71 73 66 68 69
85 JPSP-Interpersonal Relationships and Group Processes 70 74 64 62 66 58 60 56
86 Child Development 69 72 72 71 69 75 72 75
87 European Journal of Social Psychology 69 76 64 72 67 59 69 66
88 Group Processes & Intergroup Relations 69 67 73 68 70 66 68 61
89 Organizational Behavior and Human Decision Processes 69 73 70 70 72 70 71 65
90 Personal Relationships 69 72 71 70 68 74 60 69
91 Journal of Pain 69 79 71 81 73 78 74 72
92 Journal of Research on Adolescence 68 78 69 68 75 76 84 77
93 Self and Identity 66 70 56 73 71 72 70 73
94 Developmental Psychobiology 65 69 67 69 70 69 71 66
95 Infancy 65 61 57 65 70 67 73 57
96 Hormones & Behavior 64 68 66 66 67 64 68 67
97 Journal of Abnormal Psychology 64 67 71 64 71 67 73 70
98 JPSP-Personality Processes and Individual Differences 64 74 70 70 72 71 71 64
99 Psychoneuroendocrinology 64 68 66 65 65 62 66 63
100 Cognition and Emotion 63 69 75 72 76 76 76 76
101 European Journal of Personality 62 78 66 81 70 74 74 78
102 Biological Psychology 61 68 70 66 65 62 70 70
103 Journal of Happiness Studies 60 78 79 72 81 78 80 83
104 Journal of Consumer Psychology 58 56 69 66 61 62 61 66

 

[Figure: ggplot representation of the ranking table (ranking.ggplot)]

Download PDF of this ggplot representation of the table courtesy of David Lovis-McMahon.

Introduction

I define replicability as the probability of obtaining a significant result in an exact replication of a study that produced a significant result.  In the past five years, there have been concerns about a replication crisis in psychology.  Even results that are replicated internally by the same author multiple times fail to replicate in independent replication attempts (Bem, 2011).  The key reason for the replication crisis is selective publishing of significant results (publication bias). While journals report over 95% significant results (Sterling, 1959; Sterling et al., 1995), a 2015 article estimated that less than 50% of these results can be replicated  (OSC, 2015).

The OSC reproducibility project made an important contribution by demonstrating that published results in psychology have low replicability.  However, the reliance on actual replication studies has a number of limitations.  First, actual replication studies are expensive, time-consuming, and sometimes impossible (e.g., a longitudinal study spanning 20 years).  This makes it difficult to rely on actual replication studies to assess the replicability of psychological results, produce replicability rankings of journals, and track replicability over time.

Schimmack and Brunner (2016) developed a statistical method (z-curve) that makes it possible to estimate the average replicability of a set of published results based on the test statistics reported in published articles.  This statistical approach to the estimation of replicability has several advantages over the use of actual replication studies: (a) replicability can be assessed in real time, (b) it can be estimated for all published results rather than a small sample of studies, and (c) it can be applied to studies that are impossible to reproduce.  Finally, actual replication studies can be criticized for deviating from the original studies (Gilbert, King, Pettigrew, & Wilson, 2016). Estimates of replicability based on original studies do not have this problem because they are based on the results reported in the original articles.

Z-curve has been validated with simulation studies and can be used with heterogeneous sets of studies that vary across statistical methods, sample sizes, and effect sizes  (Brunner & Schimmack, 2016).  I have applied this method to articles published in psychology journals to create replicability rankings of psychology journals in 2015 and 2016.  This blog post presents preliminary rankings for 2017 based on articles that have been published so far. The rankings will be updated in 2018, when all 2017 articles are available.

For the 2016 rankings, I used z-curve to obtain annual replicability estimates for 103 journals from 2010 to 2016.  Analyses of time trends showed no changes from 2010 to 2015. However, in 2016 there were first signs of an increase in replicability.  Additional analyses suggested that social psychology journals contributed most to this trend.  The preliminary 2017 rankings provide an opportunity to examine whether there is a reliable increase in replicability in psychology and whether such a trend is limited to social psychology.

Journals

Journals were mainly selected based on impact factor.  The preliminary replicability rankings for 2017 are based on 104 journals. Several new journals were added to increase the number of journals specializing in five disciplines: social (24), cognitive (13), developmental (15), clinical/medical (18), and biological (13) psychology.  The remaining journals were broad journals (e.g., Psychological Science) or journals from other disciplines.  More journals will be added to the final rankings for 2017.

Data Preparation

All PDF versions of published articles were downloaded and converted into text files using the conversion program pdfzilla.  The text files were searched for reports of statistical results using a self-created R program. Only F-tests, t-tests, and z-tests were used for the rankings because they can be reliably extracted from diverse journals. t-values that were reported without degrees of freedom were treated as z-values, which leads to a slight inflation in replicability estimates. However, the bulk of the test statistics were F-values and t-values with degrees of freedom. Test statistics were converted into exact p-values, and exact p-values were converted into absolute z-scores as a measure of the strength of evidence against the null-hypothesis.
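For readers who want to see how this conversion works, here is a minimal sketch in R (this is not the extraction program itself; the function names t_to_z and f_to_z are made up for illustration):

t_to_z <- function(t, df) {
  p <- 2 * pt(abs(t), df, lower.tail = FALSE)  # exact two-tailed p-value of the t-test
  qnorm(p / 2, lower.tail = FALSE)             # absolute z-score corresponding to this p-value
}
f_to_z <- function(f, df1, df2) {
  p <- pf(f, df1, df2, lower.tail = FALSE)     # exact p-value of the F-test
  qnorm(p / 2, lower.tail = FALSE)
}
t_to_z(2.5, 48)      # roughly z = 2.41
f_to_z(6.25, 1, 98)  # roughly z = 2.45, equivalent to t(98) = 2.50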

Data Analysis

The data for each year were analyzed using z-curve (Schimmack & Brunner, 2016). Z-curve provides a replicability estimate. In addition, it generates a Powergraph. A Powergraph is essentially a histogram of absolute z-scores. Visual inspection of Powergraphs can be used to examine publication bias. A drop of z-values on the left side of the significance criterion (p < .05, two-tailed, z = 1.96) shows that non-significant results are underrepresented. A further drop may be visible at z = 1.65 because values between z = 1.65 and z = 1.96 are sometimes reported as marginally significant support for a hypothesis.  The critical values z = 1.65 and z = 1.96 are marked by vertical red lines in the Powergraphs.

Replicability rankings rely only on statistically significant results (z > 1.96).  The aim of z-curve is to estimate the average probability that an exact replication of a study that produced a significant result produces a significant result again.  As replicability estimates rely only on significant results, journals are not being punished for publishing non-significant results.  The key criterion is how strong the evidence against the null-hypothesis is when an article publishes results that lead to the rejection of the null-hypothesis.

Statistically, replicability is the average statistical power of the set of studies that produced significant results.  As power is the probability of obtaining a significant result, the average power of the original studies is equivalent to the average power of a set of exact replication studies. Thus, the average power of the original studies is an estimate of replicability.
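To see why, consider a small simulation sketch in R (the uniform distribution of true power values is just a made-up example):

set.seed(1)
true.power <- runif(100000, .05, .95)          # hypothetical true power of 100,000 studies
sig <- rbinom(100000, 1, true.power) == 1      # which original studies produced a significant result
mean(true.power[sig])                          # average power of the significant studies
mean(rbinom(sum(sig), 1, true.power[sig]))     # success rate of their exact replications (nearly identical)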

Links to powergraphs for all journals and years are provided in the ranking table.  These powergraphs provide additional information that is not used for the rankings. The only information that is being used is the replicability estimate based on the distribution of significant z-scores.

Results

The replicability estimates for each journal and year (104 * 8 = 832 data points) served as the raw data for the following statistical analyses.  I fitted a growth model in MPLUS 7.4 to examine time trends and variability across journals and disciplines.

I compared several models. Model 1 assumed no mean-level changes and stable variability across journals (significant variance in the intercept/trait factor). Model 2 assumed no change from 2010 to 2015 and allowed for mean-level changes in 2016 and 2017 as well as stable differences between journals. Model 3 was identical to Model 2 but additionally allowed for random variability in the slope factor.
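The models were fitted in MPLUS; for readers who prefer R, a roughly equivalent sketch of Model 2 in lavaan might look like this (it assumes a wide-format data frame journal_wide with columns y2010 to y2017 holding the annual estimates; this is not the MPLUS code that was actually used):

library(lavaan)
model2 <- '
  # stable differences between journals (intercept/trait factor)
  i =~ 1*y2010 + 1*y2011 + 1*y2012 + 1*y2013 + 1*y2014 + 1*y2015 + 1*y2016 + 1*y2017
  # step change in 2016-2017
  s =~ 0*y2010 + 0*y2011 + 0*y2012 + 0*y2013 + 0*y2014 + 0*y2015 + 1*y2016 + 1*y2017
  s ~~ 0*s   # Model 2 fixes the slope variance to zero; freeing it gives Model 3
'
fit2 <- growth(model2, data = journal_wide)
fitMeasures(fit2, c("rmsea", "bic"))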

Model 1 did not have acceptable fit (RMSEA = .109, BIC = 5198). Model 2 improved fit (RMSEA = .063, BIC = 5176).  Model 3 did not improve fit further (RMSEA = .063, BIC = 5180), the variance of the slope factor was not significant, and BIC favored the more parsimonious Model 2.  The parameter estimates suggested that replicability estimates increased by 2 points, from 72 in the years 2010 to 2015 to 74 (z = 3.70, p < .001).

The standardized loadings of individual years on the latent intercept factor ranged from .57 to .61.  This implies that about one-third of the variance is stable, while the remaining two-thirds of the variance is due to fluctuations in estimates from year to year.

The average of 72% replicability is notably higher than the estimate of 62% reported in the 2016 rankings.  The difference is due to a computational error in the 2016 rankings that affected mainly the absolute values, but not the relative ranking of journals. The R code for the 2016 rankings miscalculated the percentage of extreme z-scores (z > 6), which is used to adjust the z-curve estimate that is based on z-scores between 1.96 and 6, because all z-scores greater than 6 essentially have 100% power.  For the 2016 rankings, I erroneously computed the percentage of extreme z-scores out of all z-scores rather than out of the set of statistically significant results. This error became apparent during new simulation studies that produced wrong estimates.

Although the previous analysis failed to find significant variability for the slope (change) factor, this could be due to the low power of this statistical test.  The next models included disciplines as predictors of the intercept (Model 4) or the intercept and slope (Model 5).  Model 4 had acceptable fit (RMSEA = .059, BIC = 5175), and Model 5 improved fit further (RMSEA = .036, BIC = 5178), although BIC favored the more parsimonious model.  Because BIC favors parsimony, the better BIC of the simpler model cannot be interpreted as evidence for the absence of an effect.  Model 5 showed two significant (p < .05) effects, for social and developmental psychology.  In Model 6, I included only social and developmental psychology as predictors of the slope factor.  BIC favored this model over the other models (RMSEA = .029, BIC = 5164).  The model results showed improvements for social psychology (increase by 4.48 percentage points, z = 3.46, p = .001) and developmental psychology (increase by 3.25 percentage points, z = 2.65, p = .008).  Whereas the improvement for social psychology was expected based on the 2016 results, the increase for developmental psychology was unexpected and requires replication in the 2018 rankings.

The only significant predictors for the intercept were social psychology (-4.92 percentage points, z = 4.12, p < .001) and cognitive psychology (+2.91, z = 2.15, p = .032).  The strong negative effect (standardized effect size d = 1.14) for social psychology confirms earlier findings that social psychology was most strongly affected by the replication crisis (OSC, 2015). It is encouraging to see that social psychology is also the discipline with the strongest evidence for improvement in response to the replication crisis.  With an increase of 4.48 points, the replicability of social psychology is now at the same level as most other disciplines in psychology; only cognitive psychology remains a bit more replicable than the other disciplines.

In conclusion, the results confirm that social psychology had lower replicability than other disciplines, but they also show that social psychology has significantly improved in replicability over the past couple of years.

Analysis of Individual Journals

The next analysis examined changes in replicability at the level of individual journals. Replicability estimates were regressed on a dummy variable that contrasted 2010-2015 (0) with 2016-2017 (1). This analysis produced 10 significant increases with p < .01 (one-tailed), when only about 1 out of 100 would be expected by chance.
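A sketch of this analysis in R, assuming a matrix est with journals in rows and the years 2010 to 2017 in columns (the object name is made up):

period <- as.integer(2010:2017 >= 2016)   # 0 = 2010-2015, 1 = 2016-2017
p.onetailed <- apply(est, 1, function(y) {
  fit <- summary(lm(y ~ period))$coefficients
  if (fit["period", "Estimate"] > 0) fit["period", "Pr(>|t|)"] / 2 else 1
})
sum(p.onetailed < .01)   # number of journals with a significant increase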

Five of the 10 journals (50% vs. 20% in the total set of journals) were from social psychology (SPPS + 13, JESP + 11, JPSP-IRGP + 11, PSPB + 10, Sex Roles + 8).  The remaining journals were from developmental psychology (European J. Dev. Psy + 17, J Cog. Dev. + 9), clinical psychology (J. Cons. & Clinical Psy + 8, J. Autism and Dev. Disorders + 6), and the Journal of Applied Psychology (+7).  The high proportion of social psychology journals provides further evidence that social psychology has responded most strongly to the replication crisis.

 

Limitations

Although z-curve provides very good absolute estimates of replicability in simulation studies, the absolute values in the rankings have to be interpreted with a big grain of salt for several reasons.  Most importantly, the rankings are based on all test statistics that were reported in an article.  Only a few of these statistics test theoretically important hypotheses; others may be manipulation checks or other incidental analyses.  For the OSC (2015) studies, the replicability estimate was 69%, when the actual success rate was only 37%.  Moreover, comparisons of the automated extraction method used for the rankings and hand-coding of focal hypothesis tests in the same articles also show a 20-percentage-point difference.  Thus, a reported replicability of 70% may imply only 50% replicability for a critical hypothesis test.  Second, the estimates are based on the ideal assumptions underlying statistical test distributions. Violations of these assumptions (e.g., outliers) are likely to reduce actual replicability.  Third, actual replication studies are never exact replication studies, and minor differences between the studies are also likely to reduce replicability.  There are currently not enough actual replication studies to correct for these factors, but the average is likely to be less than 72%. It is also likely to be higher than 37%, because that estimate is heavily influenced by social psychology, while cognitive psychology had a success rate of 50%.  Thus, a plausible range for the typical replicability of psychology is somewhere between 40% and 60%.  We might say the glass is half full and half empty, while there is systematic variation around this average across journals.

Conclusion

It has been 55 years since Cohen (1962) pointed out that psychologists conduct many studies that are likely to produce non-significant results (type-II errors).  For decades there was no sign of improvement.  The preliminary rankings for 2017 provide the first empirical evidence that psychologists are waking up to the replication crisis caused by selective reporting of significant results from underpowered studies.  Right now, social psychologists appear to respond most strongly to concerns about replicability.  However, it is possible that other disciplines will follow in the future as the open science movement gains momentum.  Hopefully, replicability rankings can provide an incentive to consider replicability as one of several criteria for publication.   A study with z = 2.20 and another study with z = 3.85 are both significant (z > 1.96), but the study with z = 3.85 has a higher chance of being replicable. Everything else being equal, editors should favor studies with stronger evidence; that is, higher z-scores (a.k.a. lower p-values).  By taking the strength of evidence into account, psychologists can move away from treating all significant results (p < .05) as equal and take type-II errors and power into account.

 

P-REP (2005-2009): Reexamining the experiment to replace p-values with the probability of replicating an effect

In 2005, Psychological Science published an article titled “An Alternative to Null-Hypothesis Significance Tests” by Peter R. Killeen.    The article proposed to replace p-values and significance testing with a new statistic; the probability of replicating an effect (P-rep).  The article generated a lot of excitement and for a period from 2006 to 2009, Psychological Science encouraged reporting p-rep.   After some statistical criticism and after a new editor took over Psychological Science, interest in p-rep declined (see Figure).

It is ironic that only a few years later, psychological science would encounter a replication crisis, where several famous experiments did not replicate.  Despite much discussion about replicability of psychological science in recent years, Killeen’s attempt to predict replication outcome has been hardly mentioned.  This blog post reexamines p-rep in the context of the current replication crisis.

The abstract clearly defines p-rep as an estimate of “the probability of replicating an effect” (p. 345), which is the core meaning of replicability. Factories have high replicability (6 sigma) and produce virtually identical products that work with high probability. However, in empirical research it is not so easy to define what it means to get the same result. So, the first step in estimating replicability is to define the result of a study that a replication study aims to replicate.

“Traditionally, replication has been viewed as a second successful attainment of a significant effect” (Killeen, 2005, p. 349). Viewed from this perspective, p-rep would estimate the probability of obtaining a significant result (p < alpha) after observing a significant result in an original study.

Killeen proposes to change the criterion to the sign of the observed effect size. This implies that p-rep can only be applied to directional hypotheses (e.g., it does not apply to tests of explained variance).  The criterion for a successful replication then becomes observing an effect size with the same sign as in the original study.

Although this may appear like a radical change from null-hypothesis significance testing, this is not the case.  We can translate the sign criterion into an alpha level of 50% in a one-tailed t-test.  For a one-tailed t-test, negative effect sizes have p-values ranging from 1 to .50 and positive effect sizes have p-values ranging from .50 to 0.  So, a successful outcome is associated with a p-value below .50 (p < .50).

If we observe a positive effect size in the original study, we can compute the power of obtaining a positive result in a replicating study with a post-hoc power analysis, where we enter information about the standardized effect size, sample size, and alpha = .50, one-tailed.

Using R syntax this can be achieved with the formula:

pt(obs.es / se, N - 2)

with obs.es being the observed standardized effect size (Cohen’s d), N = total sample size, and se = sampling error = 2/sqrt(N).

The similarity to p-rep is apparent, when we look at the formula for p-rep.

pnorm(obs.es / se / sqrt(2))

There are two differences. First, p-rep uses the standard normal distribution to estimate power. This is a simplification that ignores the degrees of freedom.  The more accurate formula for power is the non-central t-distribution that takes the degrees of freedom (N - 2) into account.  However, even with modest sample sizes of N = 40, this simplification has negligible effects on power estimates.

The second difference is that p-rep reduces the non-centrality parameter (effect size/sampling error) by a factor of square-root 2.  Without going into the complex reasoning behind this adjustment, the end-result of the adjustment is that p-rep will be lower than the standard power estimate.

Using Killeen’s example on page 347 with d = .5 and N = 20, p-rep = .785.  In contrast, the power estimate with alpha = .50 is .861.
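Both numbers are easy to verify in R:

d <- .5; N <- 20; se <- 2 / sqrt(N)
pnorm(d / se / sqrt(2))   # p-rep = .785
pt(d / se, N - 2)         # power estimate with alpha = .50, one-tailed = .861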

The comparison of p-rep with standard power analysis brings up an interesting and unexplored question. “Does p-rep really predict the probability of replication?”  (p. 348).  Killeen (2005) uses meta-analyses to answer this question.  In one example, he found that 70% of studies showed a negative relation between heart rate and aggressive behaviors.  The median value of p-rep over those studies was 71%.  Two other examples are provided.

A better way to evaluate estimates of replicability is to conduct simulation studies where the true answer is known.  For example, a simulation study can simulate 1,000,000 exact replications of Killeen’s example with d = .5 and N = 20 and we can observe how many studies show a positive observed effect size.  In a single run of this simulation, 86,842 studies showed a positive sign. Median P-rep (.788) underestimates this actual success rate, whereas median observed power more closely predicts the observed success rate (.861).
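A minimal sketch of such a simulation in R (with 100,000 rather than 1,000,000 runs to keep it fast; results will vary slightly across runs):

set.seed(123)
# exact replications of a two-group study with d = .5 and N = 20 (n = 10 per cell)
pos.sign <- replicate(100000, mean(rnorm(10, .5, 1)) - mean(rnorm(10, 0, 1)) > 0)
mean(pos.sign)   # about .87 of replications show a positive sign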

This is not surprising.  Power analysis is designed to predict the long-term success rate given a population effect size, a criterion value, and sampling error.  The adjustment made by Killeen is unnecessary and leads to the wrong prediction.

P-rep applied to Single Studies

It is also peculiar to use meta-analyses to test the performance of p-rep because a meta-analysis implies that many studies have been conducted, whereas the goal of p-rep was to predict the outcome of a single replication study from the outcome of an original study.

This primary aim also explains the adjustment to the non-centrality parameter, which was based on the idea of adding the sampling variances of the original and the replication study.  Finally, Killeen clearly states that the goal of p-rep is to ignore population effect sizes and to define replicability as “an effect of the same sign as that found in the original experiment” (p. 346).  This is very different from power analysis, which estimates the probability of obtaining an effect of the same sign as the population effect size.

We can evaluate p-rep as a predictor of obtaining effect sizes in the same direction in two studies with another simulation study.  Assume that the effect size is d = .20 and the total sample size is also small (N = 20).  The median p-rep estimate is 62%.

The 2 x 2 table shows how often the effect sizes of the original study and the replication study match.

                     Replication negative    Replication positive
Original negative            11%                      22%
Original positive            22%                      45%

The table shows that the original and replication study match only 45% of the time when the sign also matches the (positive) population effect size. Another 11% of matches occur when both the original and the replication study show the wrong sign, so that future replication studies are more likely to show the opposite effect size.  Although these cases meet the definition of replicability with the sign of the original study as criterion, it seems questionable to count a pair of studies that both show the wrong result as a successful replication.  Furthermore, the median p-rep estimate of 62% is inconsistent with the correctly matched cases (45%) as well as with the total number of matched cases (45% + 11% = 56%).  In conclusion, it is neither sensible to define replicability as sign consistency between pairs of exact replication studies, nor does p-rep estimate this probability very well.
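The cell percentages of the table can also be derived analytically, as this R sketch shows:

d <- .20; N <- 20; se <- 2 / sqrt(N)
p.pos <- pnorm(d / se)    # about .67 chance of a positive sign in any one study
(1 - p.pos)^2             # both negative: about 11%
2 * p.pos * (1 - p.pos)   # mismatch: about 44% (22% in each off-diagonal cell)
p.pos^2                   # both positive: about 45%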

Can we fix it?

The previous examination of p-rep showed that it is essentially an observed power estimate with alpha = 50% and an attenuated non-centrality parameter.  Does this mean we can fix p-rep and turn it into a meaningful statistic?  In other words, is it meaningful to compute the probability that future replication studies will reveal the direction of the population effect size by computing power with alpha = 50%?

For example, a researcher finds an effect size of d = .4 with a total sample size of N = 100.  Using a standard t-test, the researcher can report the traditional p-value, p = .048.  The table below shows how often pairs of original and replication studies match in this scenario.

                     Replication negative    Replication positive
Original negative             0%                       2%
Original positive             2%                      96%

The simulation results show that most observations show consistent signs in pairs of studies and are also consistent with the population effect size.  Median observed power, the new p-rep, is 98%. So, is a high p-rep value a good indicator that future studies will also produce a positive sign?

The main problem with observed power analysis is that it relies on the observed effect size as an estimate of the population effect size.  However, in small samples, the difference between observed effect sizes and population effect sizes can be large, which leads to very variable estimates of p-rep. One way to alert readers to the variability in replicability estimates is to provide a confidence interval around the estimate.  As p-rep is a function of the observed effect size, this is easily achieved by converting the lower and upper limit of the confidence interval around the effect size into a confidence interval for p-rep.  With d = .4 and N = 100 (sampling error = 2/sqrt(100) = .20), the confidence interval of effect sizes ranges from d = .008 to d = .792.  The corresponding p-rep values are 52% to 100%.
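In R, the conversion of the effect size confidence interval into a confidence interval for p-rep looks like this:

d <- .4; N <- 100; se <- 2 / sqrt(N)
ci.d <- d + c(-1, 1) * qnorm(.975) * se   # d = .008 to d = .792
pnorm(ci.d / se / sqrt(2))                # p-rep from just above .50 to close to 1.00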

Importantly, a value of 50% is the lower bound for p-rep and corresponds to determining the direction of the effect by a coin toss.  In other words, the point estimate of replicability can be highly misleading because the observed effect size may be considerably lower than the population effect size.   This means that reporting the point-estimate of p-rep can give false assurance about replicability, while the confidence interval shows that there is tremendous uncertainty around this estimate.

Understanding Replication Failures

Killeen (2005) pointed out that it can be difficult to understand replication failures using the traditional criterion of obtaining a significant result in the replication study.  For example, the original study may have reported a significant result with p = .04 and the replication study produced a non-significant p-value of p = .06.  According to the criterion of obtaining a significant result in the replication study, this outcome is a disappointing failure.  Of course, there is no meaningful difference between p = .04 and p = .06. It just so happens that they are on opposite sides of an arbitrary criterion value.

Killeen suggests that we can avoid this problem by reporting p-rep.  However, p-rep just changes the arbitrary criterion value from p = .05 to d = 0.  It is still possible that a replication study will fail because the effect sizes do not match.  Whereas the effect size in an original study was d = .05, the effect size in the replication study was d = -.05.  In small samples, this is not a meaningful difference in effect sizes, but the outcome constitutes a replication failure.

There is simply no way around making mistakes in inferential statistics.  We can only try to minimize them by reducing sampling error, which requires more resources.  By setting alpha to 50%, we are reducing type-II errors (failing to support a correct hypothesis) at the expense of increasing the risk of a type-I error (accepting a hypothesis that is actually wrong), but errors will be made.

P-rep and Publication Bias

Killeen (2005) points out another limitation of p-rep.  “One might, of course, be misled by a value of prep that itself cannot be replicated. This can be caused by publication bias against small or negative effects.” (p. 350).  Here we see the real problem of raising alpha to 50%.  If there is no effect (d = 0), one out of two studies will produce a positive result that can be published.  If 100 researchers test an interesting hypothesis in their labs, but only positive results are published, approximately 50 articles will support a false conclusion, while the 50 other studies that showed the opposite result will be hidden in file drawers.  A stricter alpha criterion is needed to minimize the rate of false inferences, especially when publication bias is present.

A counter-argument could be that researchers who find a negative result can also publish their results, because positive and negative results are equally publishable. However, this would imply that journals are filled with inconsistent results, and research areas with small effects and small samples would publish nearly equal numbers of studies with positive and negative results. Each article would draw a conclusion based on the results of a single study and try to explain inconsistent results with potential moderator variables.  By imposing a stricter criterion for sufficient evidence, published results become more consistent and more likely to reflect a true finding.  This is especially true if studies have sufficient power to reduce the risk of type-II errors and if journals do not selectively report studies with positive results.

Does this mean estimating replicability is a bad idea?

Although Killeen’s (2005) main goal was to predict the outcome of a single replication study, he did explore how well median replicability estimates predicted the outcome of meta-analyses.  As aggregation across studies reduces sampling error, replicability estimates based on sets of studies can be useful to predict actual success rates in studies (Sterling et al., 1995).  The comparison of median observed power with actual success rates can be used to reveal publication bias (Schimmack, 2012), and median observed power is a valid predictor of future study outcomes in the absence of publication bias and for homogeneous sets of studies. More advanced methods even make it possible to estimate replicability when publication bias is present and when the set of studies is heterogeneous (Brunner & Schimmack, 2016).  So, while p-rep has a number of shortcomings, the idea of estimating replicability deserves further attention.

Conclusion

The rise and fall of p-rep in the first decade of the 2000s tells an interesting story about psychological science.  In hindsight, the popularity of p-rep is consistent with an area that focused more on discoveries than on error control.  Ideally, every study, no matter how small, would be sufficient to support inferences about human behavior.  The criterion to produce a p-value below .05 was deemed an “unfortunate historical commitment to significance testing” (p. 346), when psychologists were only interested in the direction of the observed effect size in their sample.  Apparently, there was no need to examine whether the observed effect size in a small sample was consistent with a population effect size or whether the sign would replicate in a series of studies.

Although p-rep never replaced p-values (most published p-rep values convert into p-values below .05), the general principles of significance testing were ignored. Instead of increasing alpha, researchers found ways to lower p-values to meet the alpha = .05 criterion. A decade later, the consequences of this attitude towards significance testing are apparent.  Many published findings do not hold up when they are subjected to an actual replication attempt by researchers who are willing to report successes and failures.

In this emerging new era, it is important to teach a new generation of psychologists how to navigate the inescapable problem of inferential statistics: you will make errors. Either you falsely claim a discovery of an effect or you fail to provide sufficient evidence for an effect that does exist.  Errors are part of science. How many and what type of errors will be made depends on how scientists conduct their studies.

What would Cohen say? A comment on p < .005

Most psychologists are trained in Fisherian statistics, which has become known as Null-Hypothesis Significance Testing (NHST).  NHST compares an observed effect size against a hypothetical effect size. The hypothetical effect size is typically zero; that is, the hypothesis is that there is no effect.  The deviation of the observed effect size from zero relative to the amount of sampling error provides a test statistic (test statistic = effect size / sampling error).  The test statistic can then be compared to a criterion value. The criterion value is typically chosen so that only 5% of test statistics would exceed the criterion value by chance alone.  If the test statistic exceeds this value, the null-hypothesis is rejected in favor of the inference that an effect greater than zero was present.
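As a concrete illustration, here is a hypothetical two-group study in R; the test statistic is simply the standardized effect size divided by its sampling error:

set.seed(42)
x <- rnorm(50, mean = 0, sd = 1)    # control group
y <- rnorm(50, mean = .4, sd = 1)   # treatment group
d  <- (mean(y) - mean(x)) / sqrt((var(x) + var(y)) / 2)  # standardized effect size (Cohen's d)
se <- 2 / sqrt(100)                                      # approximate sampling error of d
d / se                                     # test statistic, compared to the criterion value of about 1.96
t.test(y, x, var.equal = TRUE)$statistic   # the same value, computed as a two-sample t-test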

One major problem of NHST is that non-significant results are not considered.  To address this limitation, Neyman and Pearson extended Fisherian statistics and introduced the concepts of type-I (alpha) and type-II (beta) errors.  A type-I error occurs when researchers falsely reject a true null-hypothesis; that is, they infer from a significant result that an effect was present, when there is actually no effect.  The type-I error rate is fixed by the criterion for significance, which is typically p < .05.  This means that, in the long run, no more than 5% of tests of true null-hypotheses produce false-positive results.  The maximum of 5% false-positive results would only be observed if all studies tested effects that do not exist. In this case, we would expect 5% significant results and 95% non-significant results.

The important contribution by Neyman and Pearson was to consider the complementary type-II error.  A type-II error occurs when an effect is present, but a study produces a non-significant result.  In this case, researchers fail to detect a true effect.  The type-II error rate depends on the size of the effect and the amount of sampling error.  If effect sizes are small and sampling error is large, test statistics will often be too small to exceed the criterion value.

Neyman-Pearson statistics was popularized in psychology by Jacob Cohen.  In 1962, Cohen examined effect sizes and sample sizes (as a proxy for sampling error) in the Journal of Abnormal and Social Psychology and concluded that there is a high risk of type-II errors because sample sizes are too small to detect even moderate effect sizes and inadequate to detect small effect sizes.  Over the following decades, methodologists repeatedly pointed out that psychologists often conduct studies with a high risk of failing to provide empirical evidence for real effects (Sedlmeier & Gigerenzer, 1989).

The concern about type-II errors has been largely ignored by empirical psychologists.  One possible reason is that journals had no problem filling volumes with significant results, while rejecting 80% of submissions that also presented significant results.  Apparently, type-II errors were much less common than methodologists feared.

However, in 2011 it became apparent that the high success rate in journals was illusory. Published results were not representative of studies that were conducted. Instead, researchers used questionable research practices or simply did not report studies with non-significant results.  In other words, the type-II error rate was as high as methodologists suspected, but selection of significant results created the impression that nearly all studies were successful in producing significant results.  The influential “False Positive Psychology” article suggested that it is very easy to produce significant results without an actual effect.  This led to the fear that many published results in psychology may be false positive results.

Doubt about the replicability and credibility of published results has led to numerous recommendations for the improvement of psychological science.  One of the most obvious recommendations is to ensure that published results are representative of the studies that are actually being conducted.  Given the high type-II error rates, this would mean that journals would be filled with many non-significant and inconclusive results.  This is not a very attractive solution because it is not clear what the scientific community can learn from an inconclusive result.  A better solution would be to increase the statistical power of studies. Statistical power is simply the inverse of a type-II error (power = 1 – beta).  As power increases, studies with a true effect have a higher chance of producing a true positive result (e.g., a drug is an effective treatment for a disease). Numerous articles have suggested that researchers should increase power to increase replicability and credibility of published results (e.g., Schimmack, 2012).

In a recent article, a team of 72 authors proposed another solution. They recommended that psychologists should reduce the probability of a type-I error from 5% (1 out of 20 studies) to 0.5% (1 out of 200 studies).  This recommendation is based on the belief that the replication crisis in psychology reflects a large number of type-I errors.  By reducing the alpha criterion, the rate of type-I errors will be reduced from a maximum of 10 out of 200 studies to 1 out of 200 studies.

I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.  Keeping resources and sampling error constant, reducing the type-I error rate increases the type-II error rate. This is undesirable because the actual type-II error is already large.

For example, a between-subject comparison of two means with a standardized effect size of d = .4 and a sample size of N = 100 (n = 50 per cell) has a 50% risk of a type-II error.  The risk of a type-II error rises to 80% if alpha is reduced to .005.  It makes no sense to conduct a study with an 80% chance of failure (Tversky & Kahneman, 1971).  Thus, the call for a lower alpha implies that researchers will have to invest more resources to discover true positive results.  Many researchers may simply lack the resources to meet this stringent significance criterion.
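These type-II error rates are easy to verify with the power.t.test function in R:

1 - power.t.test(n = 50, delta = .4, sd = 1, sig.level = .05)$power    # about .49
1 - power.t.test(n = 50, delta = .4, sd = 1, sig.level = .005)$power   # about .80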

My suggestion is exactly opposite to the recommendation of a more stringent criterion.  The main problem for selection bias in journals is that even the existing criterion of p < .05 is too stringent and leads to a high percentage of type-II errors that cannot be published.  This has produced the replication crisis with large file-drawers of studies with p-values greater than .05,  the use of questionable research practices, and publications of inflated effect sizes that cannot be replicated.

To avoid this problem, researchers should use a significance criterion that balances the risk of a type-I and a type-II error.  For example, with an expected effect size of d = .4 and N = 100, researchers should use p < .20 for significance, which reduces the risk of a type-II error to 20%.  In this case, type-I and type-II errors are balanced.  If the study produces a p-value of, say, .15, researchers can publish the result with the conclusion that the study provided evidence for the effect. At the same time, readers are warned that they should not interpret this result as strong evidence for the effect because there is a 20% probability of a type-I error.

Given this positive result, researchers can then follow up their initial study with a larger replication study that allows for stricter type-I error control, while holding power constant.   With d = .4, they now need N = 200 participants to have 80% power with alpha = .05.  Even if the second study does not produce a significant result (the probability that two studies with 80% power are both significant is only 64%; Schimmack, 2012), researchers can combine the results of both studies, and with N = 300, the combined studies have 80% power with alpha = .01.

The advantage of starting with smaller studies with a higher alpha criterion is that researchers are able to test risky hypotheses with a smaller amount of resources.  In the example, the first study used “only” 100 participants.  In contrast, the proposal to require p < .005 as evidence for an original, risky study implies that researchers need to invest a lot of resources in a risky study that may provide inconclusive results if it fails to produce a significant result.  A power analysis shows that a sample size of N = 338 participants is needed to have 80% power for an effect size of d = .4 and p < .005 as criterion for significance.
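The sample size requirements mentioned above can be checked with power.t.test as well:

power.t.test(delta = .4, sd = 1, power = .80, sig.level = .05)$n    # about 100 per cell, N = 200
power.t.test(delta = .4, sd = 1, power = .80, sig.level = .005)$n   # about 169 per cell, N = 338
power.t.test(n = 150, delta = .4, sd = 1, sig.level = .01)$power    # about .80 for the combined N = 300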

Rather than investing 300 participants into a risky study that may produce a non-significant and uninteresting result (eating green jelly beans does not cure cancer), researchers may be better able and willing to start with 100 participants and to follow up an encouraging result with a larger follow-up study.  The evidential value that arises from one study with 300 participants or two studies with 100 and 200 participants is the same, but requiring p < .005 from the start discourages risky studies and puts even more pressure on researchers to produce significant results if all of their resources are used for a single study.  In contrast, lowering alpha reduces the need for questionable research practices and reduces the risk of type-II errors.

In conclusion, it is time to learn Neyman-Pearson statistics and to remember Cohen’s important contribution of showing that many studies in psychology are underpowered.  Low power produces inconclusive results that are not worthwhile publishing.  A study with low power is like a high-jumper who puts the bar too high and fails every time; we learn nothing about the jumper’s ability. Scientists may learn from high-jump contests, where jumpers start with lower, realistic heights and raise the bar after they succeed.  In the same manner, researchers should conduct pilot studies or risky exploratory studies with small samples and a high type-I error probability and lower the alpha criterion gradually if the results are encouraging, while maintaining a reasonably low type-II error.

Evidently, a significant result with alpha = .20 does not provide conclusive evidence for an effect.  However, the arbitrary p < .005 criterion also falls short of demonstrating conclusively that an effect exists.  Journals publish thousands of results a year, and some of these results may be false positives, even if the error rate is set at 1 out of 200. Thus, p < .005 is neither defensible as a criterion for a first exploratory study, nor conclusive evidence for an effect.  A better criterion for conclusive evidence is that an effect can be replicated across different laboratories with a type-I error probability of less than 1 out of a billion (6 sigma).  This is by no means an unrealistic target.  To achieve this criterion with an effect size of d = .4, a sample size of N = 1,000 is needed.  The combined evidence of 5 labs with N = 200 per lab would be sufficient to produce conclusive evidence for an effect, but only if there is no selection bias.  Thus, the best way to increase the credibility of psychological science is to conduct studies with high power and to minimize selection bias.

This is what I believe Cohen would have said, but even if I am wrong about this, I think it follows from his futile efforts to teach psychologists about type-II errors and statistical power.

How Replicable are Focal Hypothesis Tests in the Journal Psychological Science?

Over the past five years, psychological science has been in a crisis of confidence.  For decades, psychologists have assumed that published significant results provide strong evidence for theoretically derived predictions, especially when authors presented multiple studies with internal replications within a single article (Schimmack, 2012). However, even multiple significant results provide little empirical evidence, when journals only publish significant results (Sterling, 1959; Sterling et al., 1995).  When published results are selected for significance, statistical significance loses its ability to distinguish replicable effects from results that are difficult to replicate or results that are type-I errors (i.e., the theoretical prediction was false).

The crisis of confidence led to several initiatives to conduct independent replications. The most informative replication initiative was conducted by the Open Science Collaboration (Science, 2015).  It replicated close to 100 significant results published in three high-ranking psychology journals.  Only 36% of the replication studies produced a statistically significant result.  The replication success rate varied by journal.  The journal Psychological Science achieved a success rate of 42%.

The low success rate raises concerns about the empirical foundations of psychology as a science.  Without further information, a success rate of 42% implies that it is unclear which published results provide credible evidence for a theory and which findings may not replicate.  It is impossible to conduct actual replication studies for all published studies.  Thus, it is highly desirable to identify replicable findings in the existing literature.

One solution is to estimate replicability for sets of studies based on the published test statistics (e.g., F-statistics, t-values, etc.).  Schimmack and Brunner (2016) developed a statistical method, Powergraphs, that estimates the average replicability of a set of significant results.  This method has been used to estimate the replicability of psychology journals using automatic extraction of test statistics (2016 Replicability Rankings, Schimmack, 2017).  The results for Psychological Science produced estimates in the range from 55% to 63% for the years 2010-2016, with an average of 59%.   This is notably higher than the success rate of the actual replication studies, which produced only 42% successful replications.

There are two explanations for this discrepancy.  First, actual replication studies are not exact replication studies and differences between the original and the replication studies may explain some replication failures.  Second, the automatic extraction method may overestimate replicability because it may include non-focal statistical tests. For example, significance tests of manipulation checks can be highly replicable, but do not speak to the replicability of theoretically important predictions.

To address the concern about automatic extraction of test statistics, I estimated replicability of focal hypothesis tests in Psychological Science with hand-coded, focal hypothesis tests.  I used three independent data sets.

Study 1

For Study 1, I hand-coded focal hypothesis tests of all studies in the 2008 Psychological Science articles that were used for the OSC reproducibility project (Science, 2015).

[Figure: Powergraphs for the 2008 Psychological Science articles included in the OSC reproducibility project (OSC.PS)]

The powergraphs show the well-known effect of publication bias in that most published focal hypothesis tests report a significant result (p < .05, two-tailed, z > 1.96) or at least a marginally significant result (p < .10, two-tailed or p < .05, one-tailed, z > 1.65). Powergraphs estimate the average power of studies with significant results on the basis of the density distribution of significant z-scores.  Average power is an estimate of replicability for a set of exact replication studies.  The left graph uses all significant results. The right graph uses only z-scores greater than 2.4, because questionable research practices may produce many just-significant results and lead to biased estimates of replicability. However, both estimation methods produce similar estimates of replicability (57% & 61%).  Given the small number of statistics, the 95%CI is relatively wide (left graph: 44% to 73%).  These results are compatible with the relatively low success rate of the actual replication studies (42%) and the estimate based on automated extraction (59%).

Study 2

The second dataset was provided by Motyl et al. (JPSP, in press), who coded a large number of articles from social psychology journals and Psychological Science. Importantly, they coded a representative sample of Psychological Science studies from the years 2003, 2004, 2013, and 2014. That is, they did not only code social psychology articles published in Psychological Science.  The dataset included 281 test statistics from Psychological Science.

[Figure: Powergraph for the Psychological Science test statistics coded by Motyl et al. (PS.Motyl)]

The powergraph looks similar to the powergraph in Study 1.  More importantly, the replicability estimates are also similar (57% & 52%).  The 95%CI for Study 1 (44% to 73%) and Study 2 (left graph: 49% to 65%) overlap considerably.  Thus, two independent coding schemes and different sets of studies (2008 vs. 2003-2004/2013/2014) produce very similar results.

Study 3

Study 3 was carried out in collaboration with Sivaani Sivaselvachandran, who hand-coded articles from Psychological Science published in 2016.  The replicability rankings showed a slight positive trend based on automatically extracted test statistics.  The goal of this study was to examine whether hand-coding would also show an increase in replicability.  An increase was expected based on an editorial by D. Stephen Lindsay, the incoming editor in 2015, who aimed to increase the replicability of results published in Psychological Science by introducing badges for open data and preregistered hypotheses. However, the results failed to show a notable increase in average replicability.

[Figure: Powergraph for hand-coded focal hypothesis tests in 2016 Psychological Science articles (PS.2016)]

The replicability estimate was similar to those in the first two studies (59% & 59%).  The 95%CI ranged from 49% to 70%. These wide confidence intervals make it difficult to notice small improvements, but the histogram shows that just-significant results (z = 2 to 2.2) are still the most prevalent results reported in Psychological Science and that non-significant results that are to be expected are not reported.

Combined Analysis 

Given the similar results in all three studies, it made sense to pool the data to obtain the most precise estimate of the replicability of results published in Psychological Science. With 479 significant test statistics, replicability was estimated at 58%, with a 95%CI ranging from 51% to 64%.  This result is in line with the estimate based on automated extraction of test statistics (59%).  The reason for the close match between hand-coded and automated results could be that Psychological Science publishes short articles and authors may report mostly focal results because space does not allow for extensive reporting of other statistics.  The hand-coded data confirm that replicability in Psychological Science is likely to be above 50%.

[Figure: Powergraph for the combined set of hand-coded Psychological Science test statistics (PS.combined)]

It is important to realize that the 58% estimate is an average.  Powergraphs also show the average replicability for segments of z-scores. Here we see that replicability for just-significant results (z < 2.5, roughly p > .01) is only 35%. Even for z-scores between 2.5 and 3.0 (roughly p > .001), replicability is only 47%.  Once z-scores are greater than 3, average replicability is above 50%, and with z-scores greater than 4, replicability is greater than 80%.  For any single study, p-values can vary greatly due to sampling error, but in general a published result with a p-value < .001 is much more likely to replicate than a p-value > .01 (see also OSC, Science, 2015).

Conclusion

This blog post used hand-coding of test statistics published in Psychological Science, the flagship journal of the Association for Psychological Science, to estimate the replicability of published results.  Three datasets produced convergent evidence that the average replicability of exact replication studies is 58% +/- 7%.  This result is consistent with estimates based on automatic extraction of test statistics.  It is considerably higher than the success rate of actual replication studies in the OSC reproducibility project (42%). One possible reason for this discrepancy is that actual replication studies are never exact replication studies, which makes it more difficult to obtain statistical significance if the original studies were selected for significance. For example, the original study may have had an outlier in the experimental group that helped to produce a significant result. Not removing this outlier is not considered a questionable research practice, but an exact replication study will not reproduce the same outlier and may fail to reproduce a just-significant result.  More broadly, any deviation from the assumptions underlying the computation of test statistics will increase the bias that is introduced by selecting significant results.  Thus, the 58% estimate is an optimistic estimate of the maximum replicability under ideal conditions.

At the same time, it is important to point out that 58% replicability for Psychological Science does not mean psychological science is rotten to the core (Motyl et al., in press) or that most reported results are false (Ioannidis, 2005).  Even results that did not replicate in actual replication studies are not necessarily false positive results.  It is possible that more powerful studies would produce a significant result, but with a smaller effect size estimate.

Hopefully, these analyses will spur further efforts to increase replicability of published results in Psychological Science and in other journals.  We are already near the middle of 2017 and can look forward to the 2017 results.