
Dr. R’s Blog about Replicability

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY:  In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication study with the same sample size and the same significance criterion.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

New (May 18, 2016):  Subjective Priors: Putting Bayes Into Bayes Factors

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

TOP TEN LIST


1. 2015 Replicability Rankings of over 100 Psychology Journals
Based on reported test statistics in all articles from 2015, the rankings show the typical strength of evidence for a statistically significant result in a particular journal.  The method also estimates the file-drawer of unpublished non-significant results.  Links to powergraphs provide further information (e.g., whether a journal has too many just-significant results, p < .05 & p > .025).


2. A (preliminary) Introduction to the Estimation of Replicability for Sets of Studies with Heterogeneity in Power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology used to estimate the typical power of a significant result published in a journal.  The post explains the new method for estimating observed power based on the distribution of test statistics converted into absolute z-scores.  The method has since been extended with a model that allows for heterogeneity in power across tests, so that power can be estimated for a wider range of z-scores.  A description of the new method will be published when extensive simulation studies are completed.


3.  Replicability-Rankings of Psychology Departments
This blog presents rankings of psychology departments on the basis of the replicability of significant results published in 105 psychology journals (see the journal rankings for a list of journals).   Reported success rates in psychology journals are over 90%, but this percentage is inflated by selective reporting of significant results.  After correcting for selection bias, replicability is 60%, but there is reliable variation across departments.


4. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.
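For readers who want to compute the R-Index for their own set of results, here is a minimal R sketch of the formula given above.  The p-values are hypothetical, and the two-sided z-test approximation (with inflation defined as the success rate minus median observed power) is an assumption of this sketch, not the exact procedure used for the published analyses.

# Minimal sketch of the R-Index (hypothetical example data).
# R-Index = median observed power - inflation, where
# inflation = success rate - median observed power.
p.values <- c(.04, .03, .049, .01, .02)      # hypothetical published (significant) results
success.rate <- mean(p.values < .05)         # 1.0 here, because only significant results are reported
z <- qnorm(1 - p.values / 2)                 # absolute z-scores for two-sided p-values
obs.power <- pnorm(z - qnorm(.975))          # observed power against the .05 two-sided criterion
median.obs.power <- median(obs.power)
inflation <- success.rate - median.obs.power
r.index <- median.obs.power - inflation      # equivalently 2 * median.obs.power - success.rate
round(c(median.obs.power, inflation, r.index), 2)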


5.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.
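The description above translates into a few lines of R; a minimal sketch with hypothetical z-scores (scaling the observed variance by its degrees of freedom is the usual chi-square test of a variance against a fixed value).

# Minimal sketch of the Test of Insufficient Variance (TIVA), hypothetical z-scores.
# Under random sampling, z-scores have a variance of 1; selection for significance
# compresses this variance, which is detected with a left-tailed chi-square test.
z <- c(2.10, 1.98, 2.25, 2.04)       # hypothetical z-scores from k independent significant results
k <- length(z)
var.obs <- var(z)                    # observed variance of the z-scores
chi2 <- (k - 1) * var.obs / 1        # scaled against the expected variance of 1
p.tiva <- pchisq(chi2, df = k - 1)   # left-tailed p-value: small p = insufficient variance
round(c(var.obs = var.obs, chi2 = chi2, p.tiva = p.tiva), 3)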


6.  Validation of Meta-Analysis of Observed (post-hoc) Power
This post examines the ability of various estimation methods to estimate the power of a set of studies based on the reported test statistics in these studies.  The results show that most estimation methods work well when all studies have the same effect size (homogeneity) or when effect sizes are heterogeneous and symmetrically distributed.  However, most methods fail when effect sizes are heterogeneous and have a skewed distribution.  The post does not yet include the more recent method that uses the distribution of z-scores (powergraphs) to estimate observed power, because this method was developed after this blog was posted.


7. Roy Baumeister’s R-Index
Roy Baumeister was a reviewer of my 2012 article that introduced the Incredibility Index to detect publication bias and dishonest reporting practices.  In his review and in a subsequent email exchange, Roy Baumeister admitted that his published article excluded studies that failed to produce results in support of his theory that blood glucose is important for self-regulation (a theory that is now generally considered to be false), although he disagrees that excluding these studies was dishonest.  The R-Index builds on the Incredibility Index and provides an index of the strength of evidence that corrects for the influence of dishonest reporting practices.  This post reports the R-Index for Roy Baumeister's most cited articles.  The R-Index is low and does not justify the nearly perfect support for empirical predictions in these articles.  At the same time, the R-Index is similar to R-Indices for other sets of studies in social psychology.  This suggests that dishonest reporting practices are the norm in social psychology and that published articles exaggerate the strength of evidence in support of social psychological theories.

8. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.


9.  The R-Index for 18 Multiple-Study Psychology Articles in the Journal SCIENCE.
Francis (2014) demonstrated that nearly all multiple-study articles by psychology researchers that were published in the prestigious journal SCIENCE showed evidence of dishonest reporting practices (disconfirmatory evidence was missing).  Francis (2014) used a method similar to the Incredibility Index.  One problem of this method is that the result is a probability that is influenced both by the amount of bias and by the number of results that were available for analysis.  As a result, an article with 9 studies and moderate bias is treated the same as an article with 4 studies and a lot of bias.  The R-Index avoids this problem by focusing on the amount of bias (inflation) and the strength of evidence.  This blog post shows the R-Index for the 18 articles and reveals that many articles have a low R-Index.


10.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

Fritz Strack’s self-serving biases in his personal account of the failure to replicate his most famous study.

[please hold pencil (pen does not work) like this while reading this blog post]

In “Sad Face: Another classic finding in psychology—that you can smile your way to happiness—just blew up. Is it time to panic yet?” by Daniel Engber, Fritz Strack gets to tell his version of the importance of his original study and of what it means that it failed to replicate in a recent attempt to reproduce his original results in 17 independent replication studies.  In this blog post, I provide my commentary on Fritz Strack’s story to reveal inconsistencies, omissions of important facts, and false arguments that are used to discount the results of the replication studies.

PART I:  Prior to the Replication of Strack et al. (1988)

In 2011, many psychologists lost confidence in social psychology as a science.  One social psychologist had fabricated data at midnight in his kitchen.  Another presented incredible results suggesting that people can foresee random events in the future.  And finally, a researcher failed to replicate a famous study in which subtle reminders of elderly people made students walk more slowly.  A New Yorker article captured the mood of the time.  It was no longer clear which findings one should believe and which would replicate under close scrutiny.  In response, psychologists created a new initiative to replicate original findings across many independent labs.  A first study produced encouraging results.  Many classic findings in psychology (like the anchoring effect) replicated, sometimes even with stronger effect sizes than in the original study.  However, some studies didn’t replicate.  In particular, results from a small group of social psychologists who had built their careers around the idea that small manipulations can have strong effects on participants’ behavior without participants’ awareness (such as the elderly priming study) did not replicate well.  The question was which results from this group of social psychologists who study unconscious or implicit processes would replicate.

Quote “The experts were reluctant to step forward. In recent months their field had fallen into scandal and uncertainty: An influential scholar had been outed as a fraud; certain bedrock studies—even so-called “instant classics”—had seemed to shrivel under scrutiny. But the rigidity of the replication process felt a bit like bullying. After all, their work on social priming was delicate by definition: It relied on lab manipulations that had been precisely calibrated to elicit tiny changes in behavior. Even slight adjustments to their setups, or small mistakes made by those with less experience, could set the data all askew. So let’s say another lab—or several other labs—tried and failed to copy their experiments. What would that really prove? Would it lead anyone to change their minds about the science?”

The small group of social psychologists felt under attack.  They had published hundreds of articles and become famous for demonstrating the influence of unconscious processes that, by definition, are ignored by people when they try to understand their own behavior because they operate in secrecy, undetected by conscious introspection.  What if all of their amazing discoveries were not real?  Of course, the researchers were aware that not all studies worked.  After all, they often encountered failures to find these effects in their own labs.  It often required several attempts to get the right conditions to produce results that could be published.  If a group of researchers just went into the lab and did the study once, how would we know that they did everything right?  Given ample evidence of failure in their own labs, nobody from this group wanted to step forward and replicate their own study or subject their study to a one-shot test.

Quote “Then on March 21, Fritz Strack, the psychologist in Wurzburg, sent a message to the guys. “Don’t get me wrong,” he wrote, “but I am not a particularly religious person and I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’ ” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.” So if the skeptics wanted something to examine—a test case to stand in for all of social-psych research—then let them try his work.”

Fritz Strack was not afraid of failure.  He volunteered his most famous study for a replication project.

Quote “ In 1988, Strack had shown that movements of the face lead to movements of the mind. He’d proved that emotion doesn’t only go from the inside out, as Malcolm Gladwell once described it, but from the outside in.”

It is not exactly clear why Strack picked his 1988 article for replication.  The article included two studies.  The first study produced a result that is called marginally significant.  That is, it did not meet the standard criterion of evidence, a p-value less than .05 (two-tailed).  But the p-value was very close to .05 and less than .10 (or .05 one-tailed).  This finding alone would not justify great confidence in the replicability of the original finding.  Moreover, a small study with so much noise makes it impossible to estimate the true effect size with precision.  The observed effect size in the study was large, but this could have been due to luck (sampling error).  In a replication study, the effect size could be a lot smaller, which would make it difficult to get a significant result.

The key finding of this study was that manipulating participants’ facial muscles appeared to influence their feelings of amusement in response to funny cartoons without participants’ awareness that their facial muscles contributed to the intensity of the experience.  This finding made sense in the context of a long tradition of theories that assumed feedback from facial muscles plays an important role in the experience of emotions. 

Strack seemed to be confident that his results would replicate because many other articles also reported results that seemed to support the facial feedback hypothesis.  His study became famous because it used an elaborate cover story to ensure that the effect occurred without participants’ awareness.

Quote: “In lab experiments, facial feedback seemed to have a real effect…But Strack realized that all this prior research shared a fundamental problem: The subjects either knew or could have guessed the point of the experiments. When a psychologist tells you to smile, you sort of know how you’re expected to feel.”

Strack was not the first to do so. 

Quote: “In the 1960s, James Laird, then a graduate student at the University of Rochester, had concocted an elaborate ruse: He told a group of students that he wanted to record the activity of their facial muscles under various conditions, and then he hooked silver cup electrodes to the corners of their mouths, the edges of their jaws, and the space between their eyebrows. The wires from the electrodes plugged into a set of fancy but nonfunctional gizmos… Subjects who had put their faces in frowns gave the cartoons an average rating of 4.4; those who put their faces in smiles judged the same set of cartoons as being funnier—the average jumped to 5.5.”

 

A change of 1.1 points on a rating scale is a huge effect, and consistent results across different studies would suggest that the effect can be easily replicated.  The point of Strack’s study was not to demonstrate the effect, but to improve the cover story so that it would be difficult for participants to guess the real purpose of the study.

“Laird’s subterfuge wasn’t perfect, though. For all his careful posturing, it wasn’t hard for the students to figure out what he was up to. Almost one-fifth of them said they’d figured out that the movements of their facial muscles were related to their emotions. Strack and Martin knew they’d have to be more crafty. At one point on the drive to Mardi Gras, Strack mused that maybe they could use thermometers. He stuck his finger in his mouth to demonstrate.  Martin, who was driving, saw Strack’s lips form into a frown in the rearview mirror. That would be the first condition. Martin had an idea for the second one: They could ask the subjects to hold thermometers—or better, pens—between their teeth. This would be the stroke of genius that produced a classic finding in psychology.”

So in a way, Strack et al.’s study was a conceptual replication study of Laird’s study that used a different manipulation of facial muscles. And the replication study was successful.

“The results matched up with those from Laird’s experiment. The students who were frowning, with their pens balanced on their lips, rated the cartoons at 4.3 on average. The ones who were smiling, with their pens between their teeth, rated them at 5.1. What’s more, not a single subject in the study noticed that her face had been manipulated. If her frown or smile changed her judgment of the cartoons, she’d been totally unaware.”

However, even though the effect size is still large, a .8 difference in ratings, the effect was only marginally significant.  A second study by Strack et al. also produced only a marginally significant result.  Thus, we may start to wonder why the researchers were not able to produce stronger evidence for the effect that would meet the conventional criterion for claiming a discovery, p < .05 (two-tailed).  And why did this study become a classic without stronger evidence that the effect is real and that it is really as large as the reported effect sizes in these studies?  The effect size may not matter for basic research studies that merely want to demonstrate that the effect exists, but it is important for applications to the real world.  Even if an effect is large under strictly controlled laboratory conditions, it is going to be much smaller in real-world situations where many of the factors that are controlled in the laboratory also influence emotional experiences.  This might also explain why people normally do not notice the contribution of their facial expressions to their experiences.  Relative to their mood, the funniness of a joke, the presence of others, and a dozen more contextual factors that influence our emotional experiences, feedback from facial muscles may make a very small contribution to emotional experiences.  Strack seems to agree.

Quote “It was theoretically trivial,” says Strack, but his procedure was both clever and revealing, and it seemed to show, once and for all, that facial feedback worked directly on the brain, without the intervention of the conscious mind. Soon he was fielding calls from journalists asking if the pen-in-mouth routine might be used to cure depression. He laughed them off. There are better, stronger interventions, he told them, if you want to make a person happy.”

Strack may have been confident that his study would replicate because other publications used his manipulation and also reported significant results.  And researchers even proposed that the effect is strong enough to have practical implications in the real world.  One study even suggested that controlling facial expressions can reduce prejudice.

Quote: “Strack and Martin’s method would eventually appear in a bewildering array of contexts—and be pushed into the realm of the practical. If facial expressions could influence a person’s mental state, could smiling make them better off, or even cure society’s ills? It seemed so. In 2006, researchers at the University of Chicago showed that you could make people less racist by inducing them to smile—with a pen between their teeth—while they looked at pictures of black faces.”

The result is so robust that replicating it is a piece of cake, a walk in the park, and works even in classroom demonstrations.

“Indeed, the basic finding of Strack’s research—that a facial expression can change your feelings even if you don’t know that you’re making it—has now been reproduced, at least conceptually, many, many times. (Martin likes to replicate it with the students in his intro to psychology class.)”

Finally, Strack may have been wrong when he laughed off questions about curing depression with controlling facial muscles.  Apparently, it is much harder to commit suicide if you put a pen in your mouth to make yourself smile.

Quote: “In recent years, it has even formed the basis for the treatment of mental illness. An idea that Strack himself had scoffed at in the 1980s now is taken very seriously: Several recent, randomized clinical trials found that injecting patients’ faces with Botox to make their “frown lines” go away also helped them to recover from depression.”

So, here you have it. If you ignore publication bias and treat the mountain of confirmatory evidence with a 100% success rate in journals as credible evidence, there is little doubt that the results would replicate. Of course, by the same standard of evidence there is no reason to doubt that other priming studies would replicate, which they did until a group of skeptical researchers tried to replicate the results and failed to do so. 

Quote: “Strack found himself with little doubt about the field. “The direct influence of facial expression on judgment has been demonstrated many, many times,” he told me. “I’m completely convinced.” That’s why he volunteered to help the skeptics in that email chain three years ago. “They wanted to replicate something, so I suggested my facial-feedback study,” he said. “I was confident that they would get results, so I didn’t know how interesting it would be, but OK, if they wanted to do that? It would be fine with me.”

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PART II:  THE REPLICATION STUDY

The replication project was planned by EJ Wagenmakers, who made his name as a critic of research practices in social psychology in response to Bem’s (2011) incredible demonstration of feelings that predict random future events.  Wagenmakers believes that many published results are not credible because the studies failed to test theoretical predictions. Social psychologists would run many studies and publish results when they discovered a significant result with p < .05 (at least one-tailed).  When the results did not become significant the study was considered a failure and not reported.  This practice makes it difficult to predict which results are real and replicate and which results are not real and do not replicate.  Wagenmakers estimated that the facial feedback study had a 30% chance to replicate.

Quote “Personally, I felt that this one actually had a good chance to work,” he said. How good a chance? I gave it a 30-percent shot.” [Come again.  A good chance is 30%?]

A 30% probability may be justified because a replication project by the Open Science Collaboration found that only 25% of social psychological results were successfully replicated.  However, that project used only slightly larger samples than the original studies.  In the replication of the facial feedback hypothesis, 17 labs with larger samples than the original studies and nearly 2,000 participants in total were going to replicate the original study.  The increase in sample size increases the chances of producing a significant result even if the effect size of the original study was vastly inflated.  If a result is not significant with 2,000 participants, it becomes possible to say that the effect may actually not exist, or that the effect size is so small as to be practically meaningless and to have no relevance for the treatment of depression.  Thus, the prediction that there is only a 30% chance of success implies that Wagenmakers was very skeptical about the original results and expected a drastic reduction in the effect size.

Quote “In a sense, he was being optimistic. Replication projects have had a way of turning into train wrecks. When researchers tried to replicate 100 psychology experiments from 2008, they interpreted just 39 of the attempts as successful. In the last few years, Perspectives on Psychological Science has been publishing “Registered Replication Reports,” the gold standard for this type of work, in which lots of different researchers try to re-create a single study so the data from their labs can be combined and analyzed in aggregate. Of the first four of these to be completed, three ended up in failure.”

There were good reasons to be skeptical.  First, the facial feedback theory is controversial.  There are two camps in psychology.  One camp assumes that emotions are generated in the brain in direct response to cognitive appraisals of the environment.  Others have argued that emotional experiences are based on bodily feedback.  The controversy goes back to James versus Cannon and led to the famous Lazarus-Zajonc debate in the 1980s at the beginning of modern emotion research.  There is also the problem that it is statistically improbable that Strack et al. (1988) would get marginally significant results twice in a row in two independent studies of the same effect.  Sampling error makes p-values move around, and the chance of getting p < .10 and p > .05 twice in a row is slim.  This suggests that the evidence was partially obtained with a healthy dose of sampling error and that a replication study would produce weaker effect sizes.
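To see how slim that chance is, here is a minimal R sketch; the assumed true effect size and per-cell sample size are hypothetical and only serve to illustrate the point.

# Probability of obtaining a "marginally significant" result (.05 < p < .10, two-tailed)
# twice in a row, using a normal approximation to the test statistic.
d <- 0.4                              # assumed (hypothetical) true standardized effect size
n <- 30                               # assumed (hypothetical) participants per cell, two groups
ncp <- d * sqrt(n / 2)                # expected z-score
p.marginal <- pnorm(qnorm(.975), mean = ncp) - pnorm(qnorm(.95), mean = ncp)
round(c(one.study = p.marginal, twice.in.a.row = p.marginal^2), 3)
# roughly a 12% chance for one study and well under 2% for two in a row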

Quote: The work on facial feedback, though, had never been a target for the doubters; no one ever tried to take it down. Remember, Strack’s original study had confirmed (and then extended) a very old idea. His pen-in-mouth procedure worked in other labs.

Strack also gave some reasons in advance why the replication project might not straightforwardly reproduce his findings; for example, he claims that the original study did not produce a huge effect.

Quote “He acknowledged that the evidence from the paper wasn’t overwhelming—the effect he’d gotten wasn’t huge. Still, the main idea had withstood a quarter-century of research, and it hadn’t been disputed in a major, public way. “I am sure some colleagues from the cognitive sciences will manage to come up with a few nonreplications,” he predicted. But he thought the main result would hold.”

But that is wrong.  The study did produce a surprisingly huge effect.  It just didn’t produce strong evidence that this effect was caused by facial feedback rather than by problems with the randomized assignment of participants to conditions.  The sample sizes were so small that the large effect was only a bit more than 1.5 times its standard error, which is just enough to claim a discovery with p < .05 one-tailed, but not 2 times the standard error, which is needed to claim a discovery with p < .05 two-tailed.  So, the reported effect size was huge, but the strength of evidence was not.  Taking the reported effect size at face value, one would predict that only every other exact replication study would produce a significant result and the remaining studies would fail to replicate the result.  So even if 17 laboratories successfully replicated his study and the true effect size was as large as the effect size reported by Strack et al., only about half of the labs would be able to claim a successful replication.  As sample sizes were a bit larger in the replication studies, the percentage would be a bit higher, but clearly nobody should expect that all labs individually produce at least marginally significant results.  In fact, it is improbable that Strack was able to get two marginally significant results in his own two reported studies.
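The “only every other study” prediction follows from a simple power calculation; a minimal R sketch (the observed z-score of roughly 1.85 is an assumption standing in for a result with .05 < p < .10, two-tailed):

# If the true effect size equals the observed effect size, the expected z-score in an
# exact replication equals the observed z-score, and replication power is approximately:
z.obs <- 1.85                               # assumed z-score of a marginally significant result
power.exact.replication <- pnorm(z.obs - qnorm(.975))
round(power.exact.replication, 2)           # about .46, i.e., roughly every other study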

After several years of planning, collecting data, and analyzing the data, the results were reported.  Not a single lab had produced a significant result.  More important, even a combined analysis of the data from close to 2,000 participants showed no effect.  The effect size was close to zero.  In other words, there was no evidence that facial feedback had any influence on ratings of amusement in response to cartoons.  This is what researchers call an epic fail.  The study did not merely produce a significant result with a smaller effect size estimate; the effect just doesn’t appear to be there at all, although even with 2,000 participants it is not possible to say that the effect is exactly zero.  The results leave open the possibility that a very small effect may exist, but an even larger sample would be needed to test this hypothesis.  At the same time, the results are not inconsistent with the original results, because the original study had so much noise that the population effect size could have been close to zero.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

PART III: Response to the Replication Failure

We might think that Strack was devastated by the failure to replicate the most famous result of his research career.  However, he is rather unmoved by these results.

Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.

This is a bit odd, because earlier Strack assured us that he is not religious and trusts the scientific method: “I am always disturbed if people are divided into ‘believers’ and ‘nonbelievers.’ ” In science, he added, “the quality of arguments and their empirical examination should be the basis of discourse.”  So here we have two original studies with weak evidence for an effect and 17 studies with no evidence for the effect.  If we combine the information from all 19 studies, we have no evidence for an effect, and to believe in an effect even though 19 studies fail to provide scientific evidence for it seems a bit religious, although I would make a distinction between truly religious individuals, who know that they believe in something, and wanna-be scientists, who believe that they know something.  How does Strack justify his belief in an effect that just failed to replicate?  He refers to an article (a take-down) that he wrote himself and that, according to his own account, shows fundamental problems with the idea that failed replication studies provide meaningful information.  Apparently, only original studies provide meaningful information, and when replication studies fail to replicate the results of original studies, there must be a problem with the replication studies.

Quote: “Two years ago, while the replication of his work was underway, Strack wrote a takedown of the skeptics’ project with the social psychologist Wolfgang Stroebe. Their piece, called “The Alleged Crisis and the Illusion of Exact Replication,” argued that efforts like the RRR reflect an “epistemological misunderstanding,”

Accordingly, Bem (2011) did successfully demonstrate that humans (at least extraverted humans) can predict random events in the future and that learning after an exam can retroactively improve performance on the completed exam.  The fact that replication studies failed to reproduce these results only shows the epistemological misunderstanding that we can learn anything from replication studies conducted by skeptics.  So what is the problem with replication studies?

Quote: “Since it’s impossible to make a perfect copy of an old experiment. People change, times change, and cultures change, they said. No social psychologist ever steps in the same river twice. Even if a study could be reproduced, they added, a negative result wouldn’t be that interesting, because it wouldn’t explain why the replication didn’t work.”

We cannot reproduce exactly the same conditions of the original experiment.  But why is that important?  The same paradigm was allegedly used to reduce prejudice and to cure depression, in studies that are wildly different from the original studies.  It worked even then.  So why did it not work when the original study was replicated as closely as possible?  And why would we care in 2016 about a study that worked (marginally) in 92 undergraduate students at the University of Illinois in the 1980s?  We don’t.  For humans in 2016, the results of a study in 2015 are more relevant.  Maybe it worked back then, maybe it didn’t.  We will never know, but we do know that it typically did not work in 2015.  Maybe it will work again in 2017.  Who knows.  But we cannot claim that there has been good support for the facial feedback theory ever since Darwin came up with it.

But Strack goes further.  When he looks at the results of the replication studies, he does not see what the authors of the replication studies see. 

Quote: “So when Strack looks at the recent data he sees not a total failure but a set of mixed results.”

All 17 studies find no effect, and all studies are consistent with the hypothesis that there is no effect; the 95% confidence interval includes 0, which is also true for Strack’s original two studies.  How can somebody see mixed results in this consistent pattern of results?

Quote:  Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed?

He simply divides the studies post hoc into studies that produced a positive result and studies that produced a negative result.  There is no justification for this, because none of these studies is individually significantly different from any other, and the overall test shows that there is no heterogeneity; that is, the results are consistent with the hypothesis that the true population effect size is 0 and that all of the variability in effects across studies is just the random noise that is expected from studies with modest sample sizes.
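For readers who want to check this kind of claim themselves, here is a minimal R sketch of a standard heterogeneity (Cochran's Q) test; the lab-level effect sizes and standard errors below are hypothetical placeholders, not the actual RRR data.

# Minimal sketch of a heterogeneity (Cochran's Q) test across replication labs.
d  <- c(.10, -.05, .08, -.12, .02, .06, -.03)   # hypothetical lab-specific effect sizes
se <- c(.15,  .14, .16,  .15, .14, .15,  .16)   # hypothetical standard errors
w     <- 1 / se^2                               # inverse-variance weights
d.bar <- sum(w * d) / sum(w)                    # fixed-effect meta-analytic estimate
Q     <- sum(w * (d - d.bar)^2)                 # Cochran's Q statistic
p.het <- pchisq(Q, df = length(d) - 1, lower.tail = FALSE)
round(c(d.bar = d.bar, Q = Q, p.het = p.het), 3) # a large p-value indicates no evidence of heterogeneity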

Quote: “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. How could he turn his back on all that evidence?”

And with this final quote, Strack leaves the realm of scientific discourse and proper interpretation of empirical facts.  He is willing to disregard the results of a scientific test of the facial feedback hypothesis that he initially agreed to.  It is now clear why he agreed to it.  He never considered it a real test of his theory.  No matter what the results were going to be, he would maintain his belief in his couple of marginally significant results that are statistically improbable.  Social psychologists have, of course, studied how humans respond to negative information that challenges their self-esteem and world views.  Unlike facial feedback, these results are robust and not surprising.  Humans are prone to dismiss inconvenient evidence and to construct sometimes ridiculous arguments in order to prop up cherished false beliefs.  As such, Strack’s response to the failure of his most famous article is a successful demonstration that some findings in social psychology are replicable; it just so happens that Strack’s study is not one of these findings.

Strack comes up with several objections to the replication studies that show his ignorance about the whole project.  For example, he claims that many participants may have guessed the purpose of the study because the finding is now covered in textbooks.  However, the researchers who conducted the replication studies made sure that the data were collected before the finding was covered in class, and some universities do not cover it at all.  Moreover, just as in Laird’s study, participants who guessed the purpose were excluded.  A lot more participants were excluded because they didn’t hold the pen properly.  Of course, this should strengthen the effect, because the manipulation should not work when the wrong facial muscles are activated.

Strack even claims that the whole project lacked a research question.

Quote: “Strack had one more concern: “What I really find very deplorable is that this entire replication thing doesn’t have a research question.” It does “not have a specific hypothesis, so it’s very difficult to draw any conclusions,” he told me.”

This makes no sense.  Participants were randomly allocated to two conditions and a dependent variable was measured.  The hypothesis was that holding the pen in a way that elicits a smile leads to higher ratings of amusement than holding the pen in a way that leads to a frown.  The empirical question was whether this manipulation would have an effect, and this was assessed with a standard test of statistical significance.  The answer was that there was no evidence for the effect.  The research question was the same as in the original study.  If this is not a research question, then the original study also had no research question.

And finally, Strack makes the unscientific claim that it simply cannot be true that the reported studies all got it wrong.

Quote: The RRR provides no coherent argument, he said, against the vast array of research, conducted over several decades, that supports his original conclusion. “You cannot say these [earlier] studies are all p-hacked,” Strack continued, referring to the battery of ways in which scientists can nudge statistics so they work out in their favor. “You have to look at them and argue why they did not get it right.”

Scientific journals select studies that produced significant results.  As a result, all prior studies were published because they produced a significant (or at least marginally significant) result.  Given this selection for significance, there is no error control.  The number of successful replications in the published literature tells us nothing about the truth of a finding.  We do not have to claim that all studies were p-hacked.  We can just say that all studies were selected to be significant, and that is true and well known.  As a result, we do not know which results will replicate until we have conducted replication studies that do not select for significance.  This is what the RRR did.  As a result, it provides the first unbiased, real empirical test of the facial feedback hypothesis, and it failed.  That is science.  Ignoring it is not.

Daniel Engber’s closer inspection of the original article reveals further problems.

Quote: For the second version, Strack added a new twist. Now the students would have to answer two questions instead of one: First, how funny was the cartoon, and second, how amused did it make them feel? This was meant to help them separate their objective judgments of the cartoons’ humor from their emotional reactions. When the students answered the first question—“how funny is it?,” the same one that was used for Study 1—it looked as though the effect had disappeared. Now the frowners gave the higher ratings, by 0.17 points. If the facial feedback worked, it was only on the second question, “how amused do you feel?” There, the smilers scored a full point higher. (For the RRR, Wagenmakers and the others paired this latter question with the setup from the first experiment.) In effect, Strack had turned up evidence that directly contradicted the earlier result: Using the same pen-in-mouth routine, and asking the same question of the students, he’d arrived at the opposite answer. Wasn’t that a failed replication, or something like it?”

Strack dismisses this concern as well, but Daniel Engber is not convinced.

Quote:  “Strack didn’t think so. The paper that he wrote with Martin called it a success: “Study 1’s findings … were replicated in Study 2.”… That made sense, sort of. But with the benefit of hindsight—or one could say, its bias—Study 2 looks like a warning sign. This foundational study in psychology contained at least some hairline cracks. It hinted at its own instability. Why didn’t someone notice?

And nobody else should be convinced.  Fritz Strack is a prototypical example of a small group of social psychologists who have ruined social psychology by engaging in a game of publishing results that were consistent with theories of strong and powerful effects of stimuli on people’s behavior outside their awareness.  These results were attention-grabbing, just like annual returns of 20% would be eye-catching returns.  Many people invested in these claims on the basis of flimsy evidence that doesn’t even withstand scrutiny by a science journalist.  And to be clear, only a few of them went as far as to fabricate data.  But many others fabricated facts by publishing only studies that supported their claims while hiding evidence from studies that failed to show the effect.  Now we see what happens when these claims are subjected to real empirical tests that can succeed or fail.  Many of them fail.  For future generations it is not important why they did what they did and how they feel about it now.  What is important is that we realize that many results in textbooks are not based on solid evidence and that social psychology needs to change the way it conducts research if it wants to become a real science that builds on empirically verifiable facts.  Strack’s response to the RRR is what it is: a defensive reaction to evidence that his famous article was based on a false positive result.

How Can We Interpret Inferences with Bayesian Hypothesis Tests?

SUMMARY

In this blog post I show how it is possible to translate the results of a Bayesian hypothesis test into an equivalent frequentist statistical test that follows Neyman and Pearson’s approach to hypothesis testing, in which hypotheses are specified as ranges of effect sizes (critical regions) and observed effect sizes are used to make inferences about population effect sizes with known long-run error rates.

INTRODUCTION

The blog post also explains why it is misleading to consider Bayes Factors that favor the null-hypothesis (d = 0) over an alternative hypothesis (e.g., the Jeffreys prior) as evidence for the absence of an effect.  This conclusion is only warranted with infinite sample sizes.  With finite sample sizes, especially the small sample sizes that are typical in psychology, Bayes Factors in favor of H0 can only be interpreted as evidence that the population effect size is close to zero, but not as evidence that the population effect size is exactly zero.  How close to zero the effect sizes that are consistent with H0 are depends on the sample size and on the criterion value that is used to interpret the result of a study as sufficient evidence for H0.

One problem with Bayes Factors is that, like p-values, they are a continuous measure of evidence (Bayes Factors quantify relative likelihoods, p-values quantify probabilities), and the observed value by itself is not sufficient to justify an inference or interpretation of the data.  This is why psychologists moved from Fisher’s approach to Neyman and Pearson’s approach, which compares an observed p-value to a criterion value specified by convention or by pre-registration.  For p-values this criterion is alpha.  If p < alpha, we reject H0: d = 0 in favor of H1: there was a (positive or negative) effect.

Most researchers interpret Bayes Factors relative to some criterion value (e.g., BF > 3, BF > 5, or BF > 10).  These criterion values are just as arbitrary as the .05 criterion for p-values, and the only justification for these values that I have seen is that Jeffreys, who invented Bayes Factors, said so.  There is nothing wrong with a conventional criterion value, even if Bayesians think there is something wrong with p < .05 but use BF > 3 in just the same way; however, it is important to understand the implications of using a particular criterion value for an inference.  In NHST the criterion value has a clear meaning.  It means that in the long run, the rate of false inferences (deciding in favor of H1 when H1 is false) will not be higher than the criterion value.  With alpha = .05 as a conventional criterion, a research community decided that it is OK to have a maximum 5% error rate.  Unlike p-values, criterion values for Bayes Factors provide no information about error rates.  The best way to understand what a Bayes Factor of 3 means is that, if we assume that H0 and H1 are equally probable before we conduct a study, a Bayes Factor of 3 in favor of H0 makes it 3 times more likely that H0 is true than that H1 is true.  If we were gambling on results and the truth were known, we would increase our winning odds from 50:50 to 75:25.  With a Bayes Factor of 5, the winning odds increase to 5:1 (about 83:17).
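The betting-odds interpretation can be written down directly; a minimal R sketch (equal prior odds for H0 and H1 are the stated assumption):

# Posterior probability of H0 implied by a Bayes Factor in favor of H0,
# assuming H0 and H1 were equally probable before the study.
posterior.prob.H0 <- function(bf.H0, prior.odds = 1) {
  post.odds <- bf.H0 * prior.odds
  post.odds / (1 + post.odds)
}
posterior.prob.H0(3)    # 0.75, i.e., 75:25 odds
posterior.prob.H0(5)    # about 0.83, i.e., 5:1 odds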

HYPOTHESIS TESTING VERSUS EFFECT SIZE ESTIMATION

p-values and BFs also share another shortcoming: they provide information about the data given a hypothesis (or two hypotheses), but they do not describe the data themselves.  We all know that we should not report results as “X influenced Y, p < .05.”  The reason is that this statement provides no information about the effect size.  The effect size could be tiny, d = 0.02, small, d = .20, or large, d = .80.  Thus, it is now required to provide some information about raw or standardized effect sizes and ideally also about the amount of raw or standardized sampling error.  For example, standardized effect sizes could be reported as the standardized mean difference and its sampling error (d = .3, se = .15) or as a confidence interval (d = .3, 95% CI = 0 to .6).  This is important information about the actual data, but it does not by itself constitute a hypothesis test.  Thus, if the results of a study are used to test hypotheses, information about effect sizes and sampling errors has to be evaluated against specified criterion values that define which hypothesis is consistent with an observed effect size.
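For concreteness, here is a minimal R sketch of this kind of effect size reporting for a two-group design; the means, standard deviations, and sample sizes are hypothetical, and the variance formula for d is the usual large-sample approximation.

# Standardized mean difference (d), its sampling error, and a 95% confidence interval.
m1 <- 5.1; m2 <- 4.3; sd1 <- 1.6; sd2 <- 1.7; n1 <- 45; n2 <- 45   # hypothetical input values
sd.pooled <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))
d  <- (m1 - m2) / sd.pooled                          # standardized mean difference
se <- sqrt(1/n1 + 1/n2 + d^2 / (2 * (n1 + n2)))      # approximate sampling error of d
ci <- d + c(-1, 1) * qnorm(.975) * se                # 95% confidence interval
round(c(d = d, se = se, ci.low = ci[1], ci.high = ci[2]), 2)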

RELATING HYPOTHESIS TESTS TO EFFECT SIZE ESTIMATION

In NHST, it is easy to see how p-values are related to effect size estimation.  A confidence interval around the observed effect size is constructed by multiplying the amount of sampling error by  a factor that is defined by alpha.  The 95% confidence interval covers all values around the observed effect size, except the most extreme 5% values in the tails of the sampling distribution.  It follows that any significance test that compares the observed effect size against a value outside the confidence interval will produce a p-value less than the error criterion.
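This duality can be verified directly in R; a minimal sketch with simulated data (the simulated effect size and sample size are arbitrary choices):

# The 95% confidence interval contains exactly those null values that a two-sided
# t-test at alpha = .05 would not reject.
set.seed(123)
x  <- rnorm(50, mean = 0.3, sd = 1)      # simulated data with a hypothetical effect
ci <- t.test(x)$conf.int                 # 95% confidence interval for the mean
t.test(x, mu = ci[2] + 0.01)$p.value     # testing a value just outside the CI: p < .05
t.test(x, mu = ci[2] - 0.01)$p.value     # testing a value just inside the CI:  p > .05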

It is not so straightforward to see how Bayes Factors relate to effect size estimates.  Rouder et al. (2016) discuss a scenario in which the 95% credibility interval around the most likely effect size of d = .165 ranges from .055 to .275 and excludes zero.  Thus, an evaluation of the null-hypothesis, d = 0, in terms of a 95% credibility interval would lead to the rejection of the point-zero hypothesis.  We cannot conclude from this evidence that an effect is absent.  Rather, the most reasonable inference is that the population effect size is likely to be small, d ~ .2.  In this scenario, Rouder et al. obtained a Bayes Factor of 1.  This Bayes Factor does not support H0, but it also does not provide support for H1.  How is it possible that two Bayesian methods seem to produce contradictory results?  One method rejects H0: d = 0 and the other method shows no more support for H1 than for H0: d = 0.

Rouder et al. provide no answer to this question: “Here we have a divergence. By using posterior credible intervals, we might reject the null, but by using Bayes’ rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point” (p. 536).  Moreover, they suggest that Bayes Factors give the correct answer and that the rejection of d = 0 by means of credibility intervals is unwarranted: “…, but by using Bayes’ rule directly we see that this rejection is made prematurely as there is no decrease in the plausibility of the zero point. Updating with Bayes’ rule directly is the correct approach because it describes appropriate conditioning of belief about the null point on all the information in the data” (p. 536).

The problem with this interpretation of the discrepancy is that Rouder et al. (2009) misinterpret the meaning of a Bayes Factor as if it can be directly interpreted as a test of the null-hypothesis, d = 0.  However, in more thoughtful articles by the same authors, they recognize that (a) Bayes Factors only provide relative information about H0 in comparison to a specific alternative hypothesis H1, (b) the specification of H1 influences Bayes Factors, (c) alternative hypotheses that give a high a priori probability to large effect sizes favor H0 when the observed effect size is small, and (d) it is always possible to specify an alternative hypothesis (H1) that will not favor H0 by limiting the range of effect sizes to small effect sizes. For example, even with a small observed effect size of d = .165, it is possible to provide strong support for H1 and reject H0, if H1 is specified as Cauchy(0,0.1) and the sample size is sufficiently large to test H0 against H1.

[Figure 1: BF.N.r.Plot.png]
Figure 1 shows how Bayes Factors vary as a function of the specification of H1 and as a function of sample size, with the same observed effect size of d = .165.  It is possible to get a Bayes Factor greater than 3 in favor of H0 with a wide Cauchy(0,1) prior and a small sample size of N = 100, and a Bayes Factor greater than 3 in favor of H1 with a scaling factor of .4 or smaller and a sample size of N = 250.  In short, it is not possible to interpret Bayes Factors that favor H0 as evidence for the absence of an effect.  The Bayes Factor only tells us that the data are more consistent with H0 than with H1, and it is difficult to interpret this result because H1 is not a clearly specified alternative effect size.  H1 changes not only with the specification of the range of effect sizes, but also with sample size.  This property is not a design flaw of Bayes Factors.  They were designed to provide more and more stringent tests of H0: d = 0 that would eventually support H1 if the sample size is sufficiently large and H0: d = 0 is false.  However, if H0 is false and H1 includes many large effect sizes (an ultrawide prior), Bayes Factors will first favor H0, and data collection may stop before the Bayes Factor switches and provides the correct result that the population effect size is not zero.  This behavior of Bayes Factors was illustrated by Rouder et al. (2009) with a simulation of a population effect size of d = .02.
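The pattern in Figure 1 can be reproduced approximately with a short script.  The sketch below assumes the one-sample JZS (Cauchy-prior) Bayes factor of Rouder et al. (2009), computed by numerical integration; the exact values may differ slightly from the figure, which was produced with different code.

# Approximate JZS Bayes factor (H1: d ~ Cauchy(0, r) vs. H0: d = 0) for a one-sample
# design, following the integral representation in Rouder et al. (2009).
bf10.jzs <- function(t, N, r) {
  df <- N - 1
  m0 <- (1 + t^2 / df)^(-(df + 1) / 2)                 # marginal likelihood under H0
  f1 <- function(g) {                                  # integrand for the marginal under H1
    (1 + N * g * r^2)^(-1/2) *
      (1 + t^2 / ((1 + N * g * r^2) * df))^(-(df + 1) / 2) *
      (2 * pi)^(-1/2) * g^(-3/2) * exp(-1 / (2 * g))
  }
  integrate(f1, 0, Inf)$value / m0
}

d.obs <- .165                                          # observed effect size from the example
for (N in c(100, 250, 1000)) {
  for (r in c(1, .4, .1)) {
    bf10 <- bf10.jzs(t = d.obs * sqrt(N), N = N, r = r)
    cat(sprintf("N = %4d  r = %.1f  BF10 = %6.2f  BF01 = %6.2f\n", N, r, bf10, 1 / bf10))
  }
}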

 

[Figure: BFSmallEffect.png]
Here we see that the Bayes Factor favors H0 until sample sizes are above N = 5,000 and provides the correct information that the point hypothesis is false only with N = 20,000 or more.  To avoid confusion in the interpretation of Bayes Factors and to provide a better understanding of the actual regions of effect sizes that are consistent with H0 and H1, I developed simple R code that translates the results of a Bayesian hypothesis test into a Neyman-Pearson hypothesis test.

TRANSLATING RESULTS FROM A BAYESIAN HYPOTHESIS TEST INTO RESULTS FROM A NEYMAN PEARSON HYPOTHESIS TEST

A typical analysis with BF creates three regions.  One region of observed effect sizes is defined by BF > BF.crit in favor of H1 over H0.  A second region is defined by inconclusive BFs, with BF < BF.crit in favor of H0 and BF < BF.crit in favor of H1 (1/BF.crit < BF(H1/H0) < BF.crit).  The third region is defined by effect sizes between 0 and the effect size that matches the criterion BF > BF.crit in favor of H0.
The width and location of these regions depends on the specification of H1 (a wider or narrower distribution of effect sizes under the assumption that an effect is present), the sample size, and the long-run error rate, where an error is defined as a BF > BF.crit that supports H0 when H1 is true and vice versa.
I examined the properties of BF for two scenarios. In one scenario researchers specify H1 as a Cauchy(0,.4). The value of .4 was chosen because .4 is a reasonable estimate of the median effect size in psychological research. I chose a criterion value of BF.crit = 5 to maintain a relatively low error rate.
I used a one sample t-test with n = 25, 100, 200, 500, and 1,000. The same amount of sampling error would be obtained in a two-sample design with 4x the sample size (N = 100, 400, 800, 2,000, and 4,000).
     bf.crit    N       bf0  ci.low  border  ci.high      alpha
[1,]       5   25  2.974385      NA      NA    0.557         NA
[2,]       5  100  5.296013   0.035  0.1535    0.272  0.1194271
[3,]       5  200  7.299299   0.063  0.1300    0.197  0.1722607
[4,]       5  500 11.346805   0.057  0.0930    0.129  0.2106060
[5,]       5 1000 15.951191   0.048  0.0715    0.095  0.2287873
We see that the typical sample size in cognitive psychology with a within-subject design (n = 25) will never produce a result in favor of H0 and that it requires an observed effect size of d = .56 to produce a result in favor of H1.  This criterion is somewhat higher than the criterion effect size for p < .05 (two-tailed), which is d = .41, and approximately the same as the effect size needed with alpha = .01, d = .56.
With N = 100, it is possible to obtain evidence for H0.  If the observed effect size is exactly 0, BF = 5.296, and the maximum observed effect size that produces evidence in favor of H0 is d = .035.  The minimum effect size needed to support H1 is d = .272.  We can think about these two criterion values as the limits of a confidence interval around the effect size in the middle (d = .1535).  The width of this interval implies that, in the long run, we would make about 11% errors in favor of H0 and 11% errors in favor of H1 if the population effect size were d = .1535.  If we treat d = .1535 as the boundary for an interval null-hypothesis, H0: abs(d) < .1535, we do not make a mistake when the population effect size is less than .1535.  So, we can interpret a BF > 5 as evidence for H0: abs(d) < .15, with an 11% error rate.  The probability of supporting H0 when the population effect size is a statistically small d = .2 would be less than 11%.  In short, we can interpret BF > 5 in favor of H0 as evidence for abs(d) < .15 and BF > 5 in favor of H1 as evidence for H1: abs(d) > .15, with approximate error rates of 10% and a region of inconclusive evidence for observed effect sizes between d = .035 and d = .272.
The results for N = 200, 500, and 1,000 can be interpreted in the same way.  An increase in sample size shrinks the boundary effect size d.b that separates H0: |d| <= d.b and H1: |d| > d.b; in the limit it reaches zero and only d = 0 supports H0: |d| <= 0.  With N = 1,000, the boundary value is d.b = .048 and an observed effect size of d = .0715 provides sufficient evidence for H1.  However, the table also shows that the error rate increases.  In larger samples a BF of 5 in one direction or the other occurs more easily by chance, and the long-run error rate has doubled.  Of course, researchers could keep a fixed error rate by adjusting the BF criterion value, but Bayesian hypothesis tests are not designed to maintain a fixed error rate.  If this were a researcher’s goal, they could just specify alpha and use NHST to test H0: |d| < d.crit vs. H1: |d| > d.crit.
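A sketch of how such a translation can be computed: for a given sample size, prior scale, and BF criterion, search over observed effect sizes for the points at which the Bayes factor crosses the criterion.  The code below re-implements the JZS Bayes factor by numerical integration and only reports the two boundary effect sizes (not the error rates in the table), so the numbers are approximations that may differ slightly from the table above.

# For a given N, Cauchy prior scale, and BF criterion, find (a) the largest observed |d|
# that still yields BF01 > crit and (b) the smallest observed |d| that yields BF10 > crit.
bf10.jzs <- function(t, N, r) {
  df <- N - 1
  m0 <- (1 + t^2 / df)^(-(df + 1) / 2)
  f1 <- function(g) (1 + N * g * r^2)^(-1/2) *
    (1 + t^2 / ((1 + N * g * r^2) * df))^(-(df + 1) / 2) *
    (2 * pi)^(-1/2) * g^(-3/2) * exp(-1 / (2 * g))
  integrate(f1, 0, Inf)$value / m0
}

bf.regions <- function(N, r = .4, crit = 5, d.grid = seq(0, 1, by = .001)) {
  bf10 <- sapply(d.grid, function(d) bf10.jzs(d * sqrt(N), N, r))
  d.H0 <- d.grid[1 / bf10 > crit]                     # observed effect sizes favoring H0
  d.H1 <- d.grid[bf10 > crit]                         # observed effect sizes favoring H1
  c(N = N, bf0 = 1 / bf10[1],                         # BF01 when the observed effect size is 0
    ci.low  = if (length(d.H0)) max(d.H0) else NA,
    ci.high = if (length(d.H1)) min(d.H1) else NA)
}

round(t(sapply(c(25, 100, 200, 500, 1000), bf.regions)), 3)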
In practice, many researchers use a wider prior and a lower criterion value.  For example, EJ Wagenmakers prefers the original Jeffreys prior with a scaling factor of 1 and a criterion value of 3 as noteworthy (but not definitive) evidence.
The next table translates inferences with a Cauchy(0,1) and BF.crit = 3 into effect size regions.
     bf.crit    N       bf0  ci.low  border  ci.high      alpha
[1,]       3   25  6.500319   0.256  0.3925    0.529  0.2507289
[2,]       3  100 12.656083   0.171  0.2240    0.277  0.2986493
[3,]       3  200 17.812296   0.134  0.1680    0.202  0.3155818
[4,]       3  500 28.080784   0.094  0.1140    0.134  0.3274574
[5,]       3 1000 39.672827   0.071  0.0850    0.099  0.3290325

The main effect of using Cauchy(0,1) to specify H1 is that the border value that distinguishes H0 and H1 is higher. The main effect of using BF.crit = 3 as a criterion value is that it is easier to provide evidence for H0 or H1 at the expense of having a higher error rate.

It is now possible to provide evidence for H0 with a small sample of n = 25 in a one-sample t-test.  However, when we translate this finding into ranges of effect sizes, we see that the boundary between H0 and H1 is d = .39.  Any observed effect size below .256 yields a BF in favor of H0.  So, it would be misleading to interpret this finding as if a BF of 3 in a sample of n = 25 provides evidence for the point null-hypothesis d = 0.  It only shows that an observed effect size of d < .39 is more consistent with an effect size of 0 than with the effect sizes specified in H1, which places a lot of weight on large effect sizes.  As sample sizes increase, the meaning of BF > 3 in favor of H0 changes.  With N = 1,000, any observed effect size larger than .071 no longer provides a BF of 3 in favor of H0.  In the limit, with an infinite sample size, only d = 0 would provide evidence for H0 and we could infer that H0 is true.  However, BF > 3 in finite sample sizes does not justify this inference.

The translation of BF results into hypotheses about effect size regions makes it clear why BF results in small samples often seem to diverge from hypothesis tests with confidence intervals or credibility intervals. In small samples, BFs are sensitive to the specification of H1, and even if it is unlikely that the population effect size is 0 (0 is outside the confidence or credibility interval), the BF may show support for H0 because the observed effect size falls below the criterion value at which the BF favors H0. This inconsistency does not mean that different statistical procedures lead to different inferences. It only means that BF > 3 in favor of H0 RELATIVE TO H1 cannot be interpreted as a test of the hypothesis d = 0. It can only be interpreted as evidence for H0 relative to H1, and the specification of H1 influences which effect sizes provide support for H0.

CONCLUSION

Sir Arthur Eddington (cited by Cacioppo & Berntson, 1994) described a hypothetical
scientist who sought to determine the size of the various fish in the sea. The scientist began by weaving a 2-in. mesh net and setting sail across the seas, repeatedly sampling catches and carefully measuring, recording, and analyzing the results of each catch. After extensive sampling, the scientist concluded that there were no fish smaller than 2 in. in the sea.

The moral of this story is that a scientist's method influences the results. Scientists who use p-values to search for significant results in small samples will rarely discover small effects and may start to believe that most effects are large. Similarly, scientists who use Bayes-Factors with wide priors may delude themselves that they are searching for small and large effects and falsely believe that effects are either absent or large. In both cases, scientists make the same mistake. A small sample is like a net with large holes that can only (reliably) capture big fish. This is fine if the goal is to capture only big fish, but it is a problem when the goal is to find out whether a pond contains any fish at all. A net with big holes may never lead to the discovery of a fish in the pond, even though there are plenty of small fish in the pond.

Researchers therefore have to be careful when they interpret a Bayes Factor, and they should not interpret Bayes-Factors in favor of H0 as evidence for the absence of an effect. This fallacy is just as problematic as the fallacy of interpreting a p-value above alpha (p > .05) as evidence for the absence of an effect. Most researchers are aware that non-significant results do not justify the inference that the population effect size is zero. It may be news to some that a Bayes Factor in favor of H0 suffers from the same problem. A Bayes-Factor in favor of H0 is better considered a finding that rejects the specific alternative hypothesis that was pitted against d = 0. Falsification of this specific H1 does not justify the inference that H0: d = 0 is true. Another model that was not tested could still fit the data better than H0.

Bayes Ratios: A Principled Approach to Bayesian Hypothesis Testing

 

This post is a stub that will be expanded and eventually be turned into a manuscript for publication.

 

I have written a few posts before that are critical of Bayesian Hypothesis Testing with Bayes Factors (Rouder et al., 2009; Wagenmakers et al., 2010, 2011).

The main problem with this approach is that it compares a single effect size (typically 0) with an alternative hypothesis that is a composite of all other effect sizes. The alternative is typically specified as a weighted average of effect sizes, with a Cauchy distribution providing the weights. This leads to a comparison of H0: d = 0 vs. H1: d ~ Cauchy(0, r), with r being a scaling factor that specifies the median absolute effect size of the alternative hypothesis.

It is well recognized by critics and proponents of this test that the comparison of H0 and H1 favors H0 more and more as the scaling factor is increased.  This makes the test sensitive to the specification of H1.

Another problem is that Bayesian hypothesis testing either uses arbitrary cutoff values (BF > 3) to interpret the results of a study or asks readers to specify their own prior odds of H0 and H1. I have started to criticize this approach because the use of a subjective prior in combination with an objective specification of the alternative hypothesis can lead to false conclusions. If I compare H0: d = 0 with H1: d = .2, I am comparing two hypotheses that each specify a single value. If I am very uncertain about the results of a study, I can assign an equal prior probability to both effect sizes, and the prior odds of H0/H1 are .5/.5 = 1. In this case, a Bayes Factor can be directly interpreted as the posterior odds of H0 and H1 given the data.

Bayes Ratio (H0/H1) = Prior Odds (H0/H1) * Bayes Factor (H0/H1)

However, if I increase the range of possible effect sizes for H1 because I am uncertain about the actual effect size, the a priori probability of H1 increases, just like my odds of winning increase when I spread my bet over several possible outcomes (lottery numbers, horses in the Kentucky Derby, or numbers in a roulette game). Betting on effect sizes is no different, and the prior odds in favor of H1 increase the more effect sizes I consider plausible.

I therefore propose to use the prior distribution of effect sizes to specify my uncertainty about what could happen in a study. If I think the null-hypothesis is most likely, I can weight it more heavily than other effect sizes (e.g., with a Cauchy or normal distribution centered at 0). I can then use this distribution to compute (a) the prior odds of H0 and H1, and (b) the conditional probabilities of the observed test statistic (e.g., a t-value) given H0 and H1.
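The following sketch shows this computation in compact form; it is a simplified, integration-based version of the full grid-based R code at the end of this post, not that code itself. For illustration it assumes a one-sample design with N = 100, an observed t-value of 0, a Cauchy(0,1) prior, and H0 defined as |d| <= .1.

# Sketch: prior odds, Bayes Factor, and Bayes Ratio for H0: |d| <= .1 vs. H1: |d| > .1
N = 100                 # total sample size (one-sample design assumed)
se = 1/sqrt(N)          # sampling error of d
t.obs = 0               # observed t-value
b = .1                  # boundary between H0 and H1
lo = -3; hi = 3         # range of plausible effect sizes
prior = function(d) dcauchy(d, 0, 1)
joint = function(d) dt(t.obs, df = N - 1, ncp = d/se) * prior(d)
# prior mass in the H0 and H1 regions
A.H0 = integrate(prior, -b, b)$value
A.H1 = integrate(prior, lo, -b)$value + integrate(prior, b, hi)$value
prior.odds.01 = A.H0/A.H1
# average likelihood of the observed t-value under H0 and under H1
m.H0 = integrate(joint, -b, b)$value / A.H0
m.H1 = (integrate(joint, lo, -b)$value + integrate(joint, b, hi)$value) / A.H1
BF01 = m.H0/m.H1
Bayes.Ratio01 = prior.odds.01 * BF01
c(prior.odds.01, BF01, Bayes.Ratio01)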

Instead of interpreting Bayes Factors directly, which is not Bayesian and confuses the conditional probabilities of the data given the hypotheses with the conditional probabilities of the hypotheses given the data, Bayes-Factors are multiplied with the prior odds to get Bayes Ratios, which many Bayesians consider to be the answer to the question researchers actually want to answer: how much should I believe in H0 or H1 after I have collected data and computed a test statistic like a t-value?

This approach is more principled and Bayesian than the use of Bayes Factors with arbitrary cut-off values that are easily misinterpreted as evidence for H0 or H1.

One reason why this approach may not have been used before is that H0 is often specified as a point-value (d = 0) and the a priori probability of a single point effect size is 0.  Thus, the prior odds (H0/H1) are zero and the Bayes Ratio is also zero.  This problem can be avoided by restricting H1 to a reasonably small range of effect sizes and by specifying the null-hypothesis as a small range of effect sizes around zero.  As a result, it becomes possible to obtain non-zero prior odds for H0 and to obtain interpretable Bayes Ratios.

The inferences based on Bayes Ratios are not only more principled than those based on Bayes Factors, they are also more in line with the inferences that one would draw on the basis of other methods that can be used to test H0 and H1, such as confidence intervals or Bayesian credibility intervals.

For example, imagine a researcher who wants to provide evidence for the null-hypothesis that there are no gender differences in intelligence. The researcher decides a priori that differences of less than 1.5 IQ points (0.1 standard deviations) are small enough to be consistent with the null-hypothesis. He collects data from 50 men and 50 women and finds a mean difference of 3 IQ points in one or the other direction (conveniently, it does not matter in which direction).

With a standardized mean difference of d = 3/15 = .2 and a sampling error of SE = 2/sqrt(100) = .2, the t-value is t = .2/.2 = 1. A t-value of 1 is not statistically significant. Thus, it is clear that the data do not provide evidence against H0 that there are no gender differences in intelligence. However, do the data provide sufficient positive evidence for the null-hypothesis? p-values are not designed to answer this question. The 95%CI around the observed standardized effect size is -.19 to .59. This confidence interval is wide. It includes 0, but it also includes d = .2 (a small effect size) and d = .5 (a moderate effect size), which would translate into a difference of 7.5 IQ points. Based on this finding, it would be questionable to interpret the data as support for the null-hypothesis.
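For readers who want to check these numbers, the computations are simple:

# The numbers in this example, step by step
d  = 3/15                    # 3 IQ points on a scale with SD = 15
se = 2/sqrt(100)             # sampling error for two groups with total N = 100
t  = d/se                    # .2/.2 = 1
ci = d + c(-1.96, 1.96)*se   # 95%CI, roughly -.19 to .59
c(d, se, t, ci)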

With a default specification of the alternative hypothesis as a Cauchy distribution with a scaling factor of 1, the Bayes-Factor (H0/H1) favors H0 over H1 by 4.95:1. The most appropriate interpretation of this finding is that the prior odds should be updated by a factor of about 5:1 in favor of H0, whatever these prior odds are. However, following Jeffreys, many users who compute Bayes-Factors interpret them directly with reference to Jeffreys's criterion values, and a value greater than 3 can be and has been used to suggest that the data provide support for the null-hypothesis.

This interpretation ignores that the a priori distribution of effect sizes allocates only a small probability (p = .07) to H0 and a much larger probability to H1 (p = .93). When the Bayes Factor is combined with the prior odds (H0/H1) of .07/.93 = .075, the resulting Bayes Ratio shows that support for H0 has increased, but that it is still more likely that H1 is true than that H0 is true: .075 * 4.95 = .37. This conclusion is consistent with the finding that the 95%CI overlaps with the region of effect sizes for H0 (d = -.1 to .1) but also extends well beyond it.
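The prior probability of H0 can be computed directly from the Cauchy distribution function; the values in the text are based on a discretized, range-limited version of the prior, so this sketch reproduces them only approximately.

# Sketch: prior odds of H0: |d| < .1 under a Cauchy(0,1) prior, combined with the BF
p.H0 = pcauchy(.1, 0, 1) - pcauchy(-.1, 0, 1)  # about .06-.07
p.H1 = 1 - p.H0
prior.odds = p.H0/p.H1
BF01 = 4.95                                    # Bayes Factor reported above
prior.odds * BF01                              # Bayes Ratio, still well below 1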

We can increase the prior odds of H0 by restricting the range of effect sizes that are plausible under H1. For example, we can limit effect sizes to a maximum of 1, or we can set the scaling parameter of the Cauchy distribution to .5, so that 50% of the distribution falls into the range between d = -.5 and d = .5.

The t-value and 95%CI remain unchanged because they do not require a specification of H1.  By cutting the range of effect sizes for H1 roughly in half (from scaling parameter 1 to .5), the Bayes-Factor in favor of H0 is also cut roughly in half and is no longer above the criterion value of 3, BF (H0/H1) = 2.88.

The change in the alternative hypothesis has the opposite effect on the prior odds. The prior probability of H0 nearly doubles (p = .13) and the prior odds are now .13/.87 = .15. The resulting Bayes Ratio in favor of H0, .15 * 2.88 = .45, is similar to, and in fact slightly stronger than, the Bayes Ratio with the wider Cauchy distribution (.37). However, both Bayes Ratios lead to the same conclusion, which is also consistent with the observed effect size, d = .2, and the confidence interval around it, d = -.19 to d = .59. That is, given the small sample size, the observed effect size provides insufficient information to draw any firm conclusions about H0 or H1. More data are required to decide empirically which hypothesis is more likely to be true.

The example used an arbitrary observed effect size of d = .2. Evidently, effect sizes much larger than this would lead to the rejection of H0 with p-values, confidence intervals, Bayes Factors, or Bayes Ratios. A more interesting question is what the results would look like if the observed effect size provided maximum support for the null-hypothesis, that is, an observed effect size of 0, which also produces a t-value of 0. With the default prior of Cauchy(0,1), the Bayes-Factor in favor of H0 is 9.42, which is close to the next criterion value of BF > 10 that is sometimes used to stop data collection because the results are considered decisive. However, the Bayes Ratio is still slightly in favor of H1, BR (H1/H0) = 1.42. The 95%CI ranges from -.39 to .39, which is much wider than the criterion range of effect sizes from -.1 to .1. Thus, the Bayes Ratio shows that even an observed effect size of 0 in a sample of N = 100 provides insufficient evidence to infer that the null-hypothesis is true.

When we increase the sample size to N = 2,000, the 95%CI around d = 0 ranges from -.09 to .09. This finding means that the data support the null-hypothesis and that we would make a mistake in inferences that use this approach in no more than 5% of our tests (not just those that provide evidence for H0, but all tests that use this approach). The Bayes-Factor also favors H0, with a massive BF (H0/H1) = 711.27. The Bayes-Ratio also favors H0, with a Bayes-Ratio of 53.35. As the probabilities of H0 and H1 are complementary, p(H0) + p(H1) = 1, we can compute the probability of H0 being true with the formula BR(H0/H1) / (BR(H0/H1) + 1), which yields a probability of 98%. We see that the Bayes-Ratio is consistent with the information provided by the confidence interval. The long-run error frequency for inferring H0 from the data is less than 5%, and the probability of H1 being true given the data is 1 - .98 = .02.
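The conversion from a Bayes Ratio to a posterior probability is a one-liner:

BR01 = 53.35
BR01/(BR01 + 1)   # probability that H0 is true given the data, about .98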

Conclusion

Bayesian hypothesis testing has received increased interest among empirical psychologists, especially in situations where researchers aim to demonstrate the lack of an effect. Increasingly, researchers use Bayes-Factors with criterion values to claim that their data provide evidence for the null-hypothesis. This is wrong for two reasons.

First, it is impossible to test a hypothesis that is specified as one effect size out of an infinite number of alternative effect sizes. Researchers appear to believe that a Bayes Factor in favor of H0 can be used to suggest that all other effect sizes are implausible. This is not the case, because Bayes Factors do not compare H0 to each of the other effect sizes. They compare H0 to a composite hypothesis of all other effect sizes, and Bayes Factors depend on the way the composite is created. Falsification of one composite does not ensure that the null-hypothesis is true (the only viable hypothesis still standing), because other composites can still fit the data better than H0.

Second, the use of Bayes-Factors with criterion values ignores the a priori odds of H0 and H1. A full Bayesian inference requires taking the prior odds into account and computing posterior odds or Bayes Ratios. The problem for the point null-hypothesis (d = 0) is that the prior odds of H0 over H1 are 0. The reason is that the prior distribution of effect sizes adds up to 1 (the true effect size has to be somewhere), leaving zero probability for the single point d = 0. It is possible to compute Bayes-Factors for d = 0 because Bayes-Factors use densities. For the computation of Bayes Factors the distinction between densities and probabilities is not important, but for the computation of prior odds the distinction is important. A single effect size has a density on the Cauchy distribution, but it has zero probability.

The fundamental inferential problem of Bayes-Factors that compare against H0: d = 0 can be avoided by specifying H0 as a small region around d = 0. It is then possible to compute prior odds based on the area under the curve for H0 and the area under the curve for H1. It is also possible to compute Bayes Factors for H0 and H1 when H0 and H1 are specified as complementary regions of effect sizes. The two ratios can be multiplied to obtain a Bayes Ratio. Furthermore, Bayes Ratios can be converted into the probability of H0 given the data and the probability of H1 given the data. The results of this test are consistent with other approaches to testing regional null-hypotheses, and they are robust to misspecifications of the alternative hypothesis that allocate too much weight to large effect sizes. Thus, I recommend Bayes Ratios for principled Bayesian hypothesis testing.

 

*************************************************************************

R-Code for the analyses reported in this post.

*************************************************************************

#######################
### set input
#######################

### What is the total sample size?
N = 2000

### How many groups?  One sample or two sample?
gr = 2

### what is the observed effect size
obs.es = 0

### Set the range for H0, H1 is defined as all other effect sizes outside this range
H0.range = c(-.1,.1)  #c(-.2,.2) # 0 for classic point null

### What is the limit for the maximum absolute effect size? (d = 14 corresponds to r = .99)
limit = 14

### What is the mode of the a priori distribution of effect sizes?
mode = 0

### What is the variability (SD for normal, scaling parameter for Cauchy) of the a priori distribution of effect sizes?
var = 1

### What is the shape of the a priori distribution of effect sizes
shape = "Cauchy"  # Uniform, Normal, or Cauchy; Uniform uses the limit above

### End of Input
### R computes Likelihood ratios and Weighted Mean Likelihood Ratio (Bayes Factor)
prec = 100 #set precision, 100 is sufficient for 2 decimal
df = N-gr
se = gr/sqrt(N)
pop.es = mode
if (var > 0) pop.es = seq(-limit*prec,limit*prec)/prec
weights = 1
if (var > 0 & shape == "Cauchy") weights = dcauchy(pop.es,mode,var)
if (var > 0 & shape == "Normal") weights = dnorm(pop.es,mode,var)
if (var > 0 & shape == "Uniform") weights = dunif(pop.es,-limit,limit)
H0.mat = cbind(0,1)
H1.mat = cbind(mode,1)
if (var > 0) H0.mat = cbind(pop.es,weights)[pop.es >= H0.range[1] & pop.es <= H0.range[2],]
if (var > 0) H1.mat = cbind(pop.es,weights)[pop.es < H0.range[1] | pop.es > H0.range[2],]
H0.mat = matrix(H0.mat,,2)
H1.mat = matrix(H1.mat,,2)
H0 = sum(dt(obs.es/se,df,H0.mat[,1]/se)*H0.mat[,2])/sum(H0.mat[,2])
H1 = sum(dt(obs.es/se,df,H1.mat[,1]/se)*H1.mat[,2])/sum(H1.mat[,2])
BF10 = H1/H0
BF01 = H0/H1
Pr.H0 = sum(H0.mat[,2]) / sum(weights)
Pr.H1 = sum(H1.mat[,2]) / sum(weights)
PriorOdds = Pr.H1/Pr.H0
Bayes.Ratio10 = PriorOdds*BF10
Bayes.Ratio01 = 1/Bayes.Ratio10
### R creates output file
text = c()
text[1] = paste0('The observed t-value with d = ',obs.es,' and N = ',N,' is t(',df,') = ',round(obs.es/se,2))
text[2] = paste0('The 95% confidence interval is ',round(obs.es-1.96*se,2),' to ',round(obs.es+1.96*se,2))
text[3] = paste0('Weighted Mean Density(H0: d >= ',H0.range[1],' & <= ',H0.range[2],') = ',round(H0,5))
text[4] = paste0('Weighted Mean Density(H1: d < ',H0.range[1],' | > ',H0.range[2],') = ',round(H1,5))
text[5] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H0/H1: ',round(BF01,2))
text[6] = paste0('Weighted Mean Likelihood Ratio (Bayes Factor) H1/H0: ',round(BF10,2))
text[7] = paste0('The a priori odds of H1/H0 are ',round(Pr.H1,2),'/',round(Pr.H0,2),' = ',round(PriorOdds,2))
text[8] = paste0('The Bayes Ratio (H1/H0) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio10,2))
text[9] = paste0('The Bayes Ratio (H0/H1) (Prior Odds x Bayes Factor) is ',round(Bayes.Ratio01,2))
### print output
text

 

 

 

 

Lottery

How Does Uncertainty about Population Effect Sizes Influence the Probability that the Null-Hypothesis is True?

There are many statistical approaches that are often divided into three schools of thought: (a) Fisherian, (b) Neyman-Pearsonian, and (c) Bayesian. This post is about Bayesian statistics. Within Bayesian statistics, further distinctions can be made. One distinction is between Bayesian parameter estimation (credibility intervals) and Bayesian hypothesis testing. This post is about Bayesian hypothesis testing. One goal of Bayesian hypothesis testing is to provide evidence for the null-hypothesis. It is often argued that Bayesian Null-Hypothesis Testing (BNHT) is superior to the widely used method of null-hypothesis testing with p-values. This post is about the ability of BNHT to test the null-hypothesis.

The crucial idea of BNHT is that it is possible to contrast the null-hypothesis (H0) with an alternative hypothesis (H1) and to compute the relative likelihood that the data support one hypothesis versus the other: p(H0/D) / p(H1/D). If this ratio is large enough (e.g., p(H0/D) / p(H1/D) > criterion), it can be stated that the data support the null-hypothesis more than the alternative hypothesis.

To compute the ratio of the two conditional probabilities, researchers need to quantify two ratios. One ratio is the prior ratio of the probabilities that H0 or H1 is true: p(H0)/p(H1). This ratio does not have a common name; I call it the probability ratio (PR). The other ratio is the ratio of the conditional probabilities of the data given H0 and H1. This ratio is often called a Bayes Factor (BF): BF = p(D/H0)/p(D/H1).

To make claims about H0 and H1 based on some observed test statistic,  the Probability Ratio has to be multiplied with the Bayes Factor.

p(H0/D) / p(H1/D) = [p(H0) * p(D/H0)] / [p(H1) * p(D/H1)] = PR * BF
The main reason for calling this approach Bayesian is that Bayesian statisticians are willing and required to specify the a priori probabilities of hypotheses before any data are collected. In the formula above, p(H0) and p(H1) are the a priori probabilities that a population effect size is 0, p(H0), or that it is some other value, p(H1). However, in practice BNHT is often used without specifying these a priori probabilities.

“Table 1 provides critical t values needed for JZS Bayes factor values of 1/10, 1/3, 3, and 10 as a function of sample size. This table is analogous in form to conventional t-value tables for given p value criteria. For instance, suppose a researcher observes a t value of 3.3 for 100 observations. This t value favors the alternative and corresponds to a JZS Bayes factor less than 1/10 because it exceeds the critical value of 3.2 reported in the table. Likewise, suppose a researcher observes a t value of 0.5. The corresponding JZS Bayes factor is greater than 10 because the t value is smaller than 0.69, the corresponding critical value in Table 1. Because the Bayes factor is directly interpretable as an odds ratio, it may be reported without reference to cutoffs such as 3 or 1/10. Readers may decide the meaning of odds ratios for themselves” (Rouder et al., 2009).

The use of arbitrary cutoff values (3 or 10) for Bayes Factors is not a complete Bayesian analysis because it does not provide information about the hypotheses given the data. Bayes Factors alone only provide information about the ratio of the conditional probabilities of the data given the two hypotheses, and the two ratios are not equivalent:

p(H0/D) / p(H1/D)  ≠  p(D/H0) / p(D/H1)

In practice, users of BNHT are unaware of, or ignore, the need to think about the base rates of H0 and H1 when they interpret Bayes Factors. The main point of this post is to demonstrate that Bayes Factors that compare the null-hypothesis of a single effect size against an alternative hypothesis that combines many effect sizes (all effect sizes that are not zero) can be deceptive, because the ratio p(H0)/p(H1) decreases as the number of effect sizes increases. In the limit, the a priori probability that the null-hypothesis is true is zero. No data can then provide evidence for it, because any Bayes-Factor multiplied with zero prior odds yields zero, and it remains reasonable to believe in the alternative hypothesis no matter how strongly a Bayes Factor favors the null-hypothesis.

The following urn experiment explains the logic of my argument, points out a similar problem in the famous Monty Hall problem, and provides R-code to run simulations with different assumptions about the number and distribution of effect sizes and their implications for the probability ratio of H0 and H1 and for the Bayes Factors that are needed to provide evidence for the null-hypothesis.

An Urn Experiment of Population Effect Sizes

The classic examples in statistics are urn experiments. An urn is filled with balls of different colors. If the urn is filled with 100 balls and only one ball is red, and you get one chance to draw a ball from the urn without peeking, the probability of drawing the red ball is 1 out of 100, or 1%.

To think straight about statistics and probabilities it is helpful, unless you are some math genius who can really think in 10 dimensions, to remind yourself that even complicated probability problems are essentially urn experiments. The question is only what the urn experiment would look like.

In this post, I examine the urn experiment that corresponds to the Bayesian statistician's problem of specifying probabilities of effect sizes without any information that would help to guess which effect size is most likely.

To translate the Bayesian problem of the prior into an urn experiment, we first have to turn effect sizes into balls. The problem is that effect sizes are typically continuous, but an urn can only be filled with discrete objects. The solution to this problem is to cut the continuous range of effect sizes into discrete units. The number of units depends on the desired precision. For example, effect sizes can be measured in standardized units with one decimal (d = 0, d = .1, d = .2, etc.), with two decimals (d = .00, d = .01, d = .02, etc.), or with 10 decimals. The more precise the measurement, the more discrete events are created. Instead of using colors, we can use balls with numbers printed on them, as you may have seen in lottery draws. In psychology, theories and empirical studies are often not very precise, and it would hardly be meaningful to distinguish between an effect size of d = .213 and an effect size of d = .214. Even two decimals are rarely needed, and the typical sampling error in psychological studies of d = .20 would make it impossible to distinguish between d = .33 and d = .38 empirically. So, it makes sense to translate the continuous range of effect sizes into balls with one-decimal numbers, d = .0, d = .1, d = .2, and so on.

The second problem is that effect sizes can be positive or negative.  This is not really a problem because some balls can have negative numbers printed on them.  However, the example can be generalized from the one-sided scenario with only positive effect sizes to a two-sided scenario that also includes negative effects. To keep things simple, I use only positive effect sizes in this example.

The third problem is that some effect size measures are unlimited. However, in practice it is unreasonable to expect very large effect sizes and it is possible to limit the range of possible effect sizes at a maximum value.  The limit could be d = 10, d = 5, or d = 2.  For this example, I use a limit of d = 2.

It is now possible to translate the continuous measure of standardized effect sizes into 21 discrete events and to fill the urn with 21 balls that have the numbers 0, 0.1, 0.2, …, 2.0 printed on them.

The main point of Bayesian inference is to draw conclusions about the probability that a particular hypothesis is true given the results of an empirical study.  For example, how probable is it that the null-hypothesis is true when I observe an effect size of d = .2?  However, a study only provides information about the data given a specific hypothesis. How probable is it to observe an effect size of d = .2, if the null-hypothesis were true?  To answer the first question, it is necessary to specify the probability that the hypothesis is true independent of any data; that is, how probable is it that the null-hypothesis is true?

P(pop.es=0/obs.es = .2) = P(pop.es=0) * P(Obs.ES=.2/Pop.ES=0) / P(Obs.ES = .20)

This looks scary, and for this post you do not need to understand the complete formula, but it is just a mathematical way of saying that the probability that the population effect size (pop.es) is zero when the observed effect size (obs.es) is d = .2 equals the unconditional probability that the population effect size is zero, multiplied by the conditional probability of observing an effect size of d = .2 when the population effect size is 0, divided by the unconditional probability of observing an effect size of d = .2.

I only show this formula to highlight the fact that the main goal of Bayesian inference is to estimate the probability of a hypothesis (in this case, pop.es = 0) given some observed data (in this case, obs.es = .20) and that researchers need to specify the unconditional probability of the hypothesis (pop.es = 0) to do so.

We can now return to the urn experiment and ask how likely it is that a particular hypothesis is true. For example, how likely is it that the null-hypothesis is true? That is, how likely is it that we end up with a ball that has the number 0.0 printed on it when we conduct a study with an unknown population effect size? The answer is: it depends. It depends on how the urn was filled. We of course do not know how often the null-hypothesis is true, but we can fill the urn in a way that expresses maximum uncertainty about the probability that the null-hypothesis is true. Maximum uncertainty means that all possible events are equally likely (Bayesian statisticians use a so-called uniform prior when the range of possible outcomes is fixed). So, we can fill the urn with one ball for each of the 21 effect sizes (0.0, 0.1, 0.2, …, 2.0). Now it is fairly easy to determine the a priori probability that the null-hypothesis is true. There are 21 balls and you are drawing one ball from the urn. Thus, the a priori probability that the null-hypothesis is true is 1/21 = .047.

As noted before, if the range of events increases because we specify a wider range of effect sizes (say effect sizes up to 10), the a priori probability of drawing the ball with 0.0 printed on it decreases. If we specify effect sizes with more precision (e.g., two digits), the probability of drawing the ball that has 0.00 printed on it decreases further.  With effect sizes ranging from 0 to 10 and being specified with two digits, there are 1001 balls in the urn and the probability of drawing the ball with 0.00 printed on it is 0.001.  Thus, even if the data would provide strong support for the null-hypothesis, the proper inference has to take into account that a priori it is very unlikely that a randomly drawn study had an effect size of 0.00.

As effect sizes are continuous and theoretically can range from minus infinity to plus infinity, there is an infinite number of effect sizes, and the probability of drawing a ball with 0 printed on it from an infinitely large urn that is filled with an infinite number of balls is zero (1/infinity). This would suggest that it is meaningless to test whether the null-hypothesis is true, because we already know the answer: the probability is zero. As any number that is multiplied by 0 is zero, the posterior probability that the population effect size is zero remains zero, no matter how strongly the data favor the null-hypothesis. Of course, this is also true for any other hypothesis about a single effect size greater than zero. The probability that the effect size is exactly d = .2 is also 0. The implication is simply that it is not possible to empirically test hypotheses when the range of effect sizes is cut into an infinite number of pieces, because the a priori probability that the effect size has a specific value is always 0. This problem can be solved by limiting the number of balls in the urn, so that we avoid the problem of drawing from an infinitely large urn with an infinite number of balls.

Bayesians solve the infinity problem by using mathematical functions. A commonly used function was proposed by Jeffreys, who suggested specifying uncertainty about effect sizes with a Cauchy distribution with a scaling parameter of 1. Figure 1 shows the distribution.

[Figure 1: JeffreyPrior.png — Cauchy(0,1) prior distribution of population effect sizes]

The figure is cut off at effect sizes smaller than -10 and larger than 10, and it assumes that effect sizes are measured with two decimals. With this unit, the densities can be converted into percentages. The effect sizes in the range between -10 and 10 cover only 93.66% of the full distribution; the remaining 6.34% are in the tails below -10 and above 10. As you can see, the distribution is not uniform. It actually gives the highest probability to an effect size of 0. The probability density for an effect size of 0 is 0.32, which translates into a probability of 0.32% with two decimals as the unit for effect sizes. By eliminating the extreme effect sizes beyond -10 and 10, the probability of the null-hypothesis increases slightly from 0.32% to 0.32/93.66*100 = 0.34%. With two decimals, there are 2001 effect sizes (-10, -9.99, …, -0.01, 0, 0.01, …, 9.99, 10). A uniform prior would put the probability of a single effect size at 1/2001 = 0.05%. This shows that Jeffreys's prior gives a higher probability to the null-hypothesis, but it also does so for other small effect sizes close to zero. The probability density of an effect size of d = 0.01 (0.31827) is only slightly smaller than the density of the null-hypothesis (0.3183).
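These numbers can be reproduced in R; the only input is the unit of .01 used in this paragraph.

# Prior probability of d = 0 under a Cauchy(0,1) prior with effect sizes in units of .01
unit = .01
p.H0 = dcauchy(0, 0, 1)*unit                        # about .0032, i.e., 0.32%
in.range = pcauchy(10, 0, 1) - pcauchy(-10, 0, 1)   # about 93.66% of the distribution
p.H0/in.range                                       # about .0034, i.e., 0.34%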

If we translate Jeffreys's prior for effect sizes with two decimals into an urn experiment and fill the urn proportionally to the distribution in Figure 1 with 10,000 balls, 34 balls would have the number 0.00 printed on them. When we draw one ball from the urn, the probability of drawing one of the 34 balls with 0.00 is 34/10,000 = 0.0034 or 0.34%.

Bayesian statisticians do not use probability densities to specify the probability that the population effect size is zero, possibly because probability densities do not translate into probabilities without specifying a unit. However, by treating effect sizes as a continuous variable, the number of balls in the lottery becomes infinite and the probability of drawing a ball with 0.0000000000000000 printed on it is practically zero. A reasonable alternative is to specify a reasonable unit for effect sizes. As noted earlier, for many psychological applications a reasonable unit is a single decimal (d = 0, d = .1, d = .2, etc.). This implies that effect sizes between d = -.05 and d = .05 are essentially treated as 0.

Given Jeffreys's distribution, the rational specification of the a priori odds that the effect size is 0 rather than some other value between -10 and 10 is

P(pop.es = 0) / P(pop.es ≠ 0) = 0.32 / (9.37 – 0.32) ≈ 1/28
To draw statistical inferences, Bayesian null-hypothesis tests use the Bayes-Factor. Without going into details here, a Bayes-Factor provides the complementary ratio of the conditional probabilities of the data given the null-hypothesis and the alternative hypothesis. It is not uncommon to use a Bayes-Factor of 3 or greater as support for one of the two hypotheses. However, if we take the prior probabilities of these hypotheses into account, a Bayes-Factor of 3 does not justify a belief in the null-hypothesis, nor is it sufficiently strong to overcome the low probability that the null-hypothesis is true given the large uncertainty about effect sizes. A Bayes-Factor of 3 would change the prior odds of 1/28 into posterior odds of 3/28 = .11. Thus, it is still unlikely that the effect size is zero. A Bayes-Factor of 28 in favor of H0 would be needed to make it equally likely that the null-hypothesis is true and that it is not true, and to assert that the null-hypothesis is true with a probability of 90%, the Bayes-Factor would have to be about 255; 255/28 ≈ 9 = .90/.10.
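A two-line check of this arithmetic:

prior.odds = .32/(9.37 - .32)   # roughly 1/28, as derived above
(.90/.10)/prior.odds            # BF(H0/H1) needed for a 90% probability of H0, roughly 255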

It is possible to further decrease the number of balls in the lottery. For example, it is possible to set the unit to 1. This gives only 21 effect sizes (-10, -9, -8, …, -1, 0, 1, …, 8, 9, 10). The probability density of .32 now translates into a probability of .32, versus a probability of .68 for all other effect sizes. After adjusting for the range restriction, this translates into a ratio of about 1.95 to 1 in favor of the alternative. Thus, a Bayes-Factor of 3 would favor the null-hypothesis, and it would only require a Bayes-Factor of 18 to obtain a probability of .90 that H0 is true, 18/1.95 ≈ 9 = .90/.10. However, it is important to realize that with a unit of 1 the null-hypothesis covers effect sizes in the range from -.5 to .5. This wide range covers effect sizes that are typical for psychology and are commonly called small or moderate effects. As a result, this is not a practical solution, because the test no longer really tests the hypothesis that there is no effect.

In conclusion, Jeffreys proposed a rational approach to specifying the probability of population effect sizes without any data and without prior information about effect sizes. He proposed a prior distribution that covers a wide range of effect sizes. The cost of working with this prior distribution under maximum uncertainty is that a wide range of effect sizes is considered plausible. This means that there are many possible events, and the probability of any single event is small. Jeffreys's prior makes it possible to quantify this probability as a function of the density of an effect size and the precision with which effect sizes are measured (number of decimals). This probability should be used to evaluate Bayes-Factors. Contrary to existing norms, Bayes-Factors of 3 or 10 cannot be used to claim that the data favor the null-hypothesis over the alternative hypothesis, because this interpretation of Bayes-Factors ignores that, without further information, it is more likely that the null-hypothesis is false than that it is correct. It seems unreasonable to assign equal probabilities to two events when one event is akin to drawing a single red ball from an urn and the other event is to draw any of the remaining balls. As the number of balls in the urn increases, these probabilities become more and more unequal. Any claim that the null-hypothesis is equally or more probable than other effect sizes would have to be motivated by prior information, which would invalidate the use of Jeffreys's distribution of effect sizes, which was developed for a scenario where prior information is not available.

Postscript or Part II

One of the most famous urn experiments in probability theory is the Monty Hall problem.

Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?

I am happy to admit that I got this problem wrong. I was not alone. In a public newspaper column, Marilyn vos Savant responded that it is advantageous to switch because the probability of winning after switching is 2/3, whereas sticking to your guns and staying with the initial choice has only a 1/3 chance of winning.

This column received 10,000 responses, including 1,000 from readers with a Ph.D. who argued that the chances are 50:50. This example shows that probability theory is hard even when you are formally trained in math or statistics. The problem is to match the actual problem to the appropriate urn experiment. Once the correct urn experiment has been chosen, it is easy to compute the probability.

Here is how I solved the Monty Hall problem for myself. I increased the number of doors from 3 to 1,000. Again, I have a choice to pick one door. My chance of picking the correct door at random is now 1/1000 or 0.001. Everybody can see that it is very unlikely that I picked the correct door by chance. If 1,000 doors do not help, try 1,000,000 doors. Let's assume I picked a door with a goat, which has a probability of 999/1000 or 99.9%. Now the game show host will open 998 other doors with goats, and the only door that he does not open is the door with the car. Should I switch? If intuition is not sufficient for you, try the math. There is a 99.9% probability that I picked a door with a goat, and in this case the door the host leaves closed has the car, so switching wins with certainty. There is a 1/1000 = 0.1% probability that I picked the door with the car, and only in this case does staying win. So, you have a 0.1% chance of winning if you stay and a 99.9% chance of winning if you switch.

The situation is the same when you have three doors.  There is a 2/3 chance that you randomly pick a door with a goat. Now the gameshow host opens the only other door with a goat and the other door must have the car.  If you picked the door with the car, the game show host will open one of the two doors with a goat and the other door still has a goat behind it.  So, you have a 2/3 chance of winning if you switch and a 1/3 chance of winning when you stay.
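For readers who prefer simulation over verbal arguments, a quick Monty Hall simulation (my own sketch, not part of the original argument) confirms the 2/3 vs. 1/3 split:

# Monty Hall simulation: switching wins whenever the first pick was a goat
set.seed(1)
n.games = 10000
car  = sample(1:3, n.games, replace = TRUE)   # door hiding the car
pick = sample(1:3, n.games, replace = TRUE)   # contestant's first pick
c(stay = mean(pick == car), switch = mean(pick != car))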

What does all of this have to do with Bayesian statistics? There is a similarity between the Monty Hall problem and Bayesian statistics. If we would consider only two effect sizes, say d = 0 and d = .2, we would have an equal probability that either one is the correct effect size without looking at any data and without prior information. The odds of the null-hypothesis being true versus the alternative hypothesis being true would be 50:50. However, there are many other effect sizes that are not being considered. In Bayesian hypothesis testing, these non-null effect sizes are combined in a single alternative hypothesis that the effect size is not 0 (e.g., d = .1, d = .2, d = .3, etc.). If we limit our range of effect sizes to effect sizes between -10 and 10 and specify effect sizes with one-decimal precision, we end up with 201 effect sizes; one effect size is 0 and the other effect sizes are not zero. The goal is to find the actual population effect size by collecting data and by conducting a Bayesian hypothesis test. If you do find the correct population effect size, you win a Nobel Prize; if you are wrong, you get ridiculed by your colleagues. Bayesian null-hypothesis tests proceed like a Monty Hall game show by picking one effect size at random. Typically, this effect size is 0. They could have picked any other effect size at random, but Bayes-Factors are typically used to test the null-hypothesis. After collecting some data, the data provide information that increases the probability of some effect sizes and decreases the probability of other effect sizes. Imagine an illuminated display of the 201 effect sizes in which the game show host turns some effect sizes green or red. Even Bayesians would abandon their preferred, randomly chosen effect size of 0 if it turned red. However, let's consider a scenario where 0 and 20 other effect sizes (e.g., 0.1, 0.2, 0.3, etc.) are still green. Now the game show host gives you a choice. You can either stay with 0 or you can pick all other 20 effect sizes that are flashing green. You are allowed to pick all 20 because they are combined in a single alternative hypothesis that the effect size is not zero. It doesn't matter what the effect size is; it only matters that it is not zero. Bayesians who simply look at the Bayes Factor (what the data say) and accept the null-hypothesis ignore that the null-hypothesis is only one out of several effect sizes that are compatible with the data, and they ignore that a priori it is unlikely that they picked the correct effect size when they pitted a single effect size against all others.

Why would Bayesians do such a crazy thing, when it is clear that you have a much better chance of winning if you can bet on 20 out of 21 effect sizes rather than 1 out of 21 and the winning odds for switching are 20:1?

Maybe they suffer from a problem similar to that of the many people who vehemently argued that the correct answer to the Monty Hall problem is 50:50. The reason for this argument is simply that there are two doors: it does not matter how we got there; now that we are facing the final decision, we are left with two choices. The same illusion may occur when we express Bayes-Factors as odds for two hypotheses and ignore the asymmetry between them: one hypothesis consists of a single effect size and the other hypothesis consists of all other effect sizes.

They may forget that in the beginning they picked zero at random from a large set of possible effect sizes and that it is very unlikely that they picked the correct effect size. This part of the problem is fully ignored when researchers compute Bayes-Factors and interpret them directly. This is not even Bayesian, because Bayes' theorem explicitly requires specifying the probability of the randomly chosen null-hypothesis to draw valid inferences. This is actually the main point of Bayes' theorem. Even when the data favor the null-hypothesis, we have to consider the a priori probability that the null-hypothesis is true (i.e., the base rate of the null-hypothesis). Without a value for p(H0) there is no Bayesian inference. One solution is to simply assume that H0 and H1 are equally likely. In this case, a Bayes-Factor that favors the randomly chosen effect size would mean it is rational to stay with it. However, the 50:50 ratio does not make sense, because it is a priori more likely that one of the effect sizes of the alternative hypothesis is the right one. Therefore, it is better to switch and reject the null-hypothesis. In this sense, Bayesians who interpret Bayes-Factors without taking the base rate of H0 into account are not Bayesian, and they are likely to end up being losers in the game of science, because they will often conclude in favor of an effect size simply because they randomly picked it from a wide range of effect sizes.

################################################################
# R-Code to compute the ratio p(H0)/p(H1) and the BF required to change p(H0/D)/p(H1/D) to a ratio of 9:1 (90% probability that H0 is true).
################################################################

# set the scaling factor
scale = 1
# set the number of units / precision
precision = 5
# set upper limit of effect sizes
high = 3
# get lower limit
low = -high
# create effect sizes
x = seq(low,high,1/precision)
# compute number of effect sizes
N.es = length(x)
# get densities for each effect size
y = dcauchy(x,0,scale)
# draw pretty picture
curve(dcauchy(x,0,scale),low,high,xlab='Effect Size',main="Jeffreys Prior Distribution of Population Effect Sizes")
segments(0,0,0,dcauchy(0,0,scale),col='red',lty=3)
# get the density for effect size of 0 (lazy way)
H0 = max(y) / sum(y)
# get the density of all other effect sizes
H1 = 1-H0
text(0,H0,paste0('Density = ',H0),pos=4)
# compute a priori ratio of H1 over H0
PR = H1/H0
# set desired posterior probability that H0 is true
PH0 = .90
# Bayes-Factor in favor of H0 needed to reach this belief: posterior odds (PH0/(1-PH0)) times prior odds of H1 over H0 (PR)
BF = PR * PH0/(1-PH0)
BF

library(BayesFactor)

N = 0
try = 0  # BF(H0/H1) for the current sample size
while (try < BF) {
N = N + 50
try = 1/exp(ttest.tstat(t=0, n1=N, n2=N, rscale = scale)[['bf']])
}
try
N

dec = 3
res = paste0("If standardized mean differences (Cohen's d) are measured in intervals of d = ",1/precision," and are limited to effect sizes between ",low," and ",high)
res = paste0(res,", there are ",N.es," effect sizes. With a uniform prior, the chance of picking the correct effect size ")
res = paste0(res,"at random is p = 1/",N.es," = ",round(1/N.es,dec),". With the Cauchy(x,0,1) distribution, the probability of H0 is ")
res = paste0(res,round(H0,dec)," and the probability of H1 is ",round(H1,dec),". To obtain a probability of .90 in favor of H0, the data have to produce a Bayes Factor of ")
res = paste0(res,round(BF,dec), " in favor of H0. It is then possible to accept the null-hypothesis that the effect size is ")
res = paste0(res,"0 +/- ",round(.5/precision,dec),". ",N*2," participants are needed in a between-subject design with an observed effect size of 0 to produce this Bayes Factor.")
print(res)

[Figure: SBTT.Normal.SD50]

Subjective Bayesian T-Test Code

########################################################

rm(list=ls()) #will remove ALL objects

##############################################################
# Bayes-Factor Calculations for T-tests
##############################################################

#Start of Settings

### Give a title for results output
Results.Title = 'Normal(x,0,.5) N = 100 BS-Design, Obs.ES = 0'

### Criterion for Inference in Favor of H0, BF (H1/H0)
BF.crit.H0 = 1/3

### Criterion for Inference in Favor of H1
#set z.crit.H1 to Infinity to use Bayes-Factor, BF(H1/H0)
BF.crit.H1 = 3
z.crit.H1 = Inf

### Set Number of Groups
gr = 2

### Set Total Sample size
N = 100

### Set observed effect size
### for between-subject designs and one sample designs this is Cohen’s d
### for within-subject designs this is dz
obs.es = 0

### Set the mode of the alternative hypothesis
alt.mode = 0

### Set the variability of the alternative hypothesis
alt.var = .5

### Set the shape of the distribution of population effect sizes
alt.dist = 2  #1 = Cauchy; 2 = Normal

### Set the lower bound of population effect sizes
### Set to zero if there is zero probability to observe effects with the opposite sign
low = -3

### Set the upper bound of population effect sizes
### For example, set to 1, if you think effect sizes greater than 1 SD are unlikely
high = 3

### set the precision of density estimation (bigger takes longer)
precision = 100

### set the graphic resolution (higher resolution takes longer)
graphic.resolution = 20

### set limit for non-central t-values
nct.limit = 100

################################
# End of Settings
################################

# compute degrees of freedom
df = (N - gr)

# get range of population effect sizes
pop.es=seq(low,high,(1/precision))

# compute sampling error
se = gr/sqrt(N)

# limit population effect sizes based on non-central t-values
pop.es = pop.es[pop.es/se >= -nct.limit & pop.es/se <= nct.limit]

# function to get weights for Cauchy or Normal Distributions
get.weights = function(pop.es,alt.dist,p) {
# alt.dist: 1 = Cauchy, 2 = Normal; the third argument is not used
if (alt.dist == 1) w = dcauchy(pop.es,alt.mode,alt.var)
if (alt.dist == 2) w = dnorm(pop.es,alt.mode,alt.var)
return(w)
}

# get weights for population effect sizes
weights = get.weights(pop.es,alt.dist,precision)

#Plot Alternative Hypothesis
Title = "Alternative Hypothesis"
ymax = max(max(weights)*1.2,1)
plot(pop.es,weights,type='l',ylim=c(0,ymax),xlab="Population Effect Size",ylab="Density",main=Title,col='blue',lwd=3)
abline(v=0,col='red')

#create observations for plotting of prediction distributions
obs = seq(low,high,1/graphic.resolution)

# Get distribution for observed effect size assuming H1
H1.dist = as.numeric(lapply(obs, function(x) sum(dt(x/se,df,pop.es/se) * weights)/precision))

#Get Distribution for observed effect sizes assuming H0
H0.dist = dt(obs/se,df,0)

#Compute Bayes-Factors for Prediction Distribution of H0 and H1
BFs = H1.dist/H0.dist

#Compute z-scores (strength of evidence against H0)
z = qnorm(pt(obs/se,df,log.p=TRUE),log.p=TRUE)

# Compute H1 error rate rate
BFpos = BFs
BFpos[z < 0] = Inf
if (z.crit.H1 == Inf) z.crit.H1 = abs(z[which(abs(BFpos-BF.crit.H1) == min(abs(BFpos-BF.crit.H1)))])
ncz = qnorm(pt(pop.es/se,df,log.p=TRUE),log.p=TRUE)
weighted.power = sum(pnorm(abs(ncz),z.crit.H1)*weights)/sum(weights)
H1.error = 1-weighted.power

#Compute H0 Error Rate
z.crit.H0 = abs(z[which(abs(BFpos-BF.crit.H0) == min(abs(BFpos-BF.crit.H0)))])
H0.error = (1-pnorm(z.crit.H0))*2

# Get density of the observed effect size (converted to a t-value) assuming H0
Density.Obs.H0 = dt(obs.es/se,df,0)

# Get density for observed effect size assuming H1
Density.Obs.H1 = sum(dt(obs.es/se,df,pop.es/se) * weights)/precision

# Compute Bayes-Factor for observed effect size
BF.obs.es = Density.Obs.H1 / Density.Obs.H0

#Compute z-score for observed effect size
obs.z = qnorm(pt(obs.es/se,df,log.p=TRUE),log.p=TRUE)

#Show Results
ymax=max(H0.dist,H1.dist)*1.3
plot(type='l',z,H0.dist,ylim=c(0,ymax),xlab="Strength of Evidence (z-value)",ylab="Density",main=Results.Title,col='black',lwd=2)
par(new=TRUE)
plot(type='l',z,H1.dist,ylim=c(0,ymax),xlab="",ylab="",col='blue',lwd=2)
abline(v=obs.z,lty=2,lwd=2,col='darkgreen')
abline(v=-z.crit.H1,col='blue',lty=3)
abline(v=z.crit.H1,col='blue',lty=3)
abline(v=-z.crit.H0,col='red',lty=3)
abline(v=z.crit.H0,col='red',lty=3)
points(pch=19,c(obs.z,obs.z),c(Density.Obs.H0,Density.Obs.H1))
res = paste0('BF(H1/H0): ',format(round(BF.obs.es,3),nsmall=3))
text(min(z),ymax*.95,pos=4,res)
res = paste0('BF(H0/H1): ',format(round(1/BF.obs.es,3),nsmall=3))
text(min(z),ymax*.90,pos=4,res)
res = paste0('H1 Error Rate: ',format(round(H1.error,3),nsmall=3))
text(min(z),ymax*.80,pos=4,res)
res = paste0('H0 Error Rate: ',format(round(H0.error,3),nsmall=3))
text(min(z),ymax*.75,pos=4,res)

######################################################
### END OF Subjective Bayesian T-Test CODE
######################################################
### Thank you to Jeff Rouder for posting his code that got me started.
### http://jeffrouder.blogspot.ca/2016/01/what-priors-should-i-use-part-i.html

 

Bayes

Wagenmakers’ Default Prior is Inconsistent with the Observed Results in Psychological Research

Bayesian statistics is like all other statistics. A bunch of numbers are entered into a formula and the end result is another number.  The meaning of the number depends on the meaning of the numbers that enter the formula and the formulas that are used to transform them.

The input for a Bayesian inference is no different than the input for other statistical tests.  The input is information about an observed effect size and sampling error. The observed effect size is a function of the unknown population effect size and the unknown bias introduced by sampling error in a particular study.

Based on this information, frequentists compute p-values and some Bayesians compute a Bayes-Factor. The Bayes Factor expresses how compatible an observed test statistic (e.g., a t-value) is with each of two hypotheses. Typically, the observed t-value is compared to the distribution of t-values expected under the assumption that H0 is true (the population effect size is 0, and t-values follow a central t-distribution) and to the distribution expected under an alternative hypothesis. The alternative hypothesis assumes that the effect size is in a range from minus infinity to infinity, which of course is true. To make this a workable alternative hypothesis, H1 assigns weights to these effect sizes. Effect sizes with bigger weights are assumed to be more likely than effect sizes with smaller weights. A weight of 0 would mean that, a priori, these effect sizes cannot occur.

As Bayes-Factors depend on the weights attached to effect sizes, it is also important to realize that the support for H0 depends on the probability that the prior distribution was a reasonable distribution of probable effect sizes. It is always possible to get a Bayes-Factor that supports H0 with an unreasonable prior.  For example, an alternative hypothesis that assumes that an effect size is at least two standard deviations away from 0 will not be favored by data with an effect size of d = .5, and the BF will correctly favor H0 over this improbable alternative hypothesis.  This finding would not imply that the null-hypothesis is true. It only shows that the null-hypothesis is more compatible with the observed result than the alternative hypothesis. Thus, it is always necessary to specify and consider the nature of the alternative hypothesis to interpret Bayes-Factors.

Although the a priori probabilities of  H0 and H1 are both unknown, it is possible to test the plausibility of priors against actual data.  The reason is that observed effect sizes provide information about the plausible range of effect sizes. If most observed effect sizes are less than 1 standard deviation, it is not possible that most population effect sizes are greater than 1 standard deviation.  The reason is that sampling error is random and will lead to overestimation and underestimation of population effect sizes. Thus, if there were many population effect sizes greater than 1, one would also see many observed effect sizes greater than 1.

To my knowledge, proponents of Bayes-Factors have not attempted to validate their priors against actual data. This is especially problematic when priors are presented as defaults that require no further justification for a specification of H1.

In this post, I focus on Wagenmakers’ prior because Wagenmakers has been a prominent advocate of Bayes-Factors as an alternative to conventional null-hypothesis significance testing. Wagenmakers’ prior is a Cauchy distribution with a scaling factor of 1. This scaling factor implies a 50% probability that absolute effect sizes are larger than 1 standard deviation. This prior was used to argue that Bem’s (2011) evidence for PSI was weak. It has also been used in many other articles to suggest that the data favor the null-hypothesis. These articles fail to point out that the interpretation of Bayes-Factors in favor of H0 is only valid for Wagenmakers’ prior; a different prior could have produced different conclusions. Thus, it is necessary to examine whether Wagenmakers’ prior is a plausible prior for psychological science.

Wagenmakers’ Prior and Replicability

A prior distribution of effect sizes makes assumptions about population effect sizes. In combination with information about sample size, it is possible to compute non-centrality parameters, which are equivalent to the population effect size divided by sampling error. For each non-centrality parameter it is possible to compute power as the area under the curve of the non-central t-distribution to the right of the criterion value that corresponds to alpha, typically .05 (two-tailed). The typical power implied by the prior is simply the weighted average of the power values for the individual non-centrality parameters.

Replicability is not identical to power for a set of studies with heterogeneous non-centrality parameters because studies with higher power are more likely to become significant. Thus, the set of studies that achieved significance has higher average power than the original set of studies.

Aside from power, the distribution of observed test statistics is also informative. Unlike power, which is bounded at 1, the distribution of test statistics is unbounded. Thus, unreasonable assumptions about the distribution of effect sizes become visible in a distribution of test statistics that does not match the distributions observed in actual studies. One problem is that test statistics are not directly comparable across different sample sizes or statistical tests because non-central distributions vary as a function of degrees of freedom and the test being used (e.g., chi-square vs. t-test). To solve this problem, all test statistics can be converted into z-scores so that they are on a common metric. In a heterogeneous set of studies, the sign of the effect provides no useful information because signs only have to be consistent in tests of the same population effect size. As a result, it is necessary to use absolute z-scores. These absolute z-scores can be interpreted as the strength of evidence against the null-hypothesis.
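
A minimal sketch of this conversion, using hypothetical test statistics: each result is first converted into a two-tailed p-value and then into an absolute z-score on a common metric.

# convert different test statistics into absolute z-scores via two-tailed p-values
p.t   = 2 * pt(2.5, df = 38, lower.tail = FALSE)    # t(38) = 2.50
p.chi = pchisq(6.5, df = 1, lower.tail = FALSE)     # chi-square(1) = 6.50
qnorm(1 - c(p.t, p.chi) / 2)                        # absolute z-scores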

I used a sample size of N = 80 and assumed a between-subject design. In this case, sampling error is 2/sqrt(80) = .224. A sample size of N = 80 is the median sample size in Psychological Science. It is also the total sample size that would be obtained in a 2 x 2 ANOVA with n = 20 per cell. Power and replicability estimates would increase for within-subject designs and for studies with larger N; between-subject designs with smaller N would yield lower estimates.

I simulated effect sizes in the range from 0 to 4 standard deviations.  Effect sizes of 4 or larger are extremely rare. Excluding these extreme values means that power estimates underestimate power slightly, but the effect is negligible because Wagenmakers’ prior assigns low probabilities (weights) to these effect sizes.

For each possible effect size in the range from 0 to 4 (using a resolution of d = .001)  I computed the non-centrality parameter as d/se.  With N = 80, these non-centrality parameters define a non-central t-distribution with 78 degrees of freedom.
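
In R, this setup can be written as follows.

N   = 80                    # total sample size, between-subject design
se  = 2 / sqrt(N)           # sampling error of d (= .224)
es  = seq(0, 4, by = .001)  # grid of population effect sizes
ncp = es / se               # non-centrality parameters
df  = N - 2                 # degrees of freedom (= 78)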

I computed the implied power to achieve a significant result with alpha = .05 (two-tailed) with the formula

power = pt(qt(1 - .025, N - 2), N - 2, ncp, lower.tail = FALSE)

The formula returns the area under the curve on the right side of the criterion value that corresponds to a two-tailed test with p = .05.

The mean of these power values is the average power of studies if all effect sizes were equally likely.  The value is 89%. This implies that in the long run, a random sample of studies drawn from this population of effect sizes is expected to produce 89% significant results.

However, Wagenmakers’ prior assumes that smaller effect sizes are more likely than larger effect sizes. Thus, it is necessary to compute the weighted average of power using Wagenmakers’ prior distribution as weights. The weights were obtained as the density of a Cauchy distribution with a scaling factor of 1 at each effect size.

wagenmakers.weights = dcauchy(es,0,1)

The weighted average power was computed as the sum of the weighted power estimates divided by the sum of weights. The weighted average power is 69%. This estimate implies that Wagenmakers’ prior assumes that 69% of statistical tests produce a significant result when the null-hypothesis is false.

Replicability is always higher than power because the subset of studies that produce a significant result has higher average power than the full set of studies. Replicability for a set of studies with heterogeneous power is the sum of the squared power of individual studies divided by the sum of power.

Replicability = sum(power^2) / sum(power)

The unweighted estimate of replicability is 96%. To obtain the replicability implied by Wagenmakers’ prior, the same weighting scheme as for power can be used.

Wagenmakers.Replicability = sum(wagenmakers.weights * power^2) / sum(wagenmakers.weights * power)

The formula shows that Wagenmakers’ prior implies a replicability of 89%. The weighting scheme has relatively little effect on the estimate of replicability because many of the studies with small effect sizes are expected to produce a non-significant result, whereas the large effect sizes often have power close to 1, which implies that they will be significant in the original study and in the replication study.
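
The following sketch ties these steps together, repeating the setup from above so that it runs on its own. It should reproduce the approximate values reported in this section (89% unweighted power, 69% weighted power, 96% unweighted replicability, 89% weighted replicability), with small deviations possible due to the grid approximation.

N   = 80
se  = 2 / sqrt(N)
es  = seq(0, 4, by = .001)
ncp = es / se
df  = N - 2

# power for a two-tailed test with alpha = .05 (right-tail rejection region only;
# the left-tail region is negligible for positive effect sizes)
power = pt(qt(1 - .025, df), df, ncp, lower.tail = FALSE)

# Cauchy(0, 1) densities as weights for Wagenmakers' prior
wagenmakers.weights = dcauchy(es, 0, 1)

mean(power)                                                            # unweighted average power, ~.89
sum(wagenmakers.weights * power) / sum(wagenmakers.weights)            # weighted average power, ~.69
sum(power^2) / sum(power)                                              # unweighted replicability, ~.96
sum(wagenmakers.weights * power^2) / sum(wagenmakers.weights * power)  # weighted replicability, ~.89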

The success rate of replication studies is difficult to estimate. Cohen estimated that typical studies in psychology have 50% power to detect a medium effect size of d = .5. This would imply that the actual success rate is lower because in an unknown percentage of studies the null-hypothesis is true. However, replicability would be higher than power because studies with higher power are more likely to be significant. Given this uncertainty, I used a scenario with a 50% success rate; that is, an unbiased sample of studies taken from psychological journals would produce 50% successful replications in exact replication studies of the original studies. The following computations show the implications of a 50% success rate in replication studies for the proportion of hypothesis tests in which the null-hypothesis is true, p(H0).

The percentage of true null-hypotheses is a function of the success rate in replication studies, the weighted average power, and the weighted replicability.

p(H0) = (weighted.average.power * (weighted.replicability - success.rate)) /
(success.rate * .05 - success.rate * weighted.average.power - .05^2 + weighted.average.power * weighted.replicability)

To produce a success rate of 50% in replication studies with Wagenmakers’ prior when H1 is true (89% replicability), the percentage of true null-hypotheses has to be 92%.
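
Plugging in the values used here (69% weighted power, 89% weighted replicability, 50% success rate) reproduces this number.

weighted.average.power = .69
weighted.replicability = .89
success.rate           = .50
p.H0 = (weighted.average.power * (weighted.replicability - success.rate)) /
       (success.rate * .05 - success.rate * weighted.average.power - .05^2 +
        weighted.average.power * weighted.replicability)
p.H0   # ~ .92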

The high percentage of true null-hypotheses (92%) also has implications for the implied false-positive rate (i.e., the percentage of significant results for which the null-hypothesis is true). With p.H0 denoting the proportion of true null-hypotheses, the false-positive rate is

False.Positive.Rate = (p.H0 * .05) / (p.H0 * .05 + (1 - p.H0) * weighted.average.power)

For every 100 studies, there are 92 true null-hypotheses that produce 92 * .05 = 4.6 false-positive results. For the remaining 8 studies with a true effect, there are 8 * .69 = 5.5 true discoveries. The false-positive rate is therefore 4.6 / (4.6 + 5.5) ≈ 45%. This means that, with Wagenmakers’ prior, a success rate of 50% in replication studies implies that nearly half of all significant results are false positives that would not be expected to replicate consistently.
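
The same numbers give the implied false-positive rate.

p.H0  = .92   # proportion of true null-hypotheses from the previous step
power = .69   # weighted average power when an effect is present
(p.H0 * .05) / (p.H0 * .05 + (1 - p.H0) * power)   # false-positive rate, ~.45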

Aside from these analytically derived predictions about power and replicability, Wagenmakers’ prior also makes predictions about the distribution of observed evidence in individual studies. As observed scores are influenced by sampling error, I used simulations to illustrate the effect of Wagenmakers’ prior on observed test statistics.

For the simulation, I converted the non-central t-values into non-central z-scores and simulated sampling error with a standard normal distribution. The simulation included 92% true null-hypotheses and 8% true effects drawn from Wagenmakers’ prior. As published results suffer from publication bias, I simulated publication bias by selecting only observed absolute z-scores greater than 1.96, which corresponds to the p < .05 (two-tailed) significance criterion. The simulated data were submitted to a powergraph analysis that estimates power and replicability based on the distribution of absolute z-scores.
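
A simplified version of this simulation (without the powergraph estimation step) looks as follows; the proportions and the selection rule are the ones described above, and the number of simulated studies is arbitrary.

set.seed(123)
k  = 100000                 # number of simulated studies (arbitrary)
N  = 80
se = 2 / sqrt(N)
# 92% true null-hypotheses, 8% effect sizes drawn from the Cauchy(0, 1) prior
is.null = runif(k) < .92
d = ifelse(is.null, 0, rcauchy(k, 0, 1))
# observed absolute z-scores: true non-centrality plus standard normal sampling error
z.obs = abs(d / se + rnorm(k))
# publication bias: keep only significant results
z.pub = z.obs[z.obs > 1.96]
length(z.pub) / k           # proportion of significant results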

Figure 1 shows the results. First, the estimation method slightly underestimated the actual replicability of 50%, by 2 percentage points. Despite this slight estimation error, the figure accurately illustrates the implications of Wagenmakers’ prior for observed distributions of absolute z-scores. The density function shows a steep decrease in the range of z-scores between 2 and 3, and a gentle slope for z-scores from 4 to 10 (values greater than 10 are not shown).

Powergraphs provide some information about the composition of the total density by dividing it into densities for power less than 20%, 20–50%, 50–85%, and more than 85%. The red line (power < 20%) mostly determines the shape of the total density function for z-scores from 2 to 2.5, and most of the remaining density is due to studies with more than 85% power, starting with z-scores around 4. Studies with power in the range between 20% and 85% contribute very little to the total density. Thus, the plot correctly reveals that Wagenmakers’ prior attributes the roughly 50% average replicability mostly to studies with very low power (< 20%) and studies with very high power (> 85%).
[Figure 1: Powergraph for Wagenmakers’ prior (N = 80)]

Validation Study 1: Michèle Nuijten’s statcheck Data

There are a number of datasets that can be used to evaluate Wagenmakers’ prior. The first dataset is based on an automatic extraction of test statistics from psychological journals. I used Michèle Nuijten’s statcheck dataset to ensure that I did not cherry-pick data and to allow other researchers to reproduce the results.

The main problem with automatically extracted test statistics is that the dataset does not distinguish between theoretically important test statistics and other statistics, such as significance tests of manipulation checks. It is also not possible to distinguish between between-subject and within-subject designs. As a result, replicability estimates for this dataset will be higher than those from the simulation above, which assumed a between-subject design.

[Figure 2: Powergraph for Michèle Nuijten’s statcheck data]

Figure 2 shows all of the data, but only significant z-scores (z > 1.96) are used to estimate replicability and power. The most striking difference between Figure 1 and Figure 2 is the shape of the total density on the right side of the significance criterion.  In Figure 2 the slope is shallower. The difference is visible in the decomposition of the total density into densities for different power bands.  In Figure 1 most of the total density was accounted for by studies with less than 20% power and studies with more than 85% power.  In Figure 2, studies with power in the range between 20% and 85% account for the majority of studies with z-scores greater than 2.5 up to z-scores of 4.5.

The difference between Figure 1 and Figure 2 has direct implications for the interpretation of Bayes-Factors with t-values that correspond to z-scores in the range of just significant results. Given Wagenmakers’ prior, z-scores in this range mostly represent false-positive results. However, the real dataset suggests that some of these z-scores are the result of underpowered studies and publication bias. That is, in these studies the null-hypothesis is false, but the significant result will not replicate because these studies have low power.

Validation Study 2: Open Science Collaboration Articles (Original Results)

The second dataset is based on the Open Science Collaboration (OSC) replication project. The project aimed to replicate studies published in three major psychology journals in the year 2008. The final number of articles selected for replication was 99. The project replicated one study per article, but articles often contained multiple studies. I computed absolute z-scores for theoretically important tests from all studies in these 99 articles. This analysis produced 294 test statistics that could be converted into absolute z-scores.

[Figure 3: Powergraph for OSC replication project articles (all studies)]
Figure 3 shows clear evidence of publication bias.  No sampling distribution can produce the steep increase in tests around the critical value for significance. This selection is not an artifact of my extraction, but an actual feature of published results in psychological journals (Sterling, 1959).

Given the small number of studies, the figure also contains bootstrapped 95% confidence intervals.  The 95% CI for the power estimate shows that the sample is too small to estimate power for all studies, including studies in the proverbial file drawer, based on the subset of studies that were published. However, the replicability estimate of 49% has a reasonably tight confidence interval ranging from 45% to 66%.

The shape of the density distribution in Figure 3 differs from the distribution in Figure 2 in two ways: the initial slope is steeper in Figure 3, and there is less density in the tail with high z-scores. Both aspects contribute to the lower estimate of replicability in Figure 3, suggesting that replicability of focal hypothesis tests is lower than replicability of all statistical tests.

Comparing Figure 3 and Figure 1 shows again that the powergraph based on Wagenmakers’ prior differs from the powergraph for real data. In this case, the discrepancy is even more notable because focal hypothesis tests rarely produce large z-scores (z > 6).

Validation Study 3: Open Science Collaboration Articles (Replication Results)

At present, the only data that are somewhat representative of psychological research (at least of social and cognitive psychology) and that do not suffer from publication bias are the results of the replication studies in the OSC replication project. Out of 97 significant results in original studies, 36 (37%) produced a significant result in the replication study. After eliminating some replication studies (e.g., because the sample of the replication study was considerably smaller), 88 studies remained.

[Figure 4: Powergraph for OSC replication results (k = 88)]

Figure 4 shows the powergraph for the 88 studies. As there is no publication bias, estimates of power and replicability are based on non-significant and significant results. Although the sample size is smaller, the estimate of power has a reasonably narrow confidence interval because the estimate includes non-significant results. Estimated power is only 31%. The 95% confidence interval includes the actual success rate of 40%, which shows that there is no evidence of publication bias.

A visual comparison of Figure 1 and Figure 4 shows again that real data diverge from the predicted pattern by Wagenmakers’ prior.  Real data show a greater contribution of power in the range between 20% and 85% to the total density, and large z-scores (z > 6) are relatively rare in real data.

Conclusion

Statisticians have noted that it is good practice to examine the assumptions underlying statistical tests. This blog post critically examines the assumptions underlying the use of Bayes-Factors with Wagenmakers’ prior. The main finding is that Wagenmakers’ prior makes unreasonable assumptions about power, replicability, and the distribution of observed test statistics with or without publication bias. The main problem with Wagenmakers’ prior is that it predicts too many results with strong evidence against the null-hypothesis (z > 5, or the 5-sigma rule in physics). To achieve reasonable predictions for success rates without publication bias (~50%), Wagenmakers’ prior has to assume that over 90% of statistical tests conducted in psychology test a false hypothesis (i.e., predict an effect when H0 is true) and that the false-positive rate is close to 50%.

Implications

Bayesian statisticians have pointed out for a long time that the choice of a prior influences Bayes-Factors (Kass, 1993, p. 554).  It is therefore useful to carefully examine priors to assess the effect of priors on Bayesian inferences. Unreasonable priors will lead to unreasonable inferences.  This is also true for Wagenmakers’ prior.

The problem of using Bayes-Factors with Wagenmakers’ prior to test the null-hypothesis becomes apparent in a realistic scenario that assumes a moderate population effect size of d = .5 and a sample size of N = 80 in a between-subject design. Such a study has a non-centrality parameter of 2.24 and 60% power to produce a significant result with p < .05, two-tailed. I used R to simulate 10,000 test statistics from the corresponding non-central t-distribution and then computed Bayes-Factors with Wagenmakers’ prior.
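
A sketch of this simulation, using the same integration-based Bayes-Factor as in the sketch above; the exact percentages may differ slightly from those reported below depending on the Bayes-Factor implementation.

set.seed(456)
df  = 78
ncp = .5 / (2 / sqrt(80))        # = 2.24 for d = .5, N = 80
t.sim = rt(10000, df, ncp)       # simulated t-values when the effect is present

bf10 = function(tval, rscale = 1, neff = sqrt(40 * 40 / 80)) {
  m1 = integrate(function(d) dt(tval, df, d * neff) * dcauchy(d, 0, rscale),
                 lower = -Inf, upper = Inf)$value
  m1 / dt(tval, df, 0)
}
bf = sapply(t.sim, bf10)

mean(bf < 1/3)    # proportion interpreted as evidence for H0
mean(bf > 3)      # proportion interpreted as evidence for H1
mean(bf > 10)     # proportion with strong evidence for H1
hist(log(bf))     # distribution of log Bayes-Factors (cf. Figure 5)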

Figure 5 shows a histogram of log(BF). The log is used because Bayes-Factors are ratios and have very skewed distributions. The histogram shows that the Bayes-Factor never favors the null-hypothesis by a factor of 10 or more (BF < 1/10 in the histogram). The reason is that even with Wagenmakers’ prior, a sample size of N = 80 is too small to provide strong support for the null-hypothesis. However, 21% of the observed test statistics produce a Bayes-Factor less than 1/3, which is sometimes used as sufficient evidence to claim that the data support the null-hypothesis. This means that the test has a 21% error rate of providing evidence for the null-hypothesis when the null-hypothesis is false. A 21% error rate is about 4 times larger than the 5% error rate in null-hypothesis significance testing. It is not clear why researchers should replace a statistical method with a 5% error rate for false discoveries of effects with a method that has a 21% error rate for false discoveries of null effects.

Another 48% of the results produce Bayes-Factors that are considered inconclusive. This leaves 31% of results that favor H1 with a Bayes-Factor greater than 3, and only 17% of results produce a Bayes-Factor greater than 10.   This implies that even with the low standard of a BF > 3, the test has only 31% power to provide evidence for an effect that is present.

These results are not wrong; they correctly express the support that the observed data provide for H0 and H1. The problem only occurs when the specification of H1 is ignored. Given Wagenmakers’ prior, it is much more likely that a t-value of 1 stems from the sampling distribution of H0 than from the sampling distribution of H1. However, studies with 50% power when an effect is present are also much more likely to produce t-values of 1 than t-values of 6 or larger. Thus, a different prior that is more consistent with the actual power of studies in psychology would produce different Bayes-Factors and reduce the percentage of false discoveries of null effects. Researchers who think Wagenmakers’ prior is not a realistic prior for their research domain should therefore use a prior that is more suitable for their domain.

[Figure 5: Histogram of log Bayes-Factors]

Counterarguments

Wagenmakers has ignored previous criticisms of his prior. It is therefore not clear what counterarguments he would make. Below, I raise some potential counterarguments that might be used to defend the use of Wagenmakers’ prior.

One counterargument could be that the prior is not very important because the influence of priors on Bayes-Factors decreases as sample sizes increase.  However, this argument ignores the fact that Bayes-Factors are often used to draw inferences from small samples. In addition, Kass (1993) pointed out that “a simple asymptotic analysis shows that even in large samples Bayes factors remain sensitive to the choice of prior” (p. 555).

Another counterargument could be that a bias in favor of H0 is desirable because it keeps the rate of false-positives low. The problem with this argument is that Bayesian statistics does not provide information about false-positive rates.  Moreover, the cost for reducing false-positives is an increase in the rate of false negatives; that is, either inconclusive results or false evidence for H0 when an effect is actually present.  Finally, the choice of the correct prior will minimize the overall amount of errors.  Thus, it should be desirable for researchers interested in Bayesian statistics to find the most appropriate priors in order to minimize the rate of false inferences.

A third counterargument could be that Wagenmakers’ prior expresses a state of maximum uncertainty, which can be considered a reasonable default when no data are available.  If one considers each study as a unique study, a default prior of maximum uncertainty would be a reasonable starting point.  In contrast, it may be questionable to treat a new study as a randomly drawn study from a sample of studies with different population effect sizes.  However, Wagenmakers’ prior does not express a state of maximum uncertainty and makes assumptions about the probability of observing very large effect sizes.  It does so without any justification for this expectation.  It therefore seems more reasonable to construct priors that are consistent with past studies and to evaluate priors against actual results of studies.

A fourth counterargument is that Bayes-Factors are superior because they can provide evidence for the null-hypothesis as well as the alternative hypothesis. However, this is not quite correct: Bayes-Factors only provide support for the null-hypothesis relative to a specific alternative hypothesis. Researchers who are interested in testing the null-hypothesis can do so using parameter estimation with confidence or credibility intervals. If the interval falls within a specified region around zero, it is possible to affirm the null-hypothesis with a level of certainty that is determined by the precision with which the study estimates the population effect size. Thus, it is not necessary to use Bayes-Factors to test the null-hypothesis.
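
As a sketch of this interval-based approach: the equivalence region of plus or minus .1 standard deviations, the observed effect size, and the sample size are arbitrary choices for illustration.

# affirm the null-hypothesis if the 95% confidence interval for d falls entirely
# inside an equivalence region around zero (here +/- .1, an arbitrary choice)
d.obs = 0; N = 2000
se = 2 / sqrt(N)                         # approximate standard error of d
ci = d.obs + c(-1, 1) * qnorm(.975) * se
ci
all(ci > -.1 & ci < .1)                  # TRUE here: the interval lies inside the region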

In conclusion, Bayesian statistics and other statistics are not right or wrong. They combine assumptions and data to draw inferences.  Untrustworthy data and wrong assumptions can lead to false conclusions.  It is therefore important to test the integrity of data (e.g., presence of publication bias) and to examine assumptions.  The uncritical use of Bayes-Factors with default assumptions is not good scientific practice and can lead to false conclusions just like the uncritical use of p-values can lead to false conclusions.

A comparison of The Test of Excessive Significance and the Incredibility Index

It has been known for decades that published research articles report too many significant results (Sterling, 1959).  This phenomenon is called publication bias.  Publication bias has many negative effects on scientific progress and undermines the value of meta-analysis as a tool to accumulate evidence from separate original studies.

Not surprisingly, statisticians have tried to develop statistical tests of publication bias. The most prominent tests are funnel plots (Light & Pillemer, 1984) and Egger regression (Egger et al., 1997). Both tests rely on the assumption that population effect sizes are statistically independent of sample sizes. If this assumption holds, observed effect sizes in a representative set of studies should also be independent of sample size, and publication bias will introduce a negative correlation between observed effect sizes and sample sizes because larger effects are needed in smaller studies to produce a significant result. The main problem with these bias tests is that heterogeneity in population effect sizes also produces variation in observed effect sizes, and this variation in population effect sizes may itself be related to sample sizes. In fact, one would expect a correlation between population effect sizes and sample sizes if researchers use power analysis to plan their sample sizes: a power analysis would suggest larger samples to study smaller effects and smaller samples to study larger effects. This makes it problematic to draw strong inferences about the presence of publication bias from negative correlations between effect sizes and sample sizes.

Sterling et al. (1995) proposed a test of publication bias that does not have this limitation. The test is based on the fact that power is the relative frequency of significant results that one would expect from a series of exact replication studies. If a study has 50% power, the expected frequency of significant results in 100 replication studies is 50. Publication bias leads to an inflation of the percentage of significant results: if only significant results are published, the percentage of significant results in journals will be 100%, even if studies had only 50% power. Sterling et al. (1995) found that several journals reported over 90% significant results. Based on conservative estimates of power, they concluded that this high success rate can only be explained by publication bias. Sterling et al. (1995), however, did not develop a method that makes it possible to estimate power.

Ioannidis and Trikalinos (2007) proposed the first test of publication bias based on power analysis. They call it “an exploratory test for an excess of significant results” (ETESR). They do not reference Sterling et al. (1995), suggesting that they independently rediscovered the usefulness of power analysis for examining publication bias. The main problem for any bias test is to obtain an estimate of (true) power. As power depends on population effect sizes, and population effect sizes are unknown, power can only be estimated. ETESR uses a meta-analysis of effect sizes for this purpose.

This approach makes a strong assumption that is clearly stated by Ioannidis and Trikalinos (2007). The test works well “If it can be safely assumed that the effect is the same in all studies on the same question” (p. 246). In other words, the test may not work well when effect sizes are heterogeneous. Again, the authors are careful to point out this limitation of ETESR: “In the presence of considerable between-study heterogeneity, efforts should be made first to dissect sources of heterogeneity [33,34]. Applying the test ignoring genuine heterogeneity is ill-advised” (p. 246).

The authors repeat this limitation at the end of the article: “Caution is warranted when there is genuine between-study heterogeneity. Test of publication bias generally yield spurious results in this setting” (p. 252). Given these limitations, it would be desirable to develop a test that does not have to assume that all studies have the same population effect size.

In 2012, I developed the Incredibility Index (Schimmack, 2012). The name of the test is based on the observation that it becomes increasingly likely that a set of studies produces at least one non-significant result as the number of studies increases. For example, if studies have 50% power (Cohen, 1962), the chance of obtaining a significant result is equivalent to a coin flip. Most people will immediately recognize that it becomes increasingly unlikely that a fair coin will produce the same outcome again and again. Probability theory shows that this outcome becomes very unlikely after just a few coin tosses, as the cumulative probability decreases exponentially from 50% to 25% to 12.5%, 6.25%, 3.125%, and so on. Given standard criteria of improbability (less than 5%), a series of 5 significant results would be incredible and sufficient to raise suspicion that the coin is not fair, especially if it always falls on the side that benefits the person who is throwing it. As Sterling et al. (1995) demonstrated, the coin tends to favor researchers’ hypotheses at least 90% of the time. Eight studies are sufficient to show that even a success rate of 90% is improbable (p < .05). It is therefore very easy to show that publication bias contributes to the incredible success rate in journals, but it is also possible to do so for smaller sets of studies.
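
These probabilities are simple binomial calculations. The sketch below checks the two claims in this paragraph under the assumption of 50% power; reading the 90% success rate as at least 7 of 8 significant results is my interpretation.

# probability of 5 significant results in 5 studies with 50% power
.5^5                                                # = .03125 < .05
# probability of a ~90% success rate (at least 7 of 8 significant results)
# when true power is only 50%
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)  # ~ .035 < .05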

To avoid the requirement of a fixed effect size, the Incredibility Index computes observed power for individual studies. This approach avoids the need to aggregate effect sizes across studies. The problem with this approach is that the observed power of a single study is a very unreliable measure of power (Yuan & Maxwell, 2005). However, as always, the estimate of power becomes more precise when the power estimates of individual studies are combined. The original Incredibility Index used the mean to estimate average power, but Yuan and Maxwell (2005) demonstrated that the mean of observed power is a biased estimate of average (true) power. In further development of the method, I now use median observed power (Schimmack, 2016). The median of observed power is an unbiased estimator of power (Schimmack, 2015).
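
A sketch of this computation for a small set of hypothetical absolute z-scores, using a common normal approximation for observed power (the probability of obtaining p < .05, two-tailed, if the observed z-score were the true non-centrality):

z = c(2.1, 2.4, 2.0, 3.1, 2.2, 2.6)     # hypothetical absolute z-scores
obs.power = pnorm(z - qnorm(.975))      # observed power of each study
median(obs.power)                       # median observed power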

In conclusion, the Incredibility Index and the Exploratory Test for an Excess of Significant Results are similar tests, but they differ in one important aspect. ETESR is designed for meta-analyses of highly similar studies with a fixed population effect size. When this condition is met, ETESR can be used to examine publication bias. However, when this condition is violated and effect sizes are heterogeneous, the Incredibility Index is a superior method for examining publication bias. At present, the Incredibility Index is the only test of publication bias that does not assume a fixed population effect size, which makes it the ideal test for publication bias in heterogeneous sets of studies.

References

Light, R. J., & Pillemer, D. B. (1984). Summing up: The science of reviewing research. Cambridge, MA: Harvard University Press.

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629–634. doi:10.1136/bmj.315.7109.629

Ioannidis, J. P. A., & Trikalinos, T. A. (2007). An exploratory test for an excess of significant findings. Clinical Trials, 4, 245–253.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17, 551–566.

Schimmack, U. (2016). A revised introduction to the R-Index.

Schimmack, U. (2015). Meta-analysis of observed power.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 141–167