Why Psychologists Should Not Change The Way They Analyze Their Data: The Devil is in the Default Prior

The scientific method is well-equipped to demonstrate regularities in nature as well as in human behavior. It works by repeating a scientific procedure (an experiment or a natural observation) many times. In the absence of a regular pattern, the empirical data will follow a random pattern. When a systematic pattern exists, the data will deviate from the pattern predicted by randomness. The deviation of an observed empirical result from a predicted random pattern is often quantified as a probability (p-value). The p-value itself is based on the ratio of the observed deviation from zero (effect size) and the amount of random error. As this signal-to-noise ratio increases, it becomes increasingly unlikely that the observed effect is simply a random event, and correspondingly more likely that an effect is present. The amount of noise in a set of observations can be reduced by repeating the scientific procedure many times: as the number of observations increases, noise decreases. For strong effects (large deviations from randomness), a relatively small number of observations can be sufficient to produce extremely low p-values. However, small effects may require rather large samples to obtain a signal-to-noise ratio high enough to produce a very small p-value. This makes it difficult to test the null-hypothesis that there is no effect. The reason is that it is always possible to find an effect size that is so small that the noise in a study is too large to determine whether a small effect is present or whether there is really no effect at all; that is, whether the effect size is exactly zero, which is just one point among infinitely many possible effect sizes.
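To make the link between sample size, signal-to-noise ratio, and p-values concrete, here is a minimal sketch in R with hypothetical numbers (a one-sample z-test for a fixed small effect of d = .2):

```r
# Illustration (hypothetical numbers): the p-value for a fixed small effect
# (d = .2) shrinks as the sample size increases, because the standard error
# shrinks with sqrt(N).
d <- 0.2                      # true standardized effect size
N <- c(25, 100, 400, 1600)    # sample sizes
se <- 1 / sqrt(N)             # standard error of the standardized mean
z  <- d / se                  # signal-to-noise ratio
p  <- 2 * (1 - pnorm(z))      # two-tailed p-value
round(data.frame(N, z, p), 4) # p moves toward zero as N grows
```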

The problem that it is impossible to demonstrate scientifically that an effect is absent may explain why the scientific method has been unable to resolve conflicting views about controversial topics such as parapsychological phenomena or homeopathic medicine, which lack a scientific explanation but are believed by many to be real. The scientific method could show that these phenomena are real, if they were real, but the lack of evidence for these effects cannot rule out the possibility that a small effect may exist. In this post, I explore two statistical solutions to the problem of demonstrating that an effect is absent.

Neyman-Pearson Significance Testing (NPST)

The first solution is to follow Neyman and Pearson's orthodox significance test. NPST differs from the widely practiced null-hypothesis significance test (NHST) in that non-significant results are interpreted as evidence for the null-hypothesis. Thus, using the standard criterion of p = .05 for significance, a p-value below .05 is used to reject the null-hypothesis and to infer that an effect is present. Importantly, if the p-value is greater than .05, the results are used to accept the null-hypothesis; that is, to conclude that the hypothesis that there is no effect is true. As with all statistical inferences, it is possible that the evidence is misleading and leads to the wrong conclusion. NPST distinguishes between two types of errors, called type-I and type-II errors. Type-I errors occur when a p-value is below the criterion value (p < .05), but the null-hypothesis is actually true; that is, there is no effect and the observed effect size was caused by a rare random event. Type-II errors are made when the null-hypothesis is accepted, but the null-hypothesis is false; there actually is an effect. The probability of making a type-II error depends on the size of the effect and the amount of noise in the data. Strong effects are unlikely to produce a type-II error even with noisy data. Studies with very little noise are also unlikely to produce type-II errors because even small effects can still produce a high signal-to-noise ratio and significant results (p-values below the criterion value). Type-II error rates can be very high in studies with small effects and a large amount of noise. NPST makes it possible to quantify the probability of a type-II error for a given effect size. By investing a large amount of resources, it is possible to reduce noise to a level that yields a very low type-II error probability even for very small effect sizes. The only requirement for using NPST to provide evidence for the null-hypothesis is to determine a margin of error that is considered acceptable. For example, it may be acceptable to infer that a weight-loss medication has no effect on weight if weight loss is less than 1 pound over a one-month period. It is impossible to demonstrate that the medication has absolutely no effect, but it is possible to demonstrate with high probability that the effect is unlikely to be more than 1 pound.
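As a sketch of how the type-II error rate can be quantified for an a priori effect size, base R's power.t.test can be used; the effect size and sample sizes below are illustrative, not taken from any particular study:

```r
# Type-II error (1 - power) for an a priori effect size of d = .2 in a
# one-sample design, at two different sample sizes (illustrative numbers).
small <- power.t.test(n = 50,   delta = 0.2, sd = 1, sig.level = 0.05,
                      type = "one.sample")
large <- power.t.test(n = 1000, delta = 0.2, sd = 1, sig.level = 0.05,
                      type = "one.sample")
c(type2_n50 = 1 - small$power, type2_n1000 = 1 - large$power)
# With N = 50 the type-II error is large; with N = 1000 it is close to zero,
# so a non-significant result would justify accepting that any effect is
# smaller than the chosen margin of d = .2.
```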

Bayes-Factors

The main difference between Bayes-Factors and NPST is that NPST yields type-II error rates for a single a priori effect size. In contrast, Bayes-Factors do not postulate a single effect size, but use an a priori distribution of effect sizes. Bayes-Factors compare the probability of the observed effect size if the true effect size were zero with the probability of the observed effect size if the true effect size were drawn from the a priori distribution of effect sizes. The Bayes-Factor is the ratio of these two probabilities. It is arbitrary which hypothesis is in the numerator and which is in the denominator. When the null-hypothesis is placed in the numerator and the alternative hypothesis in the denominator, Bayes-Factors (BF01) decrease towards zero the more the data suggest that an effect is present. In this way, Bayes-Factors behave very much like p-values: as the signal-to-noise ratio increases, p-values and BF01 decrease.
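As a minimal sketch of this idea, the following R code computes BF01 for an observed t-statistic by averaging the likelihood over an a priori distribution of effect sizes. The prior used here (d ~ Normal(0, .5)) is an arbitrary illustration, not the prior used by any of the authors discussed below:

```r
# Minimal sketch: a Bayes-Factor compares the probability of the observed
# t-statistic under H0 (effect size exactly 0) with its average probability
# under an a priori distribution of effect sizes (here an arbitrary
# illustrative prior, d ~ Normal(0, 0.5)).
bf01 <- function(t, N, prior = function(d) dnorm(d, 0, 0.5)) {
  df <- N - 1
  p_H0 <- dt(t, df)                                   # likelihood under d = 0
  p_H1 <- integrate(function(d) dt(t, df, ncp = d * sqrt(N)) * prior(d),
                    lower = -3, upper = 3)$value      # averaged over the prior;
                                                      # |d| > 3 has negligible mass
  p_H0 / p_H1                                         # BF01 (> 1 favors H0)
}
bf01(t = 2.0, N = 100)   # a hypothetical study: t = 2.0, N = 100
```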

There are two practical problems in the use of Bayes-Factors. One problem is that Bayes-Factors depend on the specification of the a priori distribution of effect sizes. It is therefore important to recognize that results can never be interpreted as evidence for or against the null-hypothesis per se. A Bayes-Factor that favors the null-hypothesis in comparison to one a priori distribution can favor the alternative hypothesis for another a priori distribution of effect sizes. This makes Bayes-Factors impractical for the purpose of demonstrating that an effect does not exist (e.g., that a drug does not have positive treatment effects). The second problem is that Bayes-Factors only provide quantitative information about the relative support for the two hypotheses. Without a clear criterion value, Bayes-Factors cannot be used to claim that an effect is present or absent.

Selecting a Criterion Value for Bayes-Factors

A number of criterion values seem plausible. NPST always leads to a decision, depending on the criterion for p-values. An equivalent criterion value for Bayes-Factors would be a value of 1. Values greater than 1 favor the null-hypothesis over the alternative, whereas values less than 1 favor the alternative hypothesis. This criterion avoids inconclusive results. The disadvantage of this criterion is that Bayes-Factors close to 1 are very variable and prone to high type-I and type-II error rates. To avoid this problem, it is possible to use more stringent criterion values. This reduces the type-I and type-II error rates, but it also increases the rate of inconclusive results in noisy studies. Bayes-Factors of 3 (a 3-to-1 ratio in favor of one hypothesis over the other) are often used to suggest that the data favor one hypothesis over another, and Bayes-Factors of 10 or more are often considered strong support. One problem with these criterion values is that there have been no systematic studies of the type-I and type-II error rates associated with them. Moreover, there have been no systematic sensitivity studies; that is, studies of the ability to reach a criterion value for different signal-to-noise ratios.

Wagenmakers et al. (2011) argued that p-values can be misleading and that Bayes-Factors provide more meaningful results. To make their point, they investigated Bem's (2011) controversial studies that seemed to demonstrate the ability to anticipate random events in the future (time-reversed causality). Using a significance criterion of p < .05 (one-tailed), 9 out of 10 studies showed evidence of an effect. For example, in Study 1, participants were able to predict the location of erotic pictures 54% of the time, even before a computer randomly generated the location of the picture. Using a more liberal type-I error rate of p < .10 (one-tailed), all 10 studies produced evidence for extrasensory perception.

Wagenmakers et al. (2011) re-examined the data with Bayes-Factors. They used a Bayes-Factor of 3 as the criterion value. Using this value, six tests were inconclusive, three provided substantial support for the null-hypothesis (the observed effect was just due to noise in the data) and only one test produced substantial support for ESP.   The most important point here is that the authors interpreted their results using a Bayes-Factor of 3 as criterion. If they had used a Bayes-Factor of 10 as criterion, they would have concluded that all studies were inconclusive. If they had used a Bayes-Factor of 1 as criterion, they would have concluded that 6 studies favored the null-hypothesis and 4 studies favored the presence of an effect.

Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers used Bayes-Factors in a design with optional stopping. They agreed to stop data collection when the Bayes-Factor reached a criterion value of 10 in favor of either hypothesis. The implementation of a decision to stop data collection suggests that a Bayes-Factor of 10 was considered decisive. One rationale for this stopping rule would be that it is extremely unlikely that a Bayes-Factor of 10 would swing to favoring the other hypothesis if more data were collected. By the same logic, a Bayes-Factor of 10 that favors the presence of an effect in an ESP study would suggest that further data collection is unnecessary because the evidence already provides rather strong support for the presence of an effect.

Tan, Dienes, Jansari, and Goh (2014) report a Bayes-Factor of 11.67 and interpret it as being "greater than 3 and strong evidence for the alternative over the null" (p. 19). Armstrong and Dienes (2013) report a Bayes-Factor of 0.87 and state that no conclusion follows from this finding because the Bayes-Factor is between 3 and 1/3. This statement implies that Bayes-Factors that meet the criterion value are conclusive.

In sum, a criterion-value of 3 has often been used to interpret empirical data and a criterion of 10 has been used as strong evidence in favor of an effect or in favor of the null-hypothesis.

Meta-Analysis of Multiple Studies

As sample sizes increase, noise decreases and the signal-to-noise ratio increases. Rather than increasing the sample size of a single study, it is also possible to conduct multiple smaller studies and to combine their evidence in a meta-analysis. The effect is the same. A meta-analysis based on several original studies reduces random noise in the data and can produce higher signal-to-noise ratios when an effect is present. On the flip side, a low signal-to-noise ratio in a meta-analysis implies that the signal is very weak and that the true effect size is close to zero. As the evidence in a meta-analysis is based on the aggregation of several smaller studies, the results should be consistent; that is, the effect size in the smaller studies and in the meta-analysis is the same. The only difference is that aggregation of studies reduces noise, which increases the signal-to-noise ratio. A meta-analysis therefore highlights the problem of interpreting a low signal-to-noise ratio (BF10 < 1, p > .05) in small studies as evidence for the null-hypothesis. In NPST this result would be flagged as not trustworthy because the type-II error probability is high. For example, a non-significant result with a type-II error probability of 80% (20% power) is not particularly interesting, and nobody would want to accept the null-hypothesis with such a high error probability. Holding the effect size constant, the type-II error probability decreases as the number of studies in a meta-analysis increases, and a non-significant result makes it increasingly probable that the true effect size is below the value that was considered necessary to demonstrate an effect. Similarly, Bayes-Factors can be misleading in small samples and become more conclusive as more information becomes available.

A simple demonstration of the influence of sample size on Bayes-Factors comes from Rouder and Morey (2011). The authors point out that it is not possible to combine Bayes-Factors by multiplying the Bayes-Factors of individual studies. To address this problem, they created a new method to combine Bayes-Factors. This Bayesian meta-analysis is implemented in the BayesFactor R package. Rouder and Morey (2011) applied their method to a subset of Bem's data. However, they did not use it to examine the combined Bayes-Factor for the 10 studies that Wagenmakers et al. (2011) examined individually. I submitted the t-values and sample sizes of all 10 studies to a Bayesian meta-analysis and obtained a strong Bayes-Factor in favor of an effect, BF10 = 16e7 in favor of ESP. Thus, a meta-analysis of all 10 studies strongly suggests that Bem's data are not random.
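For readers who want to run this kind of analysis, here is a sketch of the call to the BayesFactor package's meta-analytic t-test. The t-values and sample sizes below are hypothetical placeholders, not Bem's actual results, and the rscale value is an assumption:

```r
# Sketch of a Bayesian meta-analysis with the BayesFactor package
# (Rouder & Morey, 2011). The values below are hypothetical placeholders;
# to reproduce the analysis in the text they would be replaced by the ten
# t-values and sample sizes reported by Bem (2011).
library(BayesFactor)
t_values     <- c(2.5, 2.4, 1.8, 2.0, 2.6, 2.2, 1.9, 2.3, 2.1, 1.7)  # hypothetical
sample_sizes <- c(100, 150, 100, 100, 100, 150, 200, 100, 50, 100)   # hypothetical
meta.ttestBF(t = t_values, n1 = sample_sizes, rscale = 1)  # BF10 for a common effect
```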

Another way to meta-analyze Bem's 10 studies is to compute a Bayes-Factor based on the finding that 9 out of 10 studies produced a significant result. The probability of this outcome under the null-hypothesis is extremely small: 1.86e-11, that is, p < .00000000002. It is also possible to compute a Bayes-Factor for the binomial probability of 9 out of 10 successes when the probability of a success under the null-hypothesis is 5%. The alternative hypothesis can be specified in several ways, but one common option is to use a uniform distribution from 0 to 1 (beta(1,1)). This distribution allows the power of a study to range anywhere from 0 to 1 and makes no a priori assumptions about the true power of Bem's studies. The Bayes-Factor strongly favors the presence of an effect, BF10 = 20e9. In sum, a meta-analysis of Bem's 10 studies strongly supports the presence of an effect and rejects the null-hypothesis.
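A sketch of this binomial calculation in R is shown below. The exact Bayes-Factor depends on details of the specification (for example, whether one conditions on exactly 9 or on at least 9 significant results), so the output should be read as an order of magnitude rather than a precise reproduction of the number above:

```r
# Binomial sketch: 9 significant results in 10 studies.
# Under H0, each study is "significant" with probability .05 (one-tailed).
k <- 9; n <- 10
p_H0 <- dbinom(k, n, prob = 0.05)               # about 1.86e-11
# Under H1, the unknown power is given a uniform beta(1, 1) prior;
# the marginal probability is choose(n, k) * Beta(k + 1, n - k + 1).
p_H1 <- choose(n, k) * beta(k + 1, n - k + 1)   # about 0.09
BF10 <- p_H1 / p_H0
BF10   # in the billions; the exact value depends on the precise specification
```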

The meta-analytic results raise concerns about the validity of Wagenmakers et al.'s (2011) claim that Bem presented weak evidence and that p-values provide misleading information. Instead, Wagenmakers et al.'s Bayes-Factors are misleading and fail to detect an effect that is clearly present in the data.

The Devil is in the Priors: What is the Alternative Hypothesis in the Default Bayesian t-test?

Wagenmakers et al. (2011) computed Bayes-Factors using the default Bayesian t-test. The default Bayesian t-test uses a Cauchy distribution centered over zero as the alternative hypothesis. The Cauchy distribution has a scaling factor. Wagenmakers et al. (2011) used a default scaling factor of 1. Since then, the default scaling parameter has changed to .707. Figure 1 illustrates Cauchy distributions with scaling factors of .2, .5, .707, and 1.

[Figure 1. Cauchy distributions centered over 0 with scaling factors of .2, .5, .707, and 1.]

The black line shows the Cauchy distribution with a scaling factor of d = .2. A scaling factor of d = .2 implies that 50% of the density of the distribution falls in the interval between d = -.2 and d = .2. As the Cauchy distribution is centered over 0, this specification implies that effect sizes close to zero are considered much more likely than more extreme effect sizes, while giving equal weight to effect sizes below and above an absolute value of d = .2. As the scaling factor increases, the distribution gets wider. With a scaling factor of 1, 50% of the density is within the range from -1 to 1 and 50% covers effect sizes with absolute values greater than 1. The choice of the scaling parameter has predictable consequences for the Bayes-Factor. As long as the true effect size is more extreme than the scaling parameter, Bayes-Factors will favor the alternative hypothesis, and Bayes-Factors will increase towards infinity as sampling error decreases. However, for true effect sizes that are smaller than the scaling parameter, Bayes-Factors may initially favor the null-hypothesis because the alternative hypothesis gives considerable weight to effect sizes that are more extreme than the true effect size. As sample sizes increase, the Bayes-Factor will change from favoring the null-hypothesis to favoring the alternative hypothesis. This can explain why Wagenmakers et al. (2011) found no support for ESP when Bem's studies were examined individually, whereas a meta-analysis of all studies shows strong evidence in favor of an effect.
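The 50% statements above can be checked directly in R; the band of +/- .2 is added as an illustration of how little prior mass wide priors place on small effects:

```r
# Proportion of prior mass that a zero-centered Cauchy prior places inside
# +/- its own scaling factor (always 50%), and inside a "small effect" band
# of +/- .2, for the scaling factors shown in Figure 1.
scales <- c(0.2, 0.5, 0.707, 1)
inside_own_scale <- pcauchy(scales, scale = scales) - pcauchy(-scales, scale = scales)
inside_small     <- pcauchy(0.2,   scale = scales) - pcauchy(-0.2,   scale = scales)
round(rbind(inside_own_scale, inside_small), 3)
```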

The effect of the scaling parameter on Bayes-Factors is illustrated in the following Figure.

[Figure 2. Bayes-Factors as a function of sample size for a true effect size of d = .2 (black) and a true effect size of 0 (red), using Cauchy priors with scaling factors of 1 (solid lines) and .2 (dotted lines); the green line marks the criterion value of 3.]

The solid lines show Bayes-Factors (y-axis) as a function of sample size for a scaling parameter of 1. The black line shows Bayes-Factors favoring an effect (BF10) when the effect size is actually d = .2, and the red line shows Bayes-Factors favoring the null-hypothesis when the effect size is actually 0. The green line marks a criterion value of 3, which has been used to suggest "substantial" support for either hypothesis (Wagenmakers et al., 2011). The figure shows that Bem's sample sizes of 50 to 150 participants could never produce substantial evidence for an effect when the observed effect size is d = .2. In contrast, an effect size of 0 would provide substantial support for the null-hypothesis. Of course, actual effect sizes in samples will deviate from these hypothetical values, but sampling error averages out: for studies that happen to show stronger support for an effect there will also be studies that show weaker support. The dotted lines illustrate how the choice of the scaling factor influences Bayes-Factors. With a scaling factor of d = .2, Bayes-Factors would never favor the null-hypothesis. They would also not support the alternative hypothesis in studies with fewer than 150 participants, and even in larger studies the Bayes-Factor is likely to be just above 3.

Figure 2 explains why Wagenmakers et al. (2011) mainly found inconclusive results. On the one hand, the effect size was typically around d = .2. As a result, the Bayes-Factor did not provide clear support for the null-hypothesis. On the other hand, an effect size of d = .2, even in studies planned to have 80% power, is insufficient to produce Bayes-Factors favoring the presence of an effect when the alternative hypothesis is specified as a Cauchy distribution centered over 0. This is especially true when the scaling parameter is large, but even for a seemingly small scaling parameter Bayes-Factors would not provide strong support for a small effect. The reason is that the alternative hypothesis is centered over 0. As a result, it is difficult to distinguish the null-hypothesis from the alternative hypothesis.
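As a sketch of this point, the code below computes the Bayes-Factor for one hypothetical study with an observed effect of d = .2 and N = 100 under zero-centered Cauchy priors with different scaling factors. The computation integrates the noncentral t likelihood against the prior, which should closely approximate the default Bayesian t-test, but the exact numbers are illustrative:

```r
# Sketch: how the scaling factor of a zero-centered Cauchy prior changes the
# Bayes-Factor for one hypothetical study with an observed effect of d = .2
# and N = 100 (one-sample), i.e., t = d * sqrt(N) = 2.
bf10_cauchy <- function(t, N, rscale) {
  df <- N - 1
  m1 <- integrate(function(d) dt(t, df, ncp = d * sqrt(N)) * dcauchy(d, scale = rscale),
                  lower = -3, upper = 3)$value   # |d| > 3 contributes essentially nothing here
  m1 / dt(t, df)                                 # BF10 (> 1 favors an effect)
}
sapply(c(0.2, 0.5, 0.707, 1), function(r) bf10_cauchy(t = 2, N = 100, rscale = r))
# None of the scales yields a Bayes-Factor above 3, and the widest prior
# leans toward the null-hypothesis.
```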

A True Alternative Hypothesis: Centering the Prior Distribution over a Non-Null Effect Size

A Cauchy distribution is just one possible way to formulate an alternative hypothesis. It is also possible to formulate the alternative hypothesis as (a) a uniform distribution of effect sizes in a fixed range (e.g., the effect size is probably small to moderate, d = .2 to .5) or (b) a normal distribution centered over an expected effect size (e.g., the effect is most likely to be small, but there is some uncertainty about how small, d = .2 +/- SD = .1) (Dienes, 2014).

Dienes provided an online app to compute Bayes-Factors for these prior distributions. I used R-code posted by John Christie to create the following figure. It shows Bayes-Factors for three a priori uniform distributions. Solid lines show Bayes-Factors for effect sizes in the range from 0 to 1. Dotted lines show effect sizes in the range from 0 to .5. The dot-dash lines show Bayes-Factors for effect sizes in the range from .1 to .3. The most noteworthy observation is that prior distributions that are not centered over zero can actually provide evidence for a small effect with Bem's (2011) sample sizes. The second observation is that these priors can also favor the null-hypothesis when the true effect size is zero (red lines). Bayes-Factors become more conclusive for more precisely formulated alternative hypotheses. The strongest evidence is obtained by contrasting the null-hypothesis with a narrow interval of possible effect sizes in the .1 to .3 range. The reason is that in this comparison even weak effects below .1 clearly favor the null-hypothesis. For an expected effect size of d = .2, a range of values from 0 to .5 seems reasonable and can produce Bayes-Factors that exceed a value of 3 in studies with 100 to 200 participants. Thus, this is a reasonable prior for Bem's studies.

[Figure 3. Bayes-Factors as a function of sample size for uniform prior distributions of effect sizes from 0 to 1 (solid), 0 to .5 (dotted), and .1 to .3 (dot-dash), for true effect sizes of d = .2 and d = 0 (red lines).]
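A sketch of the same comparison for a single hypothetical study (observed d = .2, N = 100, t = 2) is shown below; the three ranges match the figure, and the numbers are illustrative rather than an exact reproduction of Dienes's calculator:

```r
# Sketch: Bayes-Factors for an observed effect of d = .2 in a hypothetical
# N = 100 study (t = 2) when the alternative is a uniform distribution of
# effect sizes (ranges 0-1, 0-.5, and .1-.3, as in the figure above).
bf10_uniform <- function(t, N, lo, hi) {
  df <- N - 1
  m1 <- integrate(function(d) dt(t, df, ncp = d * sqrt(N)) * dunif(d, lo, hi),
                  lower = lo, upper = hi)$value
  m1 / dt(t, df)               # BF10 (> 1 favors an effect)
}
c(wide   = bf10_uniform(2, 100, 0.0, 1.0),
  medium = bf10_uniform(2, 100, 0.0, 0.5),
  narrow = bf10_uniform(2, 100, 0.1, 0.3))
# The narrower the range of predicted effect sizes, the stronger the support
# for an effect of this magnitude.
```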

It is also possible to formulate alternative hypotheses as normal distributions around an a priori effect size. Dienes recommends setting the mean to 0 and setting the standard deviation to the expected effect size. The problem with this approach is again that the alternative hypothesis is centered over 0 (in a two-tailed test). Moreover, the true effect size is not known. As with the scaling factor of the Cauchy distribution, using a higher value leads to a wider spread of alternative effect sizes and makes it harder to show evidence for small effects and easier to find evidence in favor of H0. However, the R-code also allows specifying non-null means for the alternative hypothesis. The next figure shows Bayes-Factors for three normally distributed alternative hypotheses. The solid lines show Bayes-Factors with mean = 0 and SD = .2. The dotted line shows Bayes-Factors for a mean of d = .2 (the small effect predicted by Bem) and a relatively wide standard deviation of .5. This means 95% of effect sizes are in the range from -.8 to 1.2. The broken (dot-dash) line shows Bayes-Factors with a mean of d = .2 and a narrower SD of .2. The 95% interval still covers a rather wide range of effect sizes from -.2 to .6, but due to the normal distribution, effect sizes close to the expected effect size of d = .2 are weighted more heavily.

[Figure 4. Bayes-Factors as a function of sample size for normally distributed priors with mean 0 and SD = .2 (solid), mean .2 and SD = .5 (dotted), and mean .2 and SD = .2 (dot-dash), for true effect sizes of d = .2 and d = 0.]

The first observation is that centering the normal distribution over 0 leads to the same problem as the Cauchy distribution. When the effect size is really 0, Bayes-Factors provide clear support for the null-hypothesis. However, when the effect size is small, d = .2, Bayes-Factors fail to provide support for the presence of an effect in samples with fewer than 150 participants (this is a one-sample design; the equivalent sample size for between-subjects designs is N = 600). The dotted line shows that simply moving the mean from d = 0 to d = .2 has relatively little effect on Bayes-Factors. Due to the wide range of effect sizes, a small effect is not sufficient to produce Bayes-Factors greater than 3 in small samples. The broken line shows more promising results. With a mean of d = .2 and SD = .2, Bayes-Factors in samples with fewer than 100 participants are inconclusive. For sample sizes of more than 100 participants, both lines (for a true effect of d = .2 and for a true effect of 0) are above the criterion value of 3 in favor of the correct hypothesis. This means that a Bayes-Factor of 3 or more can support the null-hypothesis when it is true and can show that a small effect is present when an effect is present.

Another way to specify the alternative hypothesis is to use a one-tailed (half-normal) distribution. The mode (the peak) of the distribution is at 0. The solid line shows results for a standard deviation of .8, the dotted line for a standard deviation of .5, and the broken line for a standard deviation of d = .2. The solid line is biased toward the null-hypothesis, and it requires sample sizes of more than 130 participants before an effect size of d = .2 produces a Bayes-Factor of 3 or more. In contrast, the broken line discriminates against the null-hypothesis and practically never supports the null-hypothesis when it is true. The dotted line with a standard deviation of .5 works best. It always shows support for the null-hypothesis when it is true, and it can produce Bayes-Factors greater than 3 with a bit more than 100 participants.

[Figure 5. Bayes-Factors as a function of sample size for half-normal (one-tailed) priors with standard deviations of .8 (solid), .5 (dotted), and .2 (dot-dash), for true effect sizes of d = .2 and d = 0.]
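The normal and half-normal priors from the last two figures can be sketched for the same hypothetical study (observed d = .2, N = 100, t = 2); again, the numbers are illustrative:

```r
# Sketch: the same hypothetical study (d = .2, N = 100, t = 2) evaluated
# against normal and half-normal priors like those in the last two figures.
bf10_prior <- function(t, N, prior, lo, hi) {
  df <- N - 1
  m1 <- integrate(function(d) dt(t, df, ncp = d * sqrt(N)) * prior(d),
                  lower = lo, upper = hi)$value
  m1 / dt(t, df)                               # BF10 (> 1 favors an effect)
}
c("normal(0, .2)"      = bf10_prior(2, 100, function(d) dnorm(d, 0, 0.2),     -2, 2),
  "normal(.2, .2)"     = bf10_prior(2, 100, function(d) dnorm(d, 0.2, 0.2),   -2, 2),
  "half-normal(0, .5)" = bf10_prior(2, 100, function(d) 2 * dnorm(d, 0, 0.5),  0, 3))
```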

In conclusion, the simulations show that Bayes-Factors depend on the specification of the prior distribution and sample size. This has two implications. Unreasonable priors will lower the sensitivity/power of Bayes-Factors to support either the null-hypothesis or the alternative hypothesis when these hypotheses are true. Unreasonable priors will also bias the results in favor of one of the two hypotheses. As a result, researchers need to justify the choice of their priors and they need to be careful when they interpret results. It is particularly difficult to interpret Bayes-Factors when the alternative hypothesis is diffuse and the null-hypothesis is supported. In this case, the evidence merely shows that the null-hypothesis fits the data better than the alternative, but the alternative is a composite of many effect sizes and some of these effect sizes may fit the data better than the null-hypothesis.

Comparison of Different Prior Distributions with Bem’s (2011) ESP Experiments

To examine the influence of prior distributions on Bayes-Factors, I computed Bayes-Factors using several prior distributions. I used a d~Cauchy(1) distribution because this distribution was used by Wagenmakers et al. (2011). I used three uniform prior distributions with ranges of effect sizes from 0 to 1, 0 to .5, and .1 to .3. Based on Dienes's recommendation, I also used a normal distribution centered on zero with the expected effect size as the standard deviation. I used both two-tailed and one-tailed (half-normal) distributions. Based on a Twitter recommendation by Alexander Etz, I also centered the normal distribution on the expected effect size, d = .2, with a standard deviation of d = .2.

[Table 1. Bayes-Factors for each of Bem's (2011) ten studies under the different prior distributions, along with the product of the individual Bayes-Factors and a Bayes-Factor based on a fixed-effect meta-analysis of all ten studies.]

The d~Cauchy(1) prior used by Wagenmakers et al. (2011) gives the weakest support for an effect. The table also includes the product of the Bayes-Factors. The results confirm that this product is not a meaningful statistic that can be used to conduct a meta-analysis with Bayes-Factors. The last column shows Bayes-Factors based on a traditional fixed-effect meta-analysis of effect sizes in all 10 studies. Even the d~Cauchy(1) prior now shows strong support for the presence of an effect, even though it often favored the null-hypothesis for individual studies. This finding shows that inferences about small effects in small samples cannot be trusted as evidence that the null-hypothesis is correct.

Table 1 also shows that all other prior distributions tend to favor the presence of an effect even in individual studies. Thus, these priors show consistent results for individual studies and for a meta-analysis of all studies. The strength of evidence for an effect is predictable from the precision of the alternative hypothesis. The uniform distribution with a wide range of effect sizes from 0 to 1 gives the weakest support, but it still supports the presence of an effect. This further emphasizes how unrealistic the Cauchy distribution with a scaling factor of 1 is for most studies in psychology. In psychology, effect sizes greater than 1 are rare. Moreover, effect sizes greater than 1 do not need fancy statistics; a simple visual inspection of a scatter plot is sufficient to reject the null-hypothesis. The strongest support for an effect is obtained for the uniform distribution with a range of effect sizes from .1 to .3. The advantage of this range is that the lower bound is not 0. Thus, effect sizes below the lower bound provide evidence for H0 and effect sizes above the lower bound provide evidence for an effect. The lower bound can be set by a meaningful consideration of which effect sizes would be theoretically or practically so small that they would be rather uninteresting even if they were real.

Personally, I find uniform distributions appealing because they best express uncertainty about an effect size. Most theories in psychology do not make predictions about effect sizes. Thus, it seems impossible to say that an effect is expected to be exactly small (d = .2) or moderate (d = .5). It seems easier to say that an effect is expected to be small (d = .1 to .3), moderate (d = .3 to .6), or large (d = .6 to 1). Cohen used fixed values only because power analysis requires a single value. As Bayesian statistics allows the specification of ranges, it makes sense to specify a range of values without the need to predict which values in this range are more likely. However, the normal distribution provides similar results. Again, the strength of evidence for an effect increases with the precision of the predicted effect. The weakest support for an effect is obtained with a normal distribution centered over 0 and a two-tailed test. This specification is similar to a Cauchy distribution, but it uses the normal distribution. However, by setting the standard deviation to the expected effect size, Bayes-Factors show evidence for an effect. The evidence for an effect becomes stronger by centering the distribution over the expected effect size or by using a half-normal (one-tailed) test that makes a prediction about the direction of the effect.

To summarize, the main point is that Bayes-Factors depend on the choice of the alternative distribution. Bayesian statisticians are of course well aware of this fact. However, in practical applications of Bayesian statistics, the importance of the prior distribution is often ignored, especially when Bayes-Factors favor the null-hypothesis. Although this finding only means that the data support the null-hypothesis more than the alternative hypothesis, the alternative hypothesis is often described in vague terms as a hypothesis that predicted an effect. However, the alternative hypothesis does not just predict that there is an effect. It makes predictions about the strength of the effect, and it is always possible to specify an alternative that predicts an effect and is still consistent with the data by choosing a small effect size. Thus, Bayesian statistics can only produce meaningful results if researchers specify a meaningful alternative hypothesis. It is therefore surprising how little attention Bayesian statisticians have devoted to the issue of specifying the prior distribution. The most useful advice comes from Dienes's recommendation to specify the prior distribution as a normal distribution centered over 0 and to set the standard deviation to the expected effect size. If researchers are uncertain about the effect size, they could try different values for small (d = .2), moderate (d = .5), or large (d = .8) effect sizes. Researchers should be aware that the current default setting of .707 in Rouder's online app implies an expectation of a strong effect, that this setting will make it harder to show evidence for small effects, and that it inflates the risk of obtaining false support for the null-hypothesis.

Why Psychologists Should not Change the Way They Analyze Their Data

Wagenmakers et al. (2011) did not simply use Bayes-Factors to re-examine Bem's claims about ESP. Like several other authors, they considered Bem's (2011) article an example of major flaws in psychological science. Thus, they titled their article with the rather strong admonition that "Psychologists Must Change The Way They Analyze Their Data." They blame the use of p-values and significance tests as the root cause of all problems in psychological science: "We conclude that Bem's p values do not indicate evidence in favor of precognition; instead, they indicate that experimental psychologists need to change the way they conduct their experiments and analyze their data" (p. 426). The crusade against p-values starts with the claim that it is easy to obtain data that reject the null-hypothesis even when the null-hypothesis is true: "These experiments highlight the relative ease with which an inventive researcher can produce significant results even when the null hypothesis is true" (p. 427). However, this statement is incorrect. The probability of getting significant results is clearly specified by the type-I error rate. When the null-hypothesis is true, a significant result will emerge only 5% of the time, that is, in 1 out of 20 studies. The probability of making type-I errors repeatedly decreases exponentially. For two studies, the probability of obtaining two type-I errors is only p = .0025, or 1 out of 400 (20 * 20 studies). If some non-significant results are obtained, the binomial distribution gives the probability of obtaining the observed frequency of significant results if the null-hypothesis were true. Bem obtained 9 out of 10 significant results. With a significance probability of p = .05 per study, the binomial probability of this outcome is about 1.86e-11. Thus, there is strong evidence that Bem's results are not type-I errors. He did not just go into his lab, run 10 studies, and obtain 9 significant results by chance alone. P-values correctly quantify how unlikely this event is in a single study and how this probability decreases as the number of studies increases. The table also shows that all Bayes-Factors confirm this conclusion when the results of all studies are combined in a meta-analysis. It is hard to see how p-values can be misleading when they lead to the same conclusion as Bayes-Factors. The combined evidence presented by Bem cannot be explained by random sampling error. The data are inconsistent with the null-hypothesis. The only misleading statistic is a Bayes-Factor with an unreasonable prior distribution of effect sizes applied to small samples. All other statistics agree that the data show an effect.

Wagenmakers et al.'s (2011) next argument is that p-values only consider the conditional probability of the data when the null-hypothesis is true, but that it is also important to consider the conditional probability of the data if the alternative hypothesis is true. They fail to mention, however, that this second conditional probability is equivalent to the concept of statistical power. A p-value of less than .05 means that a significant result would be obtained only 5% of the time when the null-hypothesis is true. The probability of a significant result when an effect is present depends on the size of the effect and sampling error and can be computed using standard tools for power analysis. Importantly, Bem (2011) actually carried out an a priori power analysis and planned his studies to have 80% power. In a one-sample t-test, the standard error of the standardized mean is 1/sqrt(N). Thus, with 100 participants, the standard error is .1. With an effect size of d = .2, the signal-to-noise ratio is .2/.1 = 2. Using a one-tailed significance test, the criterion value for significance is 1.66. The implied power is 63%. Bem used an effect size of d = .25 to suggest that he had 80% power. Even with a conservative estimate of 50% power, the likelihood ratio of obtaining a significant result is .50/.05 = 10. This likelihood ratio can be interpreted like a Bayes-Factor. Thus, in a study with 50% power, it is 10 times more likely to obtain a significant result when an effect is present than when the null-hypothesis is true. Even in studies with modest power, a significant result therefore favors the alternative hypothesis much more than the null-hypothesis. To argue that p-values provide weak evidence for an effect implies that a study had very low power to show an effect. For example, if a study has only 10% power, the likelihood ratio is only 2 in favor of an effect being present. Importantly, low power cannot explain Bem's results because low power would imply that most studies produced non-significant results. However, he obtained 9 significant results in 10 studies. This success rate is itself an estimate of power and would suggest that Bem had 90% power in his studies. With 90% power, the likelihood ratio is .90/.05 = 18. The Bayesian argument against p-values is only valid for the interpretation of p-values in a single study in the absence of any information about power. Not surprisingly, Bayesians often focus on Fisher's use of p-values. However, Neyman and Pearson emphasized the need to also consider type-II error rates, and Cohen emphasized the need to conduct power analyses to ensure that small effects can be detected. In recent years, there has been an encouraging trend towards increasing the power of studies. One important consequence of high-powered studies is that significant results have greater evidential value, because a significant result is much more likely to emerge when an effect is present than when it is not present. It is also important to note that the most likely outcome in underpowered studies is a non-significant result. Thus, it is unlikely that a set of honestly reported studies can produce false evidence for an effect, because a meta-analysis would reveal that most studies fail to show the effect. The main reason for the replication crisis in psychology is the practice of not reporting non-significant results. This is not a problem of p-values, but a problem of selective reporting. Moreover, Bayes-Factors are not immune to reporting biases. As Table 1 shows, it would have been possible to provide strong evidence for ESP using Bayes-Factors as well.
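The power arithmetic in the preceding paragraph can be sketched in R (one-sample, one-tailed test, d = .2, N = 100; the numbers are the illustrative ones used above):

```r
# Sketch: implied power and the likelihood ratio of a significant result
# under H1 versus H0 for a one-sample, one-tailed test with d = .2, N = 100.
d <- 0.2; N <- 100; alpha <- 0.05
power <- power.t.test(n = N, delta = d, sd = 1, sig.level = alpha,
                      type = "one.sample", alternative = "one.sided")$power
c(implied_power     = power,          # about .63
  likelihood_ratio  = power / alpha,  # how much a significant result favors H1
  lr_at_50pct_power = 0.50 / alpha,   # = 10
  lr_at_90pct_power = 0.90 / alpha)   # = 18, the power implied by 9/10 successes
```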

To demonstrate the virtues of Bayesian statistics, Wagenmakers et al. (2011) then presented their Bayesian analyses of Bem's data. What is important here is how the authors explain the choice of their priors and how they interpret their results in the context of that choice. The authors state that they "computed a default Bayesian t test" (p. 430). The important word is default. This word makes it possible to present a Bayesian analysis without a justification of the prior distribution. The prior distribution is the default distribution, a one-size-fits-all prior that does not need any further elaboration. The authors do note that "more specific assumptions about the effect size of psi would result in a different test" (p. 430). They do not mention that these different tests would also lead to different conclusions, because the conclusion is always relative to the specified alternative hypothesis. Even less convincing is their claim that "we decided to first apply the default test because we did not feel qualified to make these more specific assumptions, especially not in an area as contentious as psi" (p. 430). It is true that the authors are not experts on psi, but such expertise is hardly necessary when Bem (2011) presented a meta-analysis and made an a priori prediction about the effect size. Moreover, they could have at least used a half-Cauchy, given that Bem used one-tailed tests.

The results of the default t-test are then used to suggest that "a default Bayesian test confirms the intuition that, for large sample sizes, one-sided p values higher than .01 are not compelling" (p. 430). This statement ignores the implication of their own critique of p-values: how compelling a p-value is depends on the power of a study. A p-value of .01 in a study with 10% power is not compelling because it is a very unlikely outcome no matter whether an effect is present or not. However, in a study with 50% power, a p-value of .01 is very compelling because the likelihood ratio is 50. That is, it is 50 times more likely to get a significant result at p = .01 in a study with 50% power when an effect is present than when an effect is not present.

The authors then emphasize that they “did not select priors to obtain a desired result” (p. 430). This statement can be confusing to non-Bayesian readers. What this statement means is that Bayes-Factors do not entail statements about the probability that ESP exists or does not exist. However, Bayes-Factors do require specification of a prior distribution. Thus, the authors did select a prior distribution, namely the default distribution, and Table 1 shows that their choice of the prior distribution influenced the results.

The authors do directly address the choice of the prior distribution and state: "we also examined other options, however, and found that our conclusions were robust. For a wide range of different non-default prior distributions on effect sizes, the evidence for precognition is either non-existent or negligible" (p. 430). These results are reported in a supplementary document. In these materials, the authors show how the scaling factor clearly influences results: small scaling factors suggest that an effect is present, whereas larger scaling factors favor the null-hypothesis. However, the Bayes-Factors in favor of an effect are not very strong. The reason is that the prior distribution is centered over 0 and a two-tailed test is being used. This makes it very difficult to distinguish the null-hypothesis from the alternative hypothesis. As shown in Table 1, priors that contrast the null-hypothesis with an actual effect provide much stronger evidence for the presence of an effect. In their conclusion, the authors state: "In sum, we conclude that our results are robust to different specifications of the scale parameter for the effect size prior under H1." This statement is more accurate than the statement in the article, where they claim that they considered a wide range of non-default prior distributions. They did not consider a wide range of different distributions. They considered a wide range of scaling parameters for a single distribution: a Cauchy distribution centered over 0. If they had considered a wide range of prior distributions, as I did in Table 1, they would have found that Bayes-Factors for some prior distributions suggest that an effect is present.

The authors then deal with the concern that Bayes-Factors depend on sample size and that larger samples might lead to different conclusions, especially when smaller samples favor the null-hypothesis: "At this point, one may wonder whether it is feasible to use the Bayesian t test and eventually obtain enough evidence against the null hypothesis to overcome the prior skepticism outlined in the previous section." The authors claimed that they are biased against the presence of an effect by a factor of 10e-24. Thus, it would require a Bayes-Factor greater than 10e24 to convince them that ESP exists. They then point out that the default Bayesian t-test, with a Cauchy(0,1) prior distribution, would produce this Bayes-Factor in a sample of 2,000 participants, and they propose that a sample size of N = 2,000 is excessive. This is not a principled robustness analysis. A much easier way to examine what would happen in a larger sample is to conduct a meta-analysis of the 10 studies, which already included 1,196 participants. As shown in Table 1, the meta-analysis would have revealed that even the default t-test favors the presence of an effect over the null-hypothesis by a factor of 6.55e10. This is still not sufficient to overcome a prejudice of the magnitude of 10e-24 against an effect, but it would have made readers wonder about the claim that Bayes-Factors are superior to p-values. There is also no need to use Bayesian statistics to be more skeptical. Skeptical researchers can also adjust the criterion value of a p-value if they want to lower the risk of a type-I error. Editors could have asked Bem to demonstrate ESP with p < .001 rather than .05 in each study, but they considered 9 out of 10 significant results at p < .05 (one-tailed) sufficient. As Bayesians provide no clear criterion values for when Bayes-Factors are sufficient, Bayesian statistics does not help editors decide how strong the evidence has to be.

Does This Mean ESP Exists?

As I have demonstrated, even Bayes-Factors using the most unfavorable prior distribution favor the presence of an effect in a meta-analysis of Bem's 10 studies. Thus, Bayes-Factors and p-values strongly suggest that Bem's data are not the result of random sampling error. It is simply too improbable that 9 out of 10 studies produce significant results when the null-hypothesis is true. However, this does not mean that Bem's data provide evidence for a real effect, because there are two explanations for systematic deviations from a random pattern (Schimmack, 2012). One explanation is that a true effect is present and that a study had good statistical power to produce a signal-to-noise ratio that yields a significant outcome. The other explanation is that no true effect is present, but that the reported results were obtained with the help of questionable research practices that inflate the type-I error rate. In a multiple-study article, publication bias cannot explain the result because all studies were carried out by the same researcher. Publication bias can only occur when a researcher conducts a single study and reports a significant result that was obtained by chance alone. However, if a researcher conducts multiple studies, type-I errors will not occur again and again, and questionable research practices (or fraud) are the only explanation for consistently significant results when the null-hypothesis is actually true.

There have been numerous analyses of Bem's (2011) data that show signs of questionable research practices (Francis, 2012; Schimmack, 2012; Schimmack, 2015). Moreover, other researchers have failed to replicate Bem's results. Thus, there is no reason to believe in ESP based on Bem's data, even though Bayes-Factors and p-values strongly reject the hypothesis that the sample means are just random deviations from 0. The problem is not that the data were analyzed with the wrong statistical method; the problem is that the data are not credible. It would be a mistake to replace the standard t-test with the default Bayesian t-test just because the default test happens to give the right answer with questionable data. The reason is that it would give the wrong answer with credible data: it would suggest that no effect is present when a researcher conducts 10 studies with 50% power and honestly reports 5 non-significant results. Rather than correctly inferring from this pattern of results that an effect is present, the default Bayesian t-test, when applied to each study individually, would suggest that the evidence is inconclusive.

Conclusion

There are many ways to analyze data. There are also many ways to conduct a Bayesian analysis. The stronger the empirical evidence is, the less important the statistical approach will be. When different statistical approaches produce different results, it is important to carefully examine the different assumptions of the statistical tests that lead to different conclusions based on the same data. There is no universally superior statistical method. Never trust a statistician who tells you that you are using the wrong statistical method. Always ask for an explanation of why one statistical method produces one result and why another statistical method produces a different result. If one method seems to make more reasonable assumptions than another (e.g., about normality, equality of variances, or the plausible range of effect sizes), use the more reasonable statistical method. I have repeatedly asked Dr. Wagenmakers to justify his choice of the Cauchy(0,1) prior, but he has not provided any theoretical or statistical arguments for this extremely wide range of effect sizes.

So, I do not think that psychologists need to change the way they analyze their data. In studies with reasonable power (50% or more), significant results are much more likely to occur when an effect is present than when it is not, and likelihood ratios will show similar results as Bayes-Factors with reasonable priors. Moreover, the probability of a type-I error in a single study is less important for researchers and for science than the long-term rate of type-II errors. Researchers need to conduct many studies to build up a CV, get jobs and grants, and take care of their graduate students. Low-powered studies will lead to many non-significant, inconclusive results. Thus, researchers need to conduct powerful studies to be successful. In the past, researchers often used questionable research practices to obtain significant results without declaring the increased risk of a type-I error. However, in part due to Bem's (2011) infamous article, questionable research practices are becoming less acceptable, and direct replication attempts more quickly reveal questionable evidence. In this new culture of open science, only researchers who carefully plan studies will be able to provide consistent empirical support for a theory, because consistent support requires that the theory actually makes correct predictions. Once researchers report all of the relevant data, it is less important how these data are analyzed. In this new world of psychological science, it will be problematic to ignore power and to use the default Bayesian t-test, because it will typically show no effect. Unless researchers are planning to build a career on confirming the absence of effects, they should conduct studies with high power and control type-I error rates by replicating and extending their own work.


Replacing p-values with Bayes-Factors: A Miracle Cure for the Replicability Crisis in Psychological Science

How Science Should Work

Lay people, undergraduate students, and textbook authors have a simple model of science. Researchers develop theories that explain observable phenomena. These theories are based on exploratory research or deduced from existing theories. Based on a theory, researchers make novel predictions that can be subjected to empirical tests. The gold-standard for an empirical test is an experiment, but when experiments are impractical, quasi-experiments or correlational designs may be used. The minimal design examines whether two variables are related to each other. In an experiment, a relation exists when an experimentally created variation produces variation in observations on a variable of interest. In a correlational study, a relation exists when two variables covary with each other. When empirical results show the expected covariation, the results are considered supportive of a theory and the theory lives another day. When the expected covariation is not observed, the theory is challenged. If repeated attempts fail to show the expected effect, researchers start developing a new theory that is more consistent with the existing evidence. In this model of science, all scientists are only motivated by the goal to build a theory that is most consistent with a robust set of empirical findings.

The Challenge of Probabilistic Predictions and Findings

I distinguish two types of science; the distinction maps onto the distinction between hard and soft sciences, but I think the key difference is whether theories are used to test deterministic relationships (i.e., relationships that hold in virtually every test of the phenomenon) or probabilistic relationships, where a phenomenon may be observed only some of the time. An example of deterministic science is chemistry, where the combination of hydrogen and oxygen leads to an explosion and water, when hydrogen and oxygen atoms combine to form H2O. An example of probabilistic science is a classic memory experiment, where more recent information is more likely to be remembered than more remote information, but memory is not deterministic and it is possible that remote information is sometimes remembered better than recent information. A unique challenge for probabilistic science is the interpretation of empirical evidence, because it is possible to make two errors in the interpretation of empirical results. These errors are called type-I and type-II errors.

Type-I errors refer to the error that the data show a theoretically predicted result when the prediction is false.

Type-II errors refer to the error that the data do not show a theoretically predicted result when the prediction is correct.

There are many reasons why a particular study may produce misleading results. Most prominently, a study may have failed to control (experimentally or statistically) for confounding factors. Another reason could be that a manipulation failed or a measure failed to measure the intended construct. Aside from these practical problems in conducting an empirical study, type-I and type-II errors can still emerge even in the most carefully conducted study with perfect measures. The reason is that empirical results in tests of probabilistic hypothesis are influenced by factors that are not under the control of the experimenter. These causal factors are sometimes called random error, sampling error, or random sampling error. The main purpose of inferential statistics is to deal with type-I and type-II errors that are caused by random error. It is also possible to conduct statistical analysis without drawing conclusions from the results. These statistics are often called descriptive statistics. For example, it is possible to compute and report the mean and standard deviation of a measure, the mean difference between two groups, or the correlation between two variables in a sample. As long as these results are merely reported they simply describe an empirical fact. They also do not test a theoretical hypothesis because scientific theories cannot make predictions about empirical results in a specific sample. Type-I or Type-II errors occur when the empirical results are used to draw inferences about results in future studies, in the population, or about the truth of theoretical predictions.

Three Approaches to the Problem of Probabilistic Science

In the world of probabilities, there is no certainty, but there are different degrees of uncertainty. As the strength of empirical evidence increases, it becomes less likely that researchers make type-I or type-II errors. The main aim of inferential statistics is to provide objective and quantitative information about the probability that the empirical data provide correct information about a hypothesis; that is, to avoid making a type-I or type-II error.

Statisticians have developed three schools of thought: Fisherian, Neyman-Pearson, and Bayesian statistics. The problem is that contemporary proponents of these approaches are still fighting about the right approach. As a prominent statistician noted, “the effect on statistics of having three (actually more) warring factions… has not been good for our professional image” (Berger, 2003, p. 4). He goes on to note that statisticians have failed to make “a concerted professional effort to provide the scientific world with a unified testing methodology.”

For applied statisticians, the distinction between Fisher and Neyman-Pearson is of relatively little practical concern because both approaches rely on the null-hypothesis and p-values. Statistics textbooks often present a hybrid model of both approaches. The Fisherian approach is to treat p-values as a measure of the strength of evidence against the null-hypothesis. As p-values approach zero, it becomes less and less likely that the null-hypothesis is true. For example, imagine a researcher computes the correlation between height and weight in a sample of N = 10 participants. The correlation is r = .50. Given the small sample size, this extreme deviation from the null-hypothesis could still have occurred by chance. As the sample size increases, random factors can produce only smaller and smaller deviations from zero, and an observed correlation of r = .50 becomes less and less likely to have occurred as a result of random sampling error (e.g., by oversampling people who are both tall and heavy and undersampling people who are both short and light).
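As a small check on this example, the p-value for r = .50 with N = 10 can be computed from the usual t-transformation of a correlation (a sketch; the numbers come from the example above):

```r
# How likely is a correlation of r = .50 in a sample of N = 10 under the
# null-hypothesis of no correlation? The t-transformation of r gives the
# two-sided p-value.
r <- 0.50; N <- 10
t_stat  <- r * sqrt(N - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = N - 2)
c(t = t_stat, p = p_value)   # p is around .14: not even significant at .05
```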

The main problem for Fisher’s approach is that it provides no guidelines about the size of a p-value that should be used to reject the null-hypothesis (there is no correlation) and therewith confirm the alternative (there is a correlation). Thus, p-values provide a quantitative measure of evidence against the null-hypothesis, but they do not provide a decision rule how strong the evidence should be to conclude that the null-hypothesis is false. As such, one might argue that Fisher’s approach is not an inferential statistical approach because it does not spell out how researchers should interpret p-values. Without a decision rule, a p-value is just an objective statistic like a sample mean or standard deviation.

Neyman and Pearson solved the problem of inference by introducing a criterion value. The most common criterion value is p = .05. When the strength of the evidence against the null-hypothesis leads to a p-value below .05, the null-hypothesis is rejected. When the p-value is above the criterion, the null-hypothesis is accepted. According to Berger (2003), Neyman and Pearson also advocated computing and reporting type-I and type-II error probabilities. Evidently, this suggestion has not been adopted in applied research, especially with regard to type-II error probabilities. The main reason for not adopting this recommendation is that the type-II error rate depends on an a priori assumption about the size of an effect. However, many hypotheses in the probabilistic sciences make only diffuse, qualitative predictions (e.g., height will be positively correlated with weight, but the correlation may range anywhere from r = .1 to .8). Applied researchers saw little value in computing type-II error rates that are based on subjective assumptions about the strength of an effect. Instead, they adopted Neyman and Pearson's criterion approach, but they used the criterion only to decide that the null-hypothesis is false when the evidence was strong enough to reject it (p < .05). In contrast, when the evidence was not strong enough to reject the null-hypothesis, the results were considered inconclusive: the null-hypothesis could be true, or the result could be a type-II error. It was not important to determine whether the null-hypothesis was true or not because researchers were more interested in demonstrating causal relationships (a drug is effective) than in showing that something does not have an effect (a drug is not effective). By never ruling in favor of the null-hypothesis, researchers could never make a type-II error in the classical sense of falsely accepting the null-hypothesis. In this context, the term type-II error assumed a new meaning. A type-II error now meant that the study had insufficient statistical power to demonstrate that the null-hypothesis was false. A study with more statistical power might be able to produce a p-value below .05 and demonstrate that the null-hypothesis is false.

The appeal of the hybrid approach was that the criterion provided meaningful information about the type-I error and that the type-II error rate was zero because results were never interpreted as favoring the null-hypothesis. The problem with this approach is that it can never lead to the conclusion that an effect is not present. For example, it is only possible to demonstrate gender differences, but it is never possible to demonstrate that men and women do not differ from each other. The main problem with this one-sided testing approach was that non-significant results seemed unimportant because they were inconclusive, and it seemed more important to report conclusive, significant results than inconclusive, non-significant ones. However, if only significant results are reported, it is no longer clear how many of these significant results might be type-I errors (Sterling, 1959). If only significant results are reported, the literature will be biased and can contain an undetermined number of type-I errors (false evidence for an effect when the null-hypothesis is true). However, this is not a problem of p-values. It is a problem of not reporting studies that failed to provide support for a hypothesis, which is needed to reveal type-I errors. As type-I errors occur at a rate of only 1 out of 20, honest reporting of all studies would quickly reveal which significant results are type-I errors.

Bayesian Statistics

The Bayesian tradition is not a unified approach to statistical inference. The main common element of Bayesian statistics is the criticism that p-values do not provide information about the probability that a hypothesis is true, p(H1|D). Bayesians argue that empirical scientists misinterpret p-values as estimates of the probability that a hypothesis is true, when they merely quantify the probability that the data could have been produced without an effect. The main aim of Bayesian statistics is to use Bayes' Theorem to obtain an estimate of p(H1|D) from the empirically observed data.

[Formula BF3: Bayes' theorem, p(H1|D) = p(D|H1) * p(H1) / (p(D|H1) * p(H1) + p(D|H0) * p(H0))]

One piece of information is the probability of an empirically observed statistic when the null-hypothesis is true, p(D|H0). This probability is closely related to p-values. Whereas the Bayesian p(D|H0) is the probability of obtaining a particular test statistic (e.g., a z-score of 1.65), p-values quantify the probability of obtaining a test statistic greater than the observed one (one-sided: p[z > 1.65] = .05; for the two-sided case, p[abs(z) > 1.96] = .05).
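The difference between the probability of a particular test statistic and the tail-area p-value is easy to illustrate with the standard normal distribution in R:

```r
# Density of a particular z-score under H0 versus tail-area p-values.
dnorm(1.65)                          # height of the null density at z = 1.65 (~ .10)
pnorm(1.65, lower.tail = FALSE)      # one-sided p-value for z = 1.65 (~ .05)
2 * pnorm(1.96, lower.tail = FALSE)  # two-sided p-value for |z| = 1.96 (~ .05)
```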

The problem is that estimating the probability that a hypothesis is true given an empirical result requires three more probabilities that are unrelated to the observed data, namely the prior probability that the null-hypothesis is true, p(H0), the prior probability that the alternative hypothesis is true, p(H1), and the probability that the data would have been observed if the alternative hypothesis is true, p(D|H1). One approach to the problem of three unknowns is to use prior knowledge or empirical data to estimate these quantities. However, the problem for many empirical studies is that there is very little reliable a priori information that can be used to estimate them.

A group of Bayesian psychologists has advocated an objective Bayesian approach to deal with the problem of unknown quantities in Bayes' Theorem (Wagenmakers et al., 2011). To deal with the problem that p(D|H1) is unknown, the authors advocate using a default a priori probability distribution of effect sizes. The next step is to compute the ratio of p(D|H0) and p(D|H1). This ratio is called the Bayes-Factor. The following formula shows that the probability of the null-hypothesis being true given the data, p(H0|D), increases as the Bayes-Factor, p(D|H0)/p(D|H1), increases. Similarly, the probability of the alternative hypothesis given the data, p(H1|D), increases as the Bayes-Factor decreases. To quantify these probabilities, one would need to make assumptions about p(H0) and p(H1), but even without making assumptions about these prior probabilities, it is clear that the ratio p(H0|D)/p(H1|D) is proportional to p(D|H0)/p(D|H1).

[Formula BF4: p(H0|D) / p(H1|D) = (p(D|H0) / p(D|H1)) * (p(H0) / p(H1)); that is, posterior odds = Bayes-Factor * prior odds]

Bayes-Factors have two limitations. First, like p-values, Bayes-Factors alone are insufficient for inferential statistics because they only quantify the relative evidence in favor of two competing hypotheses. It is not clear at which point the results of a study should be interpreted as evidence for one of the two hypotheses. For example, is a Bayes-Factor of 1.1, 2.5, 3, 10, or 100 sufficient to conclude that the null-hypothesis is true? The second problem is that the default function may not adequately characterize the alternative hypothesis. In this regard, Bayesian statistics have the same problem as Neyman-Pearson's approach, which requires a priori assumptions about the effect size in order to compute type-II error rates. In Bayesian statistics, the a priori distribution of effect sizes influences the Bayes-Factor.

In response to the first problem, Bayesians often use conventional criterion values to make decisions based on empirical data. Commonly used criterion values are Bayes-Factors of 3 or 10. A decision rule is clearly implemented in Bayesian studies with optional stopping, where a Bayes-Factor of 10 or greater is used to justify terminating a study early. Bayes-Factors with a decision criterion create a new problem in that it is now possible to obtain inconclusive results as well as results that favor the null-hypothesis. As a result, there are now two types of type-II errors. Some type-II errors occur when the Bayes-Factor meets the criterion to accept the null-hypothesis although the null-hypothesis is false. Other type-II errors occur when the null-hypothesis is false and the data are inconclusive.

So far, Bayesian statisticians have not examined type-II error rates, arguing that Bayes-Factors do not require researchers to make decisions. However, without clear decision rules, Bayes-Factors are not very appealing to applied scientists because researchers, reviewers, editors, and readers need some rational criterion to make decisions about publication and the planning of future studies. The best way to provide this information would be to examine how often Bayes-Factors of a certain magnitude lead to false conclusions; that is, to determine the type-I and the two type-II error rates that are associated with a Bayes-Factor of a certain magnitude. This question has not been systematically examined.

The Bayesian Default T-Test

As noted above, there is no unified Bayesian approach to statistical inference. Thus, it is impossible to make general statements about Bayesian statistics. Here I focus on the statistical properties of the default Bayesian t-test (Rouder, Speckman, Sun, Morey, & Iverson, 2009). Most prominently, this test was used to demonstrate the superiority of Bayes-Factors over p-values with Bem’s (2011) controversial set of studies that seemed to support extrasensory perception.

The authors provide an R-package with a function that computes Bayes-Factors based on the observed t-statistic and degrees of freedom. It is noteworthy that the Bayes-Factor is fully determined by the t-value, the degrees of freedom, and a default scaling parameter for the prior distribution. As t-values and df are also used to compute p-values, Bayes-Factors and p-values are related to each other. The main difference is that p-values have a constant meaning for different sample sizes. That is, p = .04 has the same meaning in studies with N = 10, 100, or 1000 participants. However, the Bayes-Factor for the same t-value changes as a function of sample size.
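The sketch below illustrates this dependence on sample size. It assumes the ttest.tstat() helper from the BayesFactor package, which takes a t-value and group sizes and returns the default Bayes-Factor on the natural-log scale; the exact interface may differ across package versions, so treat this as a sketch rather than a definitive recipe.

```r
# Sketch: the same t-value yields different default Bayes-Factors
# depending on sample size (assumes BayesFactor::ttest.tstat()).
library(BayesFactor)
bf_for_n <- function(n, t) {
  exp(ttest.tstat(t = t, n1 = n, n2 = n, rscale = 0.707)$bf)  # BF10
}
sapply(c(10, 50, 1000), bf_for_n, t = 2.0)
# BF10 for t = 2.0 shrinks as n grows, even though the p-value implied
# by t = 2.0 stays in the same ballpark across these sample sizes.
```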

“With smaller sample sizes that are insufficient to differentiate between approximate and exact invariances, the Bayes factors allows researchers to gain evidence for the null. This evidence may be interpreted as support for at least an approximate invariance. In very large samples, however, the Bayes factor allows for the discovery of small perturbations that negate the existence of an exact invariance.” (Rouder et al., 2009, p 233).

This means that the same population effect size can produce three different outcomes depending on sample size; it may show evidence in favor of the null-hypothesis with a small sample size, it may show inconclusive results with a moderate sample size, and it may show evidence for the alternative hypothesis with a large sample size.

The ability to compute Bayes-Factors and p-values from t-values also implies that, for a fixed sample size, p-values can be directly transformed into Bayes-Factors and vice versa. This makes it easy to compare the inferences that would be drawn from the same observed t-value when using p-values or Bayes-Factors.

The simulations used the default setting of a Cauchy distribution with a scale parameter of .707.

[Figure BF5: the default Cauchy prior distribution of effect sizes with a scale parameter of .707]

The x-axis shows potential effect sizes. The y-axis shows the weight attached to different effect sizes. The Cauchy distribution is centered over zero, giving the highest weight to an effect size of d = 0. As effect sizes increase, the weights decrease. However, even effect sizes greater than d = .8 (a strong effect, Cohen, 1988) still have notable weights, and the distribution includes effect sizes above d = 2. It is important to keep in mind that Bayes-Factors express the relative strength of evidence for or against the null-hypothesis relative to the weighted average of effect sizes implied by the default function. Thus, a Bayes-Factor can favor the null-hypothesis when the population effect size is small, because a small effect size is inconsistent with a prior distribution that assigns considerable weight to strong effect sizes.
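The shape of this default prior is easy to inspect in base R; a minimal sketch:

```r
# The default prior on effect sizes: a Cauchy distribution with scale .707.
curve(dcauchy(x, location = 0, scale = 0.707), from = -3, to = 3,
      xlab = "effect size (d)", ylab = "prior weight")
dcauchy(0.8, scale = 0.707)                        # weight at d = .8 is still notable
2 * pcauchy(2, scale = 0.707, lower.tail = FALSE)  # prior mass beyond |d| = 2 (roughly .2)
```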

The next figure shows Bayes-Factors as a function of p-values for an independent-groups t-test with n = 50 per condition. The black line shows the Bayes-Factor for H1 over H0. The red line shows the Bayes-Factor for H0 over H1. I show both ratios because I find it easier to compare Bayes-Factors greater than 1 than Bayes-Factors less than 1. The two lines cross when BF = 1, which is the point where the data favor both hypotheses equally.

[Figure BF6: Bayes-Factors (BF10 in black, BF01 in red) as a function of p-values for an independent-groups t-test with n = 50 per condition]

The graph shows the monotonic relationship between Bayes-Factors and p-values. As p-values decrease, BF10 (favoring H1 over H0, black) increases. As p-values increase, BF01 (favoring H0 over H1, red) also increases. However, the shapes of the two curves are rather different. As p-values decrease, the black line stays flat for a long time. Around p = .2, the curve begins to rise. It reaches a value of 3 just below a p-value of .05 (marked by the green line) and then increases quickly. This suggests that a Bayes-Factor of 3 corresponds roughly to a p-value of .05, and that a Bayes-Factor of 10 corresponds to a more stringent p-value. The red curve has a different shape. Starting from the left, it rises rather quickly and then slows down as p-values move towards 1. BF01 crosses the red dotted line marking BF = 3 at around p = .3, but it never reaches a factor of 10 in favor of the null-hypothesis. Thus, using a criterion of BF = 3, p-values higher than .3 would be interpreted as evidence in favor of the null-hypothesis.
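The mapping behind this figure can be sketched directly: convert two-sided p-values into t-values for df = 98 and feed them to the default test. Again, this assumes the ttest.tstat() helper from the BayesFactor package, whose interface may vary across versions.

```r
# Sketch: locate the p-values where BF10 = 3 and BF01 = 3 for n = 50 per group.
library(BayesFactor)
p    <- seq(0.001, 0.999, by = 0.001)
t    <- qt(1 - p / 2, df = 98)                 # t-value implied by each two-sided p
bf10 <- sapply(t, function(tt)
  exp(ttest.tstat(t = tt, n1 = 50, n2 = 50, rscale = 0.707)$bf))
bf01 <- 1 / bf10
max(p[bf10 > 3])   # should fall just below p = .05, as described in the text
min(p[bf01 > 3])   # should fall around p = .3, as described in the text
```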

The next figure shows the same plot for different sample sizes.

[Figure BF7: Bayes-Factors as a function of p-values for different sample sizes]

The graph shows how the Bayes-Factor of H0 over H1 (red line) increases as a function of sample size. With larger samples, BF01 also reaches the criterion value of BF = 3 at smaller and smaller p-values. With n = 1,000 in each group (total N = 2,000) the default Bayesian test is very likely to produce strong evidence in favor of either H1 or H0.

The responsiveness of BF01 to sample size makes sense. As sample size increases, statistical power to detect smaller and smaller effects also increases. In the limit a study with an infinite sample size has 100% power. That means, when the whole population has been studied and the effect size is zero, the null-hypothesis has been proven. However, even the smallest deviation from zero in the population will refute the null-hypothesis because sampling error is zero and the observed effect size is different from zero.

The graph also shows that Bayes-Factors and p-values provide approximately the same information when H1 is true. Statistical decisions based on BF10 or p-values lead to the same conclusion for matching criterion values. The standard criterion of p = .05 corresponds approximately to BF10 = 3 and BF10 = 10 corresponds roughly to p = .005. Thus, Bayes-Factors are not less likely to produce type-I errors than p-values because they reflect the same information, namely how unlikely it is that the deviation from zero in the sample is simply due to chance.

The main difference between Bayes-Factors and p-values arises in the interpretation of non-significant results (p > .05, BF10 < 3). The classic Neyman-Pearson approach would treat all non-significant results as evidence for the null-hypothesis, but would also try to quantify the type-II error rate (Berger, 2003). The Fisher-Neyman-Pearson hybrid approach treats all non-significant results as inconclusive and never decides in favor of the null-hypothesis. The default Bayesian t-test distinguishes between inconclusive results and those that favor the null-hypothesis. To distinguish between these two conclusions, it is necessary to postulate a criterion value. Using the same criterion that is used to rule in favor of the alternative hypothesis (p = .05 ~ BF10 = 3), BF01 > 3 is a reasonable criterion for deciding in favor of the null-hypothesis. Moreover, a more stringent criterion would not be useful in small samples, because BF01 can never reach values of 10 or higher. Thus, in small samples, the conclusion would always be the same as in the standard approach that treats all non-significant results as inconclusive.

Power, Type I, and Type-II Error rates of the default Bayesian t-test with BF=3 as criterion value

As demonstrated in the previous section, the results of a default Bayesian t-test depend on the amount of sampling error, which is fully determined by sample size in a between-subject design. The previous results also showed that the default Bayesian t-test has modest power to rule in favor of the null-hypothesis in small samples.

For the first simulation, I used a sample size of n = 50 per group (N = 100). The reason is that Wagenmakers and colleagues have conducted several pre-registered replication studies with a stopping rule when sample size reaches N = 100. The simulation examines how often a default t-test with 100 participants correctly identifies the null-hypothesis when the null-hypothesis is true. The criterion value was set to BF01 = 3. As the previous graph showed, this implies that any observed p-value of approximately p = .30 to 1 is considered to be evidence in favor of the null-hypothesis. The simulation with 10,000 t-tests produced 6,927 BF01 values greater than 3. This result is to be expected because p-values follow a uniform distribution when the null-hypothesis is true. Therefore, the p-value that corresponds to BF01 = 3 determines the rate of decisions in favor of the null. With p = .30 as the criterion value that corresponds to BF01 = 3, 70% of the p-values fall in the range from .30 to 1. 70% power may be deemed sufficient.
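The logic of this simulation can be sketched in a few lines of base R, using the p-value of about .30 that corresponds to BF01 = 3 in this design as the decision threshold:

```r
# Sketch: when H0 is true, p-values are uniform, so the rate of BF01 > 3
# decisions is set by the p-value that corresponds to BF01 = 3 (~ .30 here).
set.seed(1)
p <- replicate(10000, t.test(rnorm(50), rnorm(50), var.equal = TRUE)$p.value)
mean(p > 0.30)   # proportion of decisions in favor of H0; close to .70
```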

The next question is how the default Bayesian t-test behaves when the null-hypothesis is false. The answer to this question depends on the actual effect size. I conducted three simulation studies. The first simulation examined effect sizes in the moderate to large range (d = .5 to .8). Effect sizes were uniformly distributed. With a uniform distribution of effect sizes, true power ranges from 70% to 97% with an average power of 87% for the traditional criterion value of p = .05 (two-tailed). Consistent with this power analysis, the simulation produced 8,704 significant results. Using the BF10 = 3 criterion, the simulation produced 7,405 results that favored the alternative hypothesis with a Bayes-Factor greater than 3. The rate is slightly lower than for p = .05 because BF = 3 is a slightly stricter criterion. More importantly, the power of the test to show support for the alternative hypothesis is about equal to its power to support the null-hypothesis when the null-hypothesis is true: 74% vs. 70%, respectively.

The next simulation examined effect sizes in the small to moderate range (d = .2 to .5). Power ranges from 17% to 70% with an average power of 42%. Consistent with this prediction, the simulation with 10,000 t-tests produced 4,072 significant results with p < .05 as the criterion. With the somewhat stricter criterion of BF = 3, it produced only 2,434 results that favored the alternative hypothesis with BF > 3. More problematic is the finding that it favored the null-hypothesis (BF01 > 3) nearly as often, namely 2,405 times. This means that, in a between-subject design with 100 participants and a criterion value of BF = 3, a study has about 25% power to demonstrate that an effect is present, will produce inconclusive results in 50% of all cases, and will falsely support the null-hypothesis in 25% of all cases.
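A minimal sketch of this simulation is shown below. To keep it package-free, it uses the approximate p-value thresholds discussed above as proxies for the Bayes-Factor criteria (p < .05 for the traditional criterion, p > .30 as a stand-in for BF01 > 3), so the counts are rough approximations of the reported results.

```r
# Sketch: small-to-moderate effects (d uniform in .2-.5), n = 50 per group.
set.seed(2)
sim_one <- function() {
  d <- runif(1, 0.2, 0.5)
  t.test(rnorm(50, mean = d), rnorm(50), var.equal = TRUE)$p.value
}
p <- replicate(10000, sim_one())
mean(p < 0.05)   # traditional power; roughly .4
mean(p > 0.30)   # decisions in favor of H0 despite a true effect; roughly .25
```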

Things get even worse when the true effect size is very small (d > 0, d < .2). In this case, power ranges from just over 5%, the type-I error rate, to just under 17% for d = .2. The average power is just 8%. Consistent with this prediction, the simulation produced only 823 out of 10,000 significant results with the traditional p = .05 criterion. The stricter BF = 3 criterion favored the alternative hypothesis in only 289 out of 10,000 cases. However, BF01 exceeded a value of 3 in 6,201 cases; that is, the Bayes-Factor favored the null-hypothesis although it was actually false. The remaining 3,510 cases produced inconclusive results. The rate of false decisions in favor of the null-hypothesis is nearly as high as the power of the test to correctly identify the null-hypothesis when it is true (62% vs. 70%).

The previous analyses indicate that Bayes-Factors produce meaningful results when the power to detect an effect is high, but that Bayes-Factors are at risk of falsely favoring the null-hypothesis when power is low. The next simulation directly examined the relationship between power and Bayes-Factors. The simulation used effect sizes in the range from d = .001 to d = .8 with N = 100. This creates a range of power from 5% to 97% with an average power of 51%.

[Figure BF8: BF01 (red) and BF10 (blue) as a function of true power for N = 100]

In this figure, red data points show BF01 and blue data points show BF10. The right side of the figure shows that high-powered studies provide meaningful information about the population effect size as BF10 tend to be above the criterion value of 3 and BF01 are very rarely above the criterion value of 3. In contrast, on the left side, the results are misleading because most of the blue data points are below the criterion value of 3 and many BF01 data points are above the criterion value of BF = 3.

What about the probability of the data when the default alternative hypothesis is true?

A Bayes-Factor is defined as the ratio of two probabilities, the probability of the data when the null-hypothesis is true and the probability of the data when the null-hypothesis is false.  As such, Bayes-Factors combine information about two hypotheses, but it might be informative to examine each hypothesis separately. What is the probability of the data when the null-hypothesis is true and what is the probability of the data when the alternative hypothesis is true? To examine this, I computed p(D|H1) by dividing the p-values by BF01 for t-values in the range from 0 to 5.

BF01 = p(D|H0) / p(D|H1)   =>   p(D|H1) = p(D|H0) / BF01

As Bayes-Factors are sensitive to sample size (degrees of freedom), I repeated the analysis with N = 40 (n = 20), N = 100 (n = 50), and N = 200 (n = 100).
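A sketch of this computation for N = 100 (n = 50 per group) is shown below, using the p-value as p(D|H0), as in the text. It again assumes the BayesFactor package's ttest.tstat() helper, whose interface may vary across versions.

```r
# Sketch: recover p(D|H1) from the p-value and the default Bayes-Factor.
library(BayesFactor)
t     <- seq(0, 5, by = 0.1)
pD_H0 <- 2 * pt(t, df = 98, lower.tail = FALSE)   # two-sided p-values as p(D|H0)
bf10  <- sapply(t, function(tt)
  exp(ttest.tstat(t = tt, n1 = 50, n2 = 50, rscale = 0.707)$bf))
pD_H1 <- pD_H0 * bf10                             # = p(D|H0) / BF01
matplot(t, cbind(pD_H0, pD_H1), type = "l", lty = 1,
        xlab = "t-value", ylab = "probability of the data")
```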

[Figure BF9: p(D|H0) (black) and p(D|H1) for N = 40 (yellow), N = 100 (orange), and N = 200 (red) as a function of the t-value]

The most noteworthy aspect of the figure is that p-values (the black line, p(D|H0)) are much more sensitive to changes in t-values than the probabilities of the data given the alternative hypothesis (yellow N = 40, orange N = 100, red N = 200). The reason is the diffuse nature of the alternative hypothesis. It always includes a hypothesis that predicts the observed test statistic, but it also includes many other hypotheses that make other predictions. This makes the relationship between the observed test statistic, t, and the probability of t given the diffuse alternative hypothesis much flatter. The figure also shows that p(D|H0) and p(D|H1) both decrease monotonically as t-values increase. The reason is that the default prior distribution has its mode over 0. Thus, it also predicts that an effect size of 0 is the most likely outcome. It is therefore not a real alternative hypothesis that predicts a different effect size. It is merely a function that has a more muted relationship to the observed t-values. As a result, it is less compatible with low t-values and more compatible with high t-values than the steeper function for the point-null hypothesis.

Do we need Bayes-Factors to Provide Evidence in Favor of the Null-Hypothesis?

A common criticism of p-values is that they can only provide evidence against the null-hypothesis, but that they can never demonstrate that the null-hypothesis is true. Bayes-Factors have been advocated as a solution to this alleged problem. However, most researchers are not interested in testing the null-hypothesis. They want to demonstrate that a relationship exists. There are many reasons why a study may fail to produce the expected effect. However, when the predicted effect emerges, p-values can be used to rule out (with a fixed error probability) that the effect emerged simply as a result of chance alone.

Nevertheless, non-Bayesian statistics can also be used to examine whether a null-hypothesis is true, without the need to construct diffuse priors or to compare the null-hypothesis to an alternative hypothesis. The approach is so simple that it is hard to find sources that explain it. Let's assume that a researcher wants to test the null-hypothesis that Bayesian statisticians and other statisticians are equally intelligent. The researcher recruits 20 Bayesian statisticians and 20 frequentist statisticians and administers an IQ test. The Bayesian statisticians have an average IQ of 130 points. The frequentists have an average IQ of 125 points. The standard deviation of IQ scores on this test is 15 points, and IQ scores are approximately normally distributed. Thus, the standard error of the difference between the two means is 15 * (2 / sqrt(40)) = 4.7, or roughly 5 points. The figure below shows the distribution of difference scores under the assumption that the null-hypothesis is true. The red lines show the 95% confidence interval. A 5-point difference is well within the 95% confidence interval. Thus, the result is consistent with the null-hypothesis that there is no difference in intelligence between the two groups. Of course, a 5-point difference is one-third of a standard deviation, but the sample size is simply too small to infer from the data that the null-hypothesis is false.
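The arithmetic behind this example is straightforward:

```r
# Standard error of the difference between two means (n = 20 per group, SD = 15)
# and the implied 95% range under the null-hypothesis.
se_diff <- 15 * sqrt(1/20 + 1/20)   # same as 15 * (2 / sqrt(40)), about 4.7
c(-1, 1) * 1.96 * se_diff           # about +/- 9.3; a 5-point difference is well inside
```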

[Figure BF10: sampling distribution of the difference between two group means under the null-hypothesis, with the 95% confidence limits marked in red]

A more stringent test of the null-hypothesis requires a larger sample. A frequentist researcher decides that only a difference of 5 points or more would be meaningful and conducts a power analysis. She finds that a study with 143 participants in each group (N = 286) is needed to have 80% power to detect a difference of 5 points or more. A non-significant result would then suggest that the difference is smaller, or that a type-II error occurred, which has a 20% probability. The study yields a mean of 128 for the frequentists and 125 for the Bayesians. The 3-point difference is not significant. As a result, the data support the null-hypothesis that Bayesians and frequentists do not differ in intelligence by more than 5 points. A more stringent test of equality or invariance would require an even larger sample. There is no magic Bayesian bullet that can test a precise null-hypothesis in small samples.
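This power analysis can be reproduced with base R's power.t.test():

```r
# Sample size needed for 80% power to detect a 5-point IQ difference (SD = 15)
# with a two-sided test at alpha = .05.
power.t.test(delta = 5, sd = 15, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")
# n comes out at roughly 142 per group; rounding up gives the 143 per group
# mentioned above.
```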

Ignoring Small Effects is Rational: Parsimony and Occam’s Razor

Another common criticism of p-values is that they are prejudiced against the null-hypothesis because it is always possible to get a significant result simply by increasing the sample size. With N = 1,000,000, a study has 95% power to detect even an effect size of d = .007. The argument is that it is meaningless to demonstrate significance in smaller samples if it is certain that significance can always be obtained in a larger sample. The argument is flawed because it is simply not true that p-values will eventually produce a significant result as sample sizes increase. P-values will only produce significant results (beyond the type-I error rate) when a true effect exists. When the null-hypothesis is true, an honest test of the hypothesis will only produce as many significant results as the type-I error criterion specifies. Moreover, Bayes-Factors are no solution to this problem. When a true effect exists, they will also favor the alternative hypothesis no matter how small the effect is, provided sample sizes are large enough to have sufficient power. The only difference is that Bayes-Factors may falsely accept the null-hypothesis in smaller samples.

The more interesting argument against p-values is not that significant results in large studies are type-I errors, but that these results are practically meaningless. To make this point, statistics books often distinguish statistical significance from practical significance and warn that statistically significant results in large samples may have little practical significance. This warning was useful in the past, when researchers would report only p-values (e.g., women have higher verbal intelligence than men, p < .05). The p-value says nothing about the size of the effect. When only the p-value is available, it makes sense to assume that significant effects in smaller samples are larger because only large effects can be significant in these samples. However, large effects can also be significant in large samples, and large effects in small studies can be inflated by sampling error. Thus, the notion of practical significance is outdated and should be replaced by explicit questions about effect sizes. Neither p-values nor Bayes-Factors provide information about the size of an effect or the practical implications of a finding.

How can p-values be useful when there is clear evidence of a replication crisis?

Bem (2011) conducted 10 studies to demonstrate experimental evidence for anomalous retroactive influences on cognition and affect. His article reports 9 significant results and one marginally significant result. Subsequent studies have failed to replicate this finding. Wagenmakers et al. (2011) used Bem's results as an example to highlight the advantages of Bayesian statistics. The logic was that p-values are flawed and that Bayes-Factors would have revealed that Bem's (2011) evidence was weak. There are several problems with Wagenmakers et al.'s (2011) Bayesian analysis of Bem's data.

First, the reported results differ from the default Bayesian t-test implemented on Dr. Rouder's website (http://pcl.missouri.edu/bf-one-sample). The reason is that Bayes-Factors depend on the scaling factor of the Cauchy distribution. Wagenmakers et al. (2011) used a scaling factor of 1, whereas the online app uses .707 as the default. The choice of a scaling parameter gives researchers some degrees of freedom. Researchers who favor the null-hypothesis can choose a larger scaling factor, which makes the alternative hypothesis more extreme and easier to reject when effects are small. Smaller scaling factors make the Cauchy distribution narrower, which makes it easier to show evidence in favor of the alternative hypothesis with smaller effects. The behavior of Bayes-Factors for different scaling parameters is illustrated in Table 1 with Bem's data.
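Before turning to the table, the basic sensitivity to the scaling factor can be sketched with hypothetical numbers (the t-value and sample size below are illustrative, not Bem's data); the sketch again assumes the BayesFactor package's ttest.tstat() helper.

```r
# Sketch: the same one-sample result evaluated with different Cauchy scaling factors.
library(BayesFactor)
sapply(c(0.5, 0.707, 1), function(r)
  exp(ttest.tstat(t = 2.2, n1 = 100, n2 = 0, rscale = r)$bf))  # BF10, one-sample test
# For a modest t-value, larger scaling factors pull BF10 down, i.e. they
# shift the evidence towards the null-hypothesis.
```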

[Table 1 (BF11): Bayes-Factors for Bem's (2011) studies computed with different scaling parameters of the Cauchy prior]

Experiment 7 is highlighted because Bem (2011) already interpreted the non-significant result in this study as evidence that the effect disappears with supraliminal stimuli, that is, with visible stimuli. The Bayes-Factor would support Bem's (2011) conclusion that Experiment 7 shows evidence that the effect does not exist under this condition. The other studies essentially produced inconclusive Bayes-Factors, especially for the online default setting with a scaling factor of .707. The only study that produced clear evidence for ESP was Experiment 9. This study had the smallest sample size (N = 50), but a large effect size that was twice the effect size in the other studies. Of course, this difference is not reliable due to the small sample size, but it highlights how sensitive Bayes-Factors are to sampling error in small samples.

Another important feature of the Bayesian default t-test is that it centers the alternative hypothesis over 0. That is, it assigns the highest probability to the null-hypothesis, which is somewhat odd as the alternative hypothesis states that an effect should be present. The justification for this default setting is that the actual magnitude of the effect is unknown. However, it is typically possible to formulate an alternative hypothesis that allows for uncertainty, while predicting that the most likely outcome is a non-null effect size. This is especially true when previous studies provide some information about expected effect sizes. In fact, Bem (2011) explicitly planned his study with the expectation that the true effect size is small, d ~ .2. Moreover, it was demonstrated above that the default t-test is biased against small effects. Thus, the default Bayesian t-test with a scaling factor of 1 does not provide a fair test of Bem’s hypothesis against the null-hypothesis.

It is possible to use the default t-test to examine how consistent the data are with Bem's (2011) a priori prediction that the effect size is d = .2. To do this, the null-hypothesis can be formulated as d = .2, and t-values can be computed as deviations from a population parameter of d = .2. In this case, the null-hypothesis represents Bem's (2011) a priori prediction, and the alternative hypothesis predicts that observed effect sizes will deviate from this prediction because the effect is smaller (or larger). The next table shows the results for the Bayesian t-test that tests H0: d = .2 against a diffuse alternative H1: a Cauchy distribution centered over d = .2. Results are presented as BF01 so that Bayes-Factors greater than 3 indicate support for Bem's (2011) prediction.
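A rough sketch of this reformulated test is given below. It assumes a within-subject (one-sample) design, uses a hypothetical observed effect size and sample size rather than Bem's actual data, and again relies on the BayesFactor package's ttest.tstat() helper.

```r
# Sketch: treat d = .2 as the null value and test the deviation from it.
library(BayesFactor)
d_obs <- 0.25    # hypothetical observed effect size
n     <- 100     # hypothetical sample size (one-sample design)
t_dev <- (d_obs - 0.20) * sqrt(n)    # t-value for the deviation from d = .2
bf10  <- exp(ttest.tstat(t = t_dev, n1 = n, n2 = 0, rscale = 1)$bf)
1 / bf10                             # BF01: support for H0: d = .2
```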

[Table 2 (BF12): BF01 values for the test of H0: d = .2 against a Cauchy-distributed alternative centered over d = .2, for Bem's (2011) studies]

The Bayes-Factor supports Bem's prediction in all tests. Choosing a wider alternative this time provides even stronger support for Bem's prediction because the data are very consistent with the point prediction of a small effect size, d = .2. Moreover, even Experiment 7 now shows support for the hypothesis because an effect size of d = .09 is still more likely to have occurred when the true effect size is d = .2 than under a wide range of other effect sizes. Finally, Experiment 9 now shows the weakest support for the hypothesis. The reason is that Bem used only 50 participants in this study and the effect size was unusually large. This produced a low p-value in a test against zero, but it also produced the largest deviation from the a priori effect size of d = .2. However, this is to be expected in a small sample with large sampling error. Thus, the results are still supportive, but the evidence is rather weak compared to studies with larger samples and effect sizes close to d = .2.

The results demonstrate that Bayes-Factors cannot be interpreted as evidence for or against a specific hypothesis. They are influenced by the choice of the hypotheses that are being tested. In contrast, p-values have a consistent meaning. They quantify how probable it is that random sampling error alone could have produced a deviation between an observed sample parameter and a postulated population parameter. Bayesians have argued that this information is irrelevant and does not provide useful information for the testing of hypotheses. Although it is true that p-values do not quantify the probability that a hypothesis is true when significant results were observed, Bayes-Factors also do not provide this information. Moreover, Bayes-Factors are simply a ratio of two probabilities that compare two hypotheses against each other, but usually only one of the hypotheses is of theoretical interest. Without a principled and transparent approach to the formulation of alternative hypotheses, Bayes-Factors have no meaning and will change depending on different choices of the alternatives. The default approach aims to solve this by using a one-size-fits-all solution to the selection of priors. However, inappropriate priors will lead to invalid results and the diffuse Cauchy-distribution never fits any a priori theory.

 Conclusion

Statisticians have been fighting for supremacy for decades. Like civilians in a war, empirical scientists have suffered from this fight: they have been bombarded with propaganda and criticized for misunderstanding statistics or for using the wrong statistics. In reality, the statistical approaches are all related to each other, and they all rely on the ratio of the observed effect size to sampling error (i.e., the signal-to-noise ratio) to draw inferences from observed data about hypotheses. Moreover, all statistical inferences are subject to the rule that studies with less sampling error provide more robust empirical evidence than studies with more sampling error. The biggest challenge for empirical researchers is to optimize the allocation of resources so that each study has high statistical power to produce a significant result when an effect exists. With high statistical power, p-values are likely to be small (there is a 50% chance of obtaining a p-value of .005 or lower with 80% power), and Bayes-Factors and p-values provide virtually the same information for matching criterion values when an effect is present. High power also implies a relatively low frequency of type-II errors, which makes it more likely that a non-significant result occurred because the hypothesis is wrong. Thus, planning studies with high power is important no matter whether data are analyzed with frequentist or Bayesian statistics.

Studies that aim to demonstrate the lack of an effect or an invariance (there is no difference in intelligence between Bayesian and frequentist statisticians) need large samples to demonstrate invariance or have to accept that there is a high probability that a larger study would find a reliable difference. Bayes-Factors do not provide a magical tool to provide strong support for the null-hypothesis in small samples. In small samples Bayes-Factors can falsely favor the null-hypothesis even when effect sizes are in the moderate to large range.

In conclusion, like p-values, Bayes-Factors are not wrong.  They are mathematically defined entities.  However, when p-values or Bayes-Factors are used by empirical scientists to interpret their data, it is important that the numeric results are interpreted properly.  False interpretation of Bayes-Factors is just as problematic as false interpretation of p-values.  Hopefully, this blog post provided some useful information about Bayes-Factors and their relationship to p-values.

 

Further reflections on the linearity in Dr. Förster’s Data

The Jens Förster saga continues. The results are too good to be true, but how did this happen?


A previous blog examined how and why Dr. Förster’s data showed incredibly improbable linearity.

The main hypothesis was that two experimental manipulations have opposite effects on a dependent variable.

Assuming that the average effect size of a single manipulation is similar to typical effect sizes in social psychology, a single manipulation is expected to have an effect size of d = .5 (a change by half a standard deviation). As the two manipulations are expected to have opposite effects, the mean difference between the two experimental groups should be one standard deviation (0.5 + 0.5 = 1). With N = 40 and d = 1, a study has 87% power to produce a significant effect (p < .05, two-tailed). With power of this magnitude, it would not be surprising to get significant results in 12 comparisons (Table 1).

The R-Index for the comparison of the two experimental groups in Table is…


The R-Index for 18 Multiple Study Articles in Science (Francis et al., 2014)



“Only when the tide goes out do you discover who has been swimming naked.”  Warren Buffet (Value Investor).

Francis, Tanzman, and Matthews (2014) examined the credibility of psychological articles published in the prestigious journal Science. They focused on articles that contained four or more studies because (a) their statistical test has insufficient power for smaller sets of studies and (b) the authors assume that it is only meaningful to focus on studies that are published within a single article.

They found 26 articles published between 2006 and 2012. Eight articles could not be analyzed with their method.

The remaining 18 articles had a 100% success rate. That is, they never reported that a statistical hypothesis test failed to produce a significant result. Francis et al. computed the probability of this outcome for each article. When the probability was less than 10%, they made the recommendation to be…


Bayesian Statistics in Small Samples: Replacing Prejudice against the Null-Hypothesis with Prejudice in Favor of the Null-Hypothesis

when Uri and Uli agree, it must be true. 🙂
concern about correct interpretation of Bayesian statistics


Matzke, Nieuwenhuis, van Rijn, Slagter, van der Molen, and Wagenmakers (2015) published the results of a preregistered adversarial collaboration. This article has been considered a model of conflict resolution among scientists.

The study examined the effect of eye-movements on memory. Drs. Nieuwenhuis and Slagter assume that horizontal eye-movements improve memory. Drs. Matzke, van Rijn, and Wagenmakers did not believe that horizontal-eye movements improve memory. That is, they assumed the null-hypothesis to be true. Van der Molen acted as a referee to resolve conflict about procedural questions (e.g., should some participants be excluded from analysis?).

The study was a between-subject design with three conditions: horizontal eye movements, vertical eye movements, and no eye movement.

The researchers collected data from 81 participants and agreed to exclude 2 participants, leaving 79 participants for analysis. As a result there were 27 or 26 participants per condition.

The hypothesis that horizontal eye-movements improve performance…


Meta-Analysis of Observed Power: Comparison of Estimation Methods

Meta-Analysis of Observed Power

Citation: Dr. R (2015). Meta-analysis of observed power. R-Index Bulletin, Vol(1), A2.

In a previous blog post, I presented an introduction to the concept of observed power. Observed power is an estimate of the true power on the basis of observed effect size, sampling error, and significance criterion of a study. Yuan and Maxwell (2005) concluded that observed power is a useless construct when it is applied to a single study, mainly because sampling error in a single study is too large to obtain useful estimates of true power. However, sampling error decreases as the number of studies increases and observed power in a set of studies can provide useful information about the true power in a set of studies.

This blog post introduces various methods that can be used to estimate power on the basis of a set of studies (meta-analysis). I then present simulation studies that compare the various estimation methods in terms of their ability to estimate true power under a variety of conditions. In this blog post, I examine only unbiased sets of studies; that is, the sample of studies in a meta-analysis is a representative sample from the population of studies with specific characteristics. The first simulation assumes that samples are drawn from a population of studies with fixed effect size and fixed sampling error. As a result, all studies have the same true power (homogeneous power). The second simulation assumes that all studies have a fixed effect size, but that sampling error varies across studies. As power is a function of effect size and sampling error, this simulation models heterogeneity in true power. The next simulations assume heterogeneity in population effect sizes. One simulation uses a normal distribution of effect sizes. Importantly, normally distributed heterogeneity has no influence on the mean effect size because effect sizes are symmetrically distributed around it. The next simulations use skewed normal distributions. This provides a realistic scenario for meta-analyses of heterogeneous sets of studies, such as a meta-analysis of articles in a specific journal or of articles on different topics published by the same author.

Observed Power Estimation Method 1: The Percentage of Significant Results

The simplest method to determine observed power is to compute the percentage of significant results. As power is defined as the long-run percentage of significant results, the percentage of significant results in a set of studies is an unbiased estimate of this long-run percentage. The main limitation of this method is that the dichotomous measure (significant versus non-significant) is imprecise when the number of studies is small. For example, two studies can only show observed power values of 0%, 50%, or 100%, even if true power were 75%. However, the percentage of significant results plays an important role in bias tests that examine whether a set of studies is representative. When researchers hide non-significant results or use questionable research methods to produce significant results, the percentage of significant results will be higher than the rate implied by the actual power of the studies.
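A minimal sketch of this estimator applied to a small simulated set of exact replications (illustrative effect size and sample size):

```r
# Method 1: percentage of significant results in k = 10 simulated replications
# of a two-group study with d = .5 and n = 50 per group.
set.seed(3)
p <- replicate(10, t.test(rnorm(50, mean = 0.5), rnorm(50),
                          var.equal = TRUE)$p.value)
mean(p < .05)   # coarse power estimate; with only 10 studies it moves in steps of .1
```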

Observed Power Estimation Method 2: The Median

Schimmack (2012) proposed averaging the observed power of individual studies to estimate the true power of a set of studies. Yuan and Maxwell (2005) demonstrated that the average of observed power is a biased estimator of true power: it overestimates power when power is less than 50% and underestimates power when power is above 50%. Although the bias is not large (no more than 10 percentage points), Yuan and Maxwell (2005) proposed a method that produces an unbiased estimate of power in a meta-analysis of studies with the same true power (exact replication studies). Unlike the average, which is sensitive to skewed distributions, the median provides an unbiased estimate of true power because sampling error is equally likely (50:50 probability) to inflate or deflate the observed power estimate. To avoid the bias of averaging observed power, Schimmack (2014) used median observed power to estimate the replicability of a set of studies.
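A sketch of the median method, using the normal approximation to observed power and a handful of illustrative p-values:

```r
# Method 2: median observed power computed from two-sided p-values.
# Observed power = Pr(|z| > 1.96) given a non-centrality equal to the observed z.
obs_power <- function(p) {
  z <- qnorm(1 - p / 2)
  pnorm(z - qnorm(0.975)) + pnorm(-z - qnorm(0.975))
}
p_values <- c(.001, .01, .04, .20, .35)   # illustrative p-values
median(obs_power(p_values))
```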

Observed Power Estimation Method 3: P-Curve’s KS Test

Another method is implemented in Simonsohn's (2014) p-curve. P-curve was developed to obtain an unbiased estimate of a population effect size from a biased sample of studies. To achieve this goal, it is necessary to determine the power of the studies because bias is a function of power. The p-curve estimation uses an iterative approach that tries out different values of true power. For each potential value of true power, it computes the location (quantile) of the observed test statistics relative to a potential non-centrality parameter. The best-fitting non-centrality parameter is located in the middle of the observed test statistics. Once a non-central distribution has been found, it is possible to assign each observed test value a cumulative percentile of the non-central distribution. For the actual non-centrality parameter, these percentiles have a uniform distribution. To find the best-fitting non-centrality parameter from a set of possible parameters, p-curve tests whether the distribution of observed percentiles follows a uniform distribution using the Kolmogorov-Smirnov test. The non-centrality parameter with the smallest Kolmogorov-Smirnov test statistic is then used to estimate true power.

Observed Power Estimation Method 4: P-Uniform

van Assen, van Aert, and Wicherts (2014) developed another method to estimate observed power. Their method is based on the gamma distribution. Like the p-curve method, it relies on the fact that observed test statistics should follow a uniform distribution when a potential non-centrality parameter matches the true non-centrality parameter. P-uniform transforms the probabilities given a potential non-centrality parameter with a negative log-function (-log[x]) and sums these values. When the probabilities follow a uniform distribution, the expected sum of the log-transformed probabilities equals the number of studies. Thus, the value with the smallest absolute discrepancy between the sum of the negative log-transformed probabilities and the number of studies provides the estimate of observed power.

Observed Power Estimation Method 5: Averaging Standard Normal Non-Centrality Parameter

In addition to these existing methods, I introduce two novel estimation methods. The first new method converts observed test statistics into one-sided p-values. These p-values are then transformed into z-scores. This approach has a long tradition in meta-analysis that goes back to Stouffer et al. (1949) and was popularized by Rosenthal during the early days of meta-analysis (Rosenthal, 1979). Transforming probabilities into z-scores makes it easy to aggregate them because z-scores follow a symmetrical distribution. The average of these z-scores can be used as an estimate of the actual non-centrality parameter, and this average can then be converted into an estimate of true power. This approach avoids the problem that observed power has a skewed distribution, which biases a simple average of power estimates. Thus, it should provide an unbiased estimate of true power when power is homogeneous across studies.
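A sketch of this method with illustrative one-sided p-values (the capping of extreme z-scores follows the rule used in the simulations reported below):

```r
# Method 5: average the z-scores implied by one-sided p-values and convert
# the average back into an estimate of true power.
p_one <- c(.0005, .005, .02, .10, .175)   # illustrative one-sided p-values
z     <- qnorm(1 - p_one)                 # observed z-scores
z[z > 5] <- 5                             # cap extreme values at z = 5
z_bar <- mean(z)                          # estimate of the non-centrality parameter
pnorm(z_bar - qnorm(0.975))               # estimated true power (two-sided alpha = .05)
```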

Observed Power Estimation Method 6: Yuan-Maxwell Correction of Average Observed Power

Yuan and Maxwell (2005) demonstrated that a simple average of observed power is systematically biased. However, a simple average avoids the problems of transforming the data and can produce tighter estimates than the median method. Therefore, I explored whether it is possible to apply a correction to the simple average. The correction is based on Yuan and Maxwell's (2005) mathematically derived formula for the systematic bias. After averaging observed power, Yuan and Maxwell's formula is used to correct the estimate for systematic bias. The only problem with this approach is that the bias is a function of true power. However, as the average of observed power becomes an increasingly good estimator of true power in the long run, the bias correction will also become increasingly better at correcting the right amount of bias.

The Yuan-Maxwell correction approach is particularly promising for meta-analysis of heterogeneous sets of studies such as sets of diverse studies in a journal. The main advantage of this method is that averaging of power makes no assumptions about the distribution of power across different studies (Schimmack, 2012). The main limitation of averaging power was the systematic bias, but Yuan and Maxwell’s formula makes it possible to reduce this systematic bias, while maintaining the advantage of having a method that can be applied to heterogeneous sets of studies.

RESULTS

Homogeneous Effect Sizes and Sample Sizes

The first simulation used 100 effect sizes ranging from .01 to 1.00 and 50 sample sizes ranging from 11 to 60 participants per condition (Ns = 22 to 120), yielding 5,000 different populations of studies. The true power of these studies was determined on the basis of the effect size, the sample size, and the criterion p < .025 (one-tailed), which is equivalent to p < .05 (two-tailed). Sample sizes were chosen so that average power across the 5,000 populations of studies was 50%. The simulation drew 10 random samples from each of the 5,000 populations of studies. Each simulated study was a between-subject design with the given population effect size and sample size. The results were stored as one-tailed p-values. For the meta-analysis, p-values were converted into z-scores. To avoid biases due to extreme outliers, z-scores greater than 5 were set to 5 (observed power = .999).

The six estimation methods were then used to compute observed power on the basis of samples of 10 studies. The following figures show observed power as a function of true power. The green lines show the 95% confidence interval for different levels of true power. The figures also include red dashed lines for a value of 50% power; studies with more than 50% observed power would be significant, whereas studies with less than 50% observed power would be non-significant. The figures also include a blue line for 80% true power, because Cohen (1988) recommended that researchers should aim for a minimum of 80% power. It is instructive to see how accurately the estimation methods evaluate whether a set of studies met this criterion.

The histogram shows the distribution of true power across the 5,000 populations of studies.

[Figure YMCA fig1: histogram of true power across the 5,000 simulated populations of studies]

The histogram shows that the simulation covers the full range of power. It also shows that high-powered studies are overrepresented because moderate to large effect sizes can achieve high power for a wide range of sample sizes. The distribution is not important for the evaluation of different estimation methods and benefits all estimation methods equally because observed power is a good estimator of true power when true power is close to the maximum (Yuan & Maxwell, 2005).

The next figure shows scatterplots of observed power as a function of true power. Values above the diagonal indicate that observed power overestimates true power. Values below the diagonal show that observed power underestimates true power.

[Figure YMCA fig2: scatterplots of observed power as a function of true power for the six estimation methods]

Visual inspection of the plots suggests that all methods provide unbiased estimates of true power. Another observation is that the count of significant results provides the least accurate estimates of true power. The reason is simply that aggregation of dichotomous variables requires a large number of observations to approximate true power. The third observation is that visual inspection provides little information about the relative accuracy of the other methods. Finally, the plots show how accurate observed power estimates are in meta-analysis of 10 studies. When true power is 50%, estimates very rarely exceed 80%. Similarly, when true power is above 80%, observed power is never below 50%. Thus, observed power can be used to examine whether a set of studies met Cohen’s recommended guidelines to conduct studies with a minimum of 80% power. If observed power is 50%, it is nearly certain that the studies did not have the recommended 80% power.

To examine the relative accuracy of the different estimation methods quantitatively, I computed bias scores (observed power – true power). As bias can be positive (overestimation) or negative (underestimation), the standard deviation of these bias scores can be used to quantify the precision of the various estimation methods. In addition, I present the mean to examine whether a method has large-sample accuracy (i.e., the bias approaches zero as the number of simulations increases). I also present the percentage of studies with no more than 20 percentage points of bias. Although 20 percentage points may seem large, it is not necessary to estimate power with very high precision. When observed power is below 50%, it suggests that a set of studies was underpowered even if the observed power estimate is an underestimation.

[Figure YMCA fig12: quantitative comparison of the estimation methods (mean bias, standard deviation of bias, percentage within 20 percentage points)]

The quantitative analysis also shows no meaningful differences among the estimation methods. The more interesting question is how these methods perform under more challenging conditions, when the sets of studies are no longer exact replication studies with fixed power.

Homogeneous Effect Size, Heterogeneous Sample Sizes

The next simulation introduced variation in sample sizes. For each population of studies, sample sizes were varied by multiplying a base sample size by factors of 1 to 5.5 (1.0, 1.5, 2.0, …, 5.5). Thus, a base sample size of 40 created a range of sample sizes from 40 to 220, and a base sample size of 100 created a range of sample sizes from 100 to 550. As variation in sample sizes increases the average sample size, the range of effect sizes was limited to .004 to .4, with effect sizes increasing in steps of d = .004. The histogram shows the distribution of power in the 5,000 populations of studies.

[Figure YMCA fig4: histogram of true power for the simulation with heterogeneous sample sizes]

The simulation covers the full range of true power, although studies with low and very high power are overrepresented.

The results are visually not distinguishable from those in the previous simulation.

[Figure YMCA fig5: scatterplots of observed power as a function of true power with heterogeneous sample sizes]

The quantitative comparison of the estimation methods also shows very similar results.

[Figure YMCA fig6: quantitative comparison of the estimation methods with heterogeneous sample sizes]

In sum, all methods perform well even when true power varies as a function of variation in sample sizes. This conclusion may not generalize to more extreme simulations of variation in sample sizes, but more extreme variations in sample sizes would further increase the average power of a set of studies because the average sample size would increase as well. Thus, variation in effect sizes poses a more realistic challenge for the different estimation methods.

Heterogeneous, Normally Distributed Effect Sizes

The next simulation used a random normal distribution of true effect sizes. Effect sizes were simulated to have substantial variation. Starting effect sizes ranged from .208 to 1.000 and increased in increments of .008. Sample sizes ranged from 10 to 60 and increased in increments of 2 to create 5,000 populations of studies. For each population of studies, effect sizes were sampled randomly from a normal distribution with a standard deviation of SD = .2. Extreme effect sizes below d = -.05 were set to -.05, and extreme effect sizes above d = 1.20 were set to 1.20. The first histogram shows the 50,000 sampled population effect sizes. The histogram on the right shows the distribution of true power for the 5,000 sets of 10 studies.

[Figure YMCA fig7: histograms of the population effect sizes and of true power for the simulation with normally distributed effect sizes]

The plots of observed and true power show that the estimation methods continue to perform rather well even when population effect sizes are heterogeneous and normally distributed.

[Figure YMCA fig9: scatterplots of observed power as a function of true power with normally distributed effect sizes]

The quantitative comparison suggests that puniform has some problems with heterogeneity. More detailed studies are needed to examine whether this is a persistent problem for puniform, but given the good performance of the other methods it seems easier to use these methods.

[Figure YMCA fig8: quantitative comparison of the estimation methods with normally distributed effect sizes]

Heterogeneous, Skewed Normal Effect Sizes

The next simulation puts the estimation methods to a stronger test by introducing skewed distributions of population effect sizes. For example, a set of studies may contain mostly small to moderate effect sizes, but a few studies may have examined large effect sizes. To simulate skewed effect size distributions, I used the rsnorm function of the fGarch package. The function creates a random distribution with a specified mean, standard deviation, and skew. I set the mean to d = .2, the standard deviation to SD = .2, and the skew to 2. The histograms show the distribution of effect sizes and the distribution of true power for the 5,000 sets of studies (k = 10).
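A minimal sketch of this sampling step, assuming that rsnorm() in the installed fGarch version takes the skewness via its xi argument:

```r
# Sketch: drawing skewed population effect sizes with fGarch::rsnorm().
library(fGarch)
set.seed(4)
es <- rsnorm(10000, mean = 0.2, sd = 0.2, xi = 2)   # xi = 2 gives a right-skewed distribution
hist(es, main = "skewed population effect sizes", xlab = "d")
```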

[Figure YMCA fig10: histograms of the skewed effect size distribution and of true power for the 5,000 sets of studies]

This time the results show differences in the ability of the various estimation methods to deal with skewed heterogeneity. The percentage of significant results is unbiased, but imprecise due to the problem of averaging a dichotomous variable. The other methods show systematic deviations from the 95% confidence interval around the true parameter. Visual inspection suggests that the Yuan-Maxwell correction method has the best fit.

[Figure YMCA fig11: scatterplots of observed power as a function of true power with skewed effect size distributions]

This impression is confirmed in quantitative analyses of bias. The quantitative comparison confirms major problems with the puniform estimation method. It also shows that the median, p-curve, and the average z-score method have the same slight positive bias. Only the Yuan-Maxwell corrected average power shows little systematic bias.

[Figure YMCA fig12: quantitative comparison of the estimation methods with skewed effect size distributions]

To examine biases in more detail, the following graphs plot bias as a function of true power. These plots can reveal that a method may have little average bias, but has different types of bias for different levels of power. The results show little evidence of systematic bias for the Yuan-Maxwell corrected average of power.

[Figure YMCA fig13: bias as a function of true power for the six estimation methods]

The following analyses examined bias separately for simulation with less or more than 50% true power. The results confirm that all methods except the Yuan-Maxwell correction underestimate power when true power is below 50%. In contrast, most estimation methods overestimate true power when true power is above 50%. The exception is puniform which still underestimated true power. More research needs to be done to understand the strange performance of puniform in this simulation. However, even if p-uniform could perform better, it is likely to be biased with skewed distributions of effect sizes because it assumes a fixed population effect size.

YMCA fig14

Conclusion

This investigation introduced and compared different methods to estimate true power for a set of studies. All estimation methods performed well when a set of studies had the same true power (exact replication studies), when effect sizes were homogeneous and sample sizes varied, and when effect sizes were normally distributed and sample sizes were fixed. However, most estimation methods were systematically biased when the distribution of effect sizes was skewed. In this situation, most methods run into problems because the percentage of significant results is a function of the power of the individual studies rather than of the average power.

The results of these analyses suggest that the R-Index (Schimmack, 2014) can be improved by simply averaging power and then applying the Yuan-Maxwell correction. However, it is important to realize that the median method tends to overestimate power when power is greater than 50%. This makes it even more unlikely that the R-Index produces a low power estimate when true power is actually high. The next step in the investigation of observed power is to examine how different methods perform in unrepresentative (biased) sets of studies. In this case, the percentage of significant results is highly misleading. For example, Sterling et al. (1995) found that about 95% of published studies reported significant results, which would naively suggest that studies had 95% power. However, publication bias and questionable research practices create a bias in the sample of studies that are being published in journals. The question is whether other observed power estimates can reveal this bias and produce accurate estimates of the true power in a set of studies.

An Introduction to Observed Power based on Yuan and Maxwell (2005)

Yuan, K.-H., & Maxwell, S. (2005). On the Post Hoc Power in Testing Mean Differences. Journal of Educational and Behavioral Statistics, 30, 141–167.

This blog post provides an accessible introduction to the concept of observed power. Most of the statistical points are based on Yuan and Maxwell's (2005) excellent but highly technical article about post-hoc power. This blog post tries to explain the statistical concepts in more detail and uses simulation studies to illustrate important points.

What is Power?

Power is defined as the long-run probability of obtaining significant results in a series of exact replication studies. For example, 50% power means that a set of 100 studies is expected to produce 50 significant results and 50 non-significant results. The exact numbers in an actual set of studies will vary as a function of random sampling error, just like 100 coin flips are not always going to produce a 50:50 split of heads and tails. However, as the number of studies increases, the percentage of significant results will converge on the true power of the study.
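
This long-run definition is easy to verify by simulation. The following sketch (my own illustration, using a two-group t-test with roughly 50% true power) counts the proportion of significant results across many simulated studies:

# Power as the long-run rate of significant results.
set.seed(123)
d <- .5; n <- 31                          # about 50% power for a two-group t-test
sig <- replicate(10000, {
  x <- rnorm(n, mean = 0); y <- rnorm(n, mean = d)
  t.test(x, y)$p.value < .05              # is this simulated study significant?
})
mean(sig)                                 # close to the true power ...
power.t.test(n = n, delta = d)$power      # ... computed analytically (about .50)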

A priori power

Power analysis can be useful for planning sample sizes before a study is conducted. A power analysis that is conducted before a study is called an a priori power analysis (a priori = before). Power is a function of three parameters: the actual effect size, sampling error, and the criterion value that needs to be exceeded to claim statistical significance. In between-subject designs, sampling error is determined by sample size alone. In this special case, power is a function of the true effect size, the significance criterion, and the sample size.

The problem for researchers is that power depends on the effect size in the population (e.g., the true correlation between height and weight among Canadians in 2015). The population effect size is sometimes called the true effect size. Imagine that somebody actually obtained data from everybody in a population. In this case, there is no sampling error and the observed correlation is the true correlation in the population. However, researchers typically use much smaller samples, and the goal is to estimate the correlation in the population on the basis of a smaller sample. Unfortunately, power depends on the correlation in the population, which is unknown to a researcher planning a study. Therefore, researchers have to estimate the true effect size to conduct an a priori power analysis.

Cohen (1988) developed general guidelines for the estimation of effect sizes. For example, in studies that compare the means of two groups, a standardized difference of half a standard deviation (e.g., 7.5 IQ points on an IQ scale with a standard deviation of 15) is considered a moderate effect. Researchers who assume that their predicted effect has a moderate effect size can use d = .5 for an a priori power analysis. Assuming that they want to claim significance with the standard criterion of p < .05 (two-tailed), they would need N = 210 (n = 105 per group) to have a 95% chance to obtain a significant result (G*Power). I do not discuss a priori power analysis further because this blog post is about observed power. I merely introduced a priori power analysis to highlight the difference between a priori power analysis and a posteriori power analysis, which is the main topic of Yuan and Maxwell's (2005) article.
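
The same number can be verified without G*Power, for example with R's built-in power.t.test function (my own check, not part of the original analysis):

# A priori power analysis for a two-group comparison with d = .5 and 95% power.
power.t.test(delta = .5, sd = 1, sig.level = .05, power = .95)
# n comes out at about 105 per group, i.e. N = 210 in total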

A Posteriori Power Analysis: Observed Power

Observed power computes power after a study or several studies have been conducted. The key difference between a priori and a posteriori power analysis is that a posteriori power analysis uses the observed effect size in a study as an estimate of the population effect size. For example, assume a researcher found a correlation of r = .54 in a sample of N = 200 Canadians. Instead of guessing the effect size, the researcher uses the correlation observed in this sample as an estimate of the correlation in the population. There are several reasons why it might be interesting to conduct a power analysis after a study. First, the power analysis might be used to plan a follow up or replication study. Second, the power analysis might be used to examine whether a non-significant result might be the result of insufficient power. Third, observed power is used to examine whether a researcher used questionable research practices to produce significant results in studies that had insufficient power to produce significant results.

In sum, observed power is an estimate of the power of a study based on the observed effect size in that study. It is therefore not power that is being observed, but the effect size. Because the other parameters that are needed to compute power are known (sample size, significance criterion), the observed effect size is the only parameter that needs to be estimated from the data. However, it is important to realize that observed power does not mean that power was actually observed. Observed power is still an estimate, because power depends on the effect size in the population (which remains unobserved) and the observed effect size in a sample is just an estimate of the population effect size.

A Posteriori Power Analysis after a Single Study

Yuan and Maxwell (2005) examined the statistical properties of observed power. The main question was whether it is meaningful to compute observed power based on the observed effect size in a single study.

The first statistical analysis of an observed mean difference is to examine whether the study produced a significant result. For example, the study may have examined whether music lessons produce an increase in children's IQ. The study had 95% power to produce a significant difference with N = 210 participants (n = 105 per group) and a moderate effect size (d = .5; 7.5 IQ points).

One possibility is that the study actually produced a significant result. For example, the observed IQ difference was 5 IQ points. This is less than the expected difference of 7.5 points and corresponds to a standardized effect size of about d = .3 (5/15). The t-test nevertheless shows a significant difference between the two groups, t(208) = 2.42, p = .017 (about 1 / 60). The p-value shows that random sampling error alone would produce differences of this magnitude or more in only about 1 out of 60 studies. Importantly, the p-value only makes it very likely that the intervention contributed to the mean difference; it does not provide information about the size of the effect. The true effect size may be closer to the expected effect size of 7.5 or it may be closer to 0. The true effect size remains unknown even after the mean difference between the two groups is observed. Yet, the study provides some useful information about the effect size. Whereas the a priori power analysis relied exclusively on guess-work, observed power uses the effect size that was observed in a reasonably large sample of 210 participants. Everything else being equal, effect size estimates based on 210 participants are more likely to match the true effect size than a guess based on no data at all.

The observed effect size can be entered into a power analysis to compute observed power. In this example, observed power with an effect size of d = .3 and N = 210 (n = 105 per group) is 58%.   One question examined by Yuan and Maxwell (2005) is whether it can be useful to compute observed power after a study produced a significant result.

The other question is whether it can be useful to compute observed power when a study produced a non-significant result. For example, assume that the estimate of d = .5 was overly optimistic and that the true effect of music lessons on IQ is a more modest 1.5 IQ points (d = .10, one-tenth of a standard deviation). The actual mean difference that is observed in the study happens to match the true effect size exactly. The difference between the two groups is not statistically significant, t(208) = .72, p = .47. A non-significant result is difficult to interpret. On the one hand, the means trend in the right direction. On the other hand, the mean difference is not statistically significant. The p-value suggests that a mean difference of this magnitude would occur in every second study by chance alone even if the music intervention had no effect on IQ at all (i.e., the true effect size is d = 0 and the null-hypothesis is true). Statistically, the correct conclusion is that the study provided insufficient information regarding the influence of music lessons on IQ. In other words, assuming that the true effect size is closer to the observed effect size in the sample (d = .1) than to the effect size that was used to plan the study (d = .5), the sample size was insufficient to produce a statistically significant result. Computing observed power merely provides some quantitative information to reinforce this correct conclusion. An a posteriori power analysis with d = .1 and N = 210 yields an observed power of 11%. This suggests that the study had insufficient power to produce a significant result, if the effect size in the sample matches the true effect size.
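
The observed power values of 58% and 11% mentioned above can be checked with R's power.t.test (my own computation):

# Observed power for the significant (d = .3) and non-significant (d = .1) examples.
power.t.test(n = 105, delta = .3, sd = 1, sig.level = .05)$power   # about .58
power.t.test(n = 105, delta = .1, sd = 1, sig.level = .05)$power   # about .11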

Yuan and Maxwell (2005) discuss false interpretations of observed power. One false interpretation is that a significant result implies that a study had sufficient power. Power is a function of the true effect size, whereas observed power relies on the effect size in a sample. Half of the time, the effect size in a sample overestimates the true effect size, and observed power is then inflated. It is therefore possible that observed power is considerably higher than the actual power of a study.

Another false interpretation is that low power in a study with a non-significant result means that the hypothesis is correct but that the study had insufficient power to demonstrate it. The problem with this interpretation is that there are two potential reasons for a non-significant result. One is that the study had insufficient power to show a significant result even though an effect is actually present (a type-II error). The other possible explanation is that the null-hypothesis is actually true (there is no effect). A non-significant result cannot distinguish between these two explanations. Yet, it remains true that the study had insufficient power to test these hypotheses against each other. Even if a study had 95% power to show an effect if the true effect size is d = .5, it can have insufficient power if the true effect size is smaller. In the example, power decreased from 95%, assuming d = .5, to 11%, assuming d = .1.

Yuan and Maxwell's Demonstration of Systematic Bias in Observed Power

Yuan and Maxwell focus on a design in which a sample mean is compared against a population mean and the standard deviation is known. To modify the original example, a researcher could recruit a random sample of children, provide a music lesson intervention, and test the IQ after the intervention against the population mean of 100 with the population standard deviation of 15, rather than relying on the standard deviation in a sample as an estimate of the standard deviation. This scenario has some advantages for mathematical treatment because it uses the standard normal distribution. However, the conclusions can be generalized to more complex designs. Thus, although Yuan and Maxwell focus on an unusual design, their conclusions hold for more typical designs such as the comparison of two groups that use sample variances (standard deviations) to estimate the variance in a population (i.e., pooling the observed variances of both groups to estimate the population variance).

Yuan and Maxwell (2005) also focus on one-tailed tests, although the default criterion in actual studies is a two-tailed test. Once again, this is not a problem for their conclusions because the two-tailed criterion value for p = .05 is equivalent to the one-tailed criterion value for p = .025 (.05 / 2). For the standard normal distribution, the value is z = 1.96. This means that an observed z-score has to exceed a value of 1.96 to be considered significant.

To illustrate this with an example, assume that the IQ of 100 children after a music intervention is 103. After subtracting the population mean of 100 and dividing by the standard deviation of 15, the effect size is d = 3/15 = .2. Sampling error is defined by 1 / sqrt(n). With a sample size of n = 100, sampling error is .10. The test statistic (z) is the ratio of the effect size and sampling error (.2 / .1) = 2. A z-score of 2 is just above the critical value of 1.96 and would produce a significant result, z = 2, p = .023 (one-tailed; remember that the criterion is .025 one-tailed to match .05 two-tailed). Based on this result, a researcher would be justified to reject the null-hypothesis (there is no effect of the intervention) and to claim support for the hypothesis that music lessons lead to an increase in IQ. Importantly, this hypothesis makes no claim about the true effect size. It merely states that the effect is greater than zero. The observed effect size in the sample (d = .2) provides an estimate of the actual effect size, but the true effect size can be smaller or larger than the effect size in the sample. The significance test merely rejects the possibility that the effect size is 0 or less (i.e., that music lessons lower IQ).
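
The arithmetic of this example can be written out in a few lines of R (my own restatement of the numbers in the text):

# One-sample z-test with known SD = 15.
d  <- (103 - 100) / 15        # observed effect size, d = .2
se <- 1 / sqrt(100)           # sampling error with n = 100
z  <- d / se                  # z = 2
p  <- 1 - pnorm(z)            # one-tailed p = .023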

YM formula1

Entering a non-centrality parameter of 3 for a generic z-test in G*Power (a non-centrality parameter of 3 would correspond, for example, to d = .3 with n = 100) yields the following illustration of a non-central distribution.

YM figure1

Illustration of non-central distribution using G*Power output

The red curve shows the standard normal distribution for the null-hypothesis. With d = 0, the non-centrality parameter is also 0 and the standard normal distribution is centered over zero.

The blue curve shows the non-central distribution. It is the same standard normal distribution, but now it is centered over z = 3.   The distribution shows how z-scores would be distributed for a set of exact replication studies, where exact replication studies are defined as studies with the same true effect size and sampling error.

The figure also illustrates power by showing the critical z-score of 1.96 as a green line. On the left side are studies in which sampling error reduced the observed effect size so much that the z-score was below 1.96 and produced a non-significant result (p > .025 one-tailed, p > .05 two-tailed). On the right side are studies with significant results. The area under the curve on the left side is the type-II error probability (beta error). The area under the curve on the right side is power (1 – type-II error). The output shows that the beta error probability is 15% and power is 85%.
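
The beta and power values in this output follow directly from the critical value and the non-centrality parameter (my own computation):

ncp   <- 3                    # non-centrality parameter used in the illustration
beta  <- pnorm(1.96 - ncp)    # type-II error probability, about .15
power <- 1 - beta             # about .85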

YM formula2

In sum, the formula

YM formula3

states that power for a given true effect size is the area under the curve to the right side of a critical z-score for a standard normal distribution that is centered over the non-centrality parameter that is defined by the ratio of the true effect size over sampling error.

[personal comment: I find it odd that sampling error is used on the right side of the formula but not on the left side of the formula. Power is a function of the non-centrality parameter and not just the effect size. Thus I would have included sqrt (n) also on the left side of the formula].

Because the formula relies on the true effect size, it specifies true power given the (unknown) population effect size. To use it for observed power, power has to be estimated based on the observed effect size in a sample.

The important novel contribution of Yuan and Maxwell (2005) was to develop a mathematical formula that relates observed power to true power and to find a mathematical formula for the bias in observed power.

YM formula4

The formula implies that the amount of bias is a function of the unknown population effect size. Yuan and Maxwell make several additional observations about bias. First, bias is zero when true power is 50%.   The second important observation is that systematic bias is never greater than 9 percentage points. The third observation is that power is overestimated when true power is less than 50% and underestimated when true power is above 50%. The last observation has important implications for the interpretation of observed power.
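
These properties are easy to verify with a small simulation (my own sketch, not Yuan and Maxwell's derivation), using the one-sample z-test framework with a one-tailed criterion of z = 1.96 described above:

# Average observed power across many replications, compared with true power.
set.seed(123)
crit <- 1.96                                    # one-tailed criterion (= .05 two-tailed)
true.ncp <- seq(0, 5, by = .1)
true.power <- pnorm(true.ncp - crit)
mean.obs.power <- sapply(true.ncp, function(ncp) {
  z.obs <- rnorm(100000, mean = ncp, sd = 1)    # observed z-scores in replication studies
  mean(pnorm(z.obs - crit))                     # average observed power
})
bias <- mean.obs.power - true.power
max(abs(bias))                                  # stays below .09 (about 9 percentage points)
bias[which.min(abs(true.power - .5))]           # essentially zero near 50% true power
range(bias[true.power < .5])                    # positive: overestimation below 50% power
range(bias[true.power > .5])                    # negative: underestimation above 50% power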

50% power implies that the test statistic matches the criterion value. For example, if the criterion is p < .05 (two-tailed), 50% observed power is equivalent to p = .05. If observed power is less than 50%, a study produced a non-significant result. An a posteriori power analysis might suggest that observed power is only 40%. This finding suggests that the study was underpowered and that a more powerful study might produce a significant result. Systematic bias implies that the estimate of 40% is more likely to be an overestimation than an underestimation. As a result, bias does not undermine the conclusion. Rather, observed power is conservative because the actual power is likely to be even less than 40%.

The alternative scenario is that observed power is greater than 50%, which implies a significant result. In this case, observed power might be used to argue that a study had sufficient power because it did produce a significant result. The power analysis might show, however, that observed power is only 60%. This would indicate that there was a relatively high chance to end up with a non-significant result. However, systematic bias implies that observed power is more likely to underestimate true power than to overestimate it. Thus, true power is likely to be higher. Again, observed power is conservative when it comes to the interpretation of power for studies with significant results. This suggests that systematic bias is not a serious problem for the use of observed power. Moreover, the systematic bias is never more than 9 percentage points. Thus, systematic bias alone cannot distort an observed power estimate of 60% by more than about 9 percentage points.

In sum, Yuan and Maxwell (2005) provided a valuable analysis of observed power and demonstrated analytically the properties of observed power.

Practical Implications of Yuan and Maxwell’s Findings

Based on their analyses, Yuan and Maxwell (2005) draw the following conclusions in the abstract of their article.

Using analytical, numerical, and Monte Carlo approaches, our results show that the estimated power does not provide useful information when the true power is small. It is almost always a biased estimator of the true power. The bias can be negative or positive. Large sample size alone does not guarantee the post hoc power to be a good estimator of the true power.

Unfortunately, other scientists often only read the abstract, especially when an article contains mathematical formulas that applied scientists find difficult to follow. As a result, Yuan and Maxwell's (2005) article has been cited mostly as evidence that observed power is a useless concept. I think this conclusion is justified based on Yuan and Maxwell's abstract, but it does not follow from Yuan and Maxwell's formula of bias. To make this point, I conducted a simulation study that paired 25 sample sizes (n = 10 to n = 250) and 20 effect sizes (d = .05 to d = 1) to create 500 non-centrality parameters. Observed effect sizes were randomly generated for a between-subject design with two groups (df = n*2 – 2). For each non-centrality parameter, two simulations were conducted for a total of 1,000 studies with heterogeneous effect sizes and sample sizes (standard errors). The results are presented in a scatterplot with true power on the x-axis and observed power on the y-axis. The blue line shows the prediction of observed power from true power. The red curve shows the biased prediction based on Yuan and Maxwell's bias formula.

YM figure2

The most important observation is that observed power varies widely as a function of random sampling error in the observed effect sizes. In comparison, the systematic bias is relatively small. Moreover, observed power at the extremes clearly distinguishes between low power (< 25%) and high power (> 80%). Observed power is particularly informative when it is close to the maximum value of 100%. Thus, observed power of 99% or more strongly suggests that a study had high power. The main problem for a posteriori power analysis is that observed effect sizes are imprecise estimates of the true effect size, especially in small samples. The next section examines the consequences of random sampling error in more detail.

Standard Deviation of Observed Power

Awareness has been increasing that point estimates of statistical parameters can be misleading. For example, an effect size of d = .8 suggests a strong effect, but if this effect size was observed in a small sample, it is strongly influenced by sampling error. One solution to this problem is to report a confidence interval around the observed effect size. A common choice is the 95% confidence interval, which is defined by the observed effect size plus or minus sampling error times 1.96 (approximately 2). With a sampling error of .4, the confidence interval around d = .8 could range all the way from 0 to 1.6. As a result, it would be misleading to claim that an effect size of d = .8 in a small sample suggests that the true effect size is strong. A 95% confidence interval means that there is a 95% probability that the population effect size is contained in the 95% confidence interval around the observed effect size in a sample.

To illustrate the use of confidence intervals, I computed the confidence interval for the example of music training and IQ in children. The example assumes that the IQ of 100 children after a music intervention is 103. After subtracting the population mean of 100 and dividing by the standard deviation of 15, the effect size is d = 3/15 = .2. Sampling error is defined by 1 / sqrt(n). With a sample size of n = 100, sampling error is .10. To compute a 95% confidence interval, sampling error is multiplied by the z-scores that capture 95% of a standard normal distribution (1.96). As sampling error is .10, the resulting values are -.196 and .196. Given an observed effect size of d = .2, the 95% confidence interval ranges from .2 – .196 = .004 to .2 + .196 = .396.
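
The same interval can be computed in one line of R (my own check):

# 95% confidence interval around the observed effect size in the example.
d <- .2; se <- 1 / sqrt(100)
d + c(-1, 1) * qnorm(.975) * se     # about .004 to .396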

A confidence interval can be used for significance testing by examining whether the confidence interval includes 0. If the 95% confidence interval does not include zero, it is possible to reject the hypothesis that the effect size in the population is 0, which is equivalent to rejecting the null-hypothesis. In the example, the lower end of the confidence interval is d = .004, which implies that the null-hypothesis can be rejected. At the upper end, the confidence interval ends at d = .396. This implies that the empirical results also reject the hypotheses that the population effect size is moderate (d = .5) or strong (d = .8).

Confidence intervals around effect sizes are also useful for a posteriori power analysis. Yuan and Maxwell (2005) demonstrated that a confidence interval for observed power can be obtained by computing observed power for the effect sizes that define the confidence interval around the observed effect size.

YM formula5

The figure below illustrates the observed power for the lower bound of the confidence interval in the example of music lessons and IQ (d = .004).

YM figure3

The figure shows that the non-central distribution (blue) and the central distribution (red) nearly perfectly overlap. The reason is that the observed effect size (d = .004) is just slightly above the d-value of the central distribution when the effect size is zero (d = .000). When the null-hypothesis is true, power equals the type-I error rate (2.5%) because 2.5% of studies will produce a significant result by chance alone and chance is the only factor that produces significant results. When the true effect size is d = .004, power increases to 2.74 percent.

Remember that this power estimate is based on the lower limit of the 95% confidence interval around the observed effect size, which corresponds to an observed power estimate of approximately 50%. Thus, this result means that there is a 95% probability that the true power of the study is not less than 2.5% when observed power is 50%.

The next figure illustrates power for the upper limit of the 95% confidence interval.

YM figure4

In this case, the non-central distribution and the central distribution overlap very little. Only 2.5% of the non-central distribution is on the left side of the criterion value, and power is 97.5%.   This finding means that there is a 95% probability that true power is not greater than 97.5% when observed power is 50%.
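
Both limits of this confidence interval for power can be reproduced with the same arithmetic as before (my own computation):

# Observed power at the limits of the effect size confidence interval (n = 100).
crit <- qnorm(.975)                 # 1.96
pnorm(.004 / .1 - crit)             # power at the lower limit, about .027
pnorm(.396 / .1 - crit)             # power at the upper limit, about .977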

Taken together, these results show that the 95% confidence interval around an observed power estimate of 50% ranges from 2.5% to 97.5%. As this interval covers pretty much the full range of possible values, it follows that observed power of 50% in a single study provides virtually no information about the true power of the study. True power can be anywhere between 2.5% and 97.5%.

The next figure illustrates confidence intervals for different levels of power.

YM figure5

The data are based on the same simulation as in the previous simulation study. The green line is based on computation of observed power for the d-values that correspond to the 95% confidence interval around the observed (simulated) d-values.

The figure shows that the confidence intervals for most observed power values are very wide. Accurate estimates of true power are possible only when power is high (upper right corner). But even 80% true power still has a wide confidence interval, with a lower bound below 20% observed power. Firm conclusions can only be drawn when observed power is high.

For example, when observed power is 95%, a one-sided 95% confidence interval (guarding only against underestimation) has a lower bound of 50% power. This finding would imply that observing power of 95% justifies the conclusion that the study had at least 50% power with an error rate of 5% (i.e., in 5% of the studies the true power is less than 50%).

The implication is that observed power is useless unless observed power is 95% or higher.

In conclusion, consideration of the effect of random sampling error on effect size estimates justifies Yuan and Maxwell's (2005) conclusion that the computation of observed power provides relatively little value. However, the reason is not that observed power is a problematic concept. The reason is that observed effect sizes in underpowered studies provide insufficient information to estimate observed power with any useful degree of accuracy. The same holds for the observed effect sizes that are routinely reported in research reports and for point estimates of effect sizes that are interpreted as evidence for small, moderate, or large effects. None of these statements are warranted when the confidence interval around these point estimates is taken into account. A study with d = .80 and a confidence interval of d = .01 to 1.59 does not justify the conclusion that a manipulation had a strong effect, because the observed effect size is largely influenced by sampling error.

In conclusion, studies with large sampling error (small sample sizes) are at best able to determine the sign of a relationship. Effects that are significantly positive are likely to be truly positive, and effects that are significantly negative are likely to be truly negative. However, the effect sizes in these studies are too strongly influenced by sampling error to provide information about the population effect size and therewith about parameters that depend on accurate estimation of population effect sizes, such as power.

Meta-Analysis of Observed Power

One solution to the problem of insufficient information in a single underpowered study is to combine the results of several underpowered studies in a meta-analysis.   A meta-analysis reduces sampling error because sampling error creates random variation in effect size estimates across studies and aggregation reduces the influence of random factors. If a meta-analysis of effect sizes can produce more accurate estimates of the population effect size, it would make sense that meta-analysis can also increase the accuracy of observed power estimation.

Yuan and Maxwell (2005) discuss meta-analysis of observed power only briefly.

YM figure6

A problem for a meta-analysis of observed power is that observed power is not only subject to random sampling error, but also systematically biased. As a result, the average of observed power across a set of studies would also be systematically biased. However, the source of the systematic bias is the asymmetrical distribution of observed power when true power is not 50%. To avoid this systematic bias, it is possible to compute the median instead. The median is unbiased because 50% of the non-central distribution is on the left side of the non-centrality parameter and 50% is on the right side. Thus, the median provides an unbiased estimate of the non-centrality parameter, and the estimate becomes increasingly accurate as the number of studies in a meta-analysis increases.
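
The argument for the median can be illustrated with a short simulation (my own sketch, again using the one-sample z-test framework with a one-tailed criterion of z = 1.96):

# Median vs. mean of observed power across many exact replication studies.
set.seed(123)
true.power <- .80
ncp <- qnorm(true.power) + 1.96                  # non-centrality that yields 80% power
k   <- 1000                                      # number of replication studies
obs.power <- pnorm(rnorm(k, mean = ncp, sd = 1) - 1.96)
mean(obs.power)      # systematically below .80 (the average is biased)
median(obs.power)    # close to .80 (the median is unbiased)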

The next figure shows the results of a simulation with the same 500 studies (25 sample sizes and 20 effect sizes) that were simulated earlier, but this time each study was simulated to be replicated 1,000 times and observed power was estimated by computing the average or the median power across the 1,000 exact replication studies.

YM figure7

Purple = average observed power;   Orange = median observed power

The simulation shows that Yuan and Maxwell’s (2005) bias formula predicts the relationship between true power and the average of observed power. It also confirms that the median is an unbiased estimator of true power and that observed power is a good estimate of true power when the median is based on a large set of studies. However, the question remains whether observed power can estimate true power when the number of studies is smaller.

The next figure shows the results for a simulation where estimated power is based on the median observed power in 50 studies. The maximum discrepancy in this simulation was 15 percentage points. This level of precision is clearly sufficient to distinguish low powered studies (< 50% power) from high powered studies (> 80%).

YM figure8

To obtain confidence intervals for median observed power estimates, the power estimate can be converted into the corresponding non-centrality parameter of a standard normal distribution. The standard error of this estimate is the standard deviation divided by the square root of the number of studies, and the standard deviation of a standard normal distribution equals 1. Hence, the 95% confidence interval for a set of k studies is defined by

Lower Limit = Normal(InverseNormal(power) – 1.96 / sqrt(k))

Upper Limit = Normal(InverseNormal(power) + 1.96 / sqrt(k))
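
In R, these limits can be computed directly with qnorm and pnorm (my own sketch; the worked example with a median estimate of 35% power from 50 studies is only an illustration):

# 95% confidence interval for a median observed power estimate based on k studies.
power.ci <- function(power, k) {
  z <- qnorm(power)                       # convert power to a non-centrality parameter
  c(lower = pnorm(z - 1.96 / sqrt(k)),
    upper = pnorm(z + 1.96 / sqrt(k)))
}
power.ci(.35, 50)                         # roughly .25 to .46, clearly below 80% power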

Interestingly, the number of observations in each study is irrelevant. The reason is that power estimation operates on the scale of the non-centrality parameter (effect size divided by sampling error), and on this scale the sampling standard deviation is always 1, regardless of sample size. Larger samples produce smaller confidence intervals around an effect size estimate and increase power at the same time; to hold power constant, the effect size has to decrease as the sample size increases. As a result, observed power estimates do not become more precise when sample sizes increase and effect sizes decrease proportionally. Only the number of studies matters.

The next figure shows simulated data for 1,000 studies with 20 effect sizes (.05 to 1) and 25 sample sizes (n = 10 to 250). Each study was repeated 50 times and the median value was used to estimate true power. The green lines are the 95% confidence interval around the true power value. In real data, the confidence interval would be drawn around the observed power estimate, but observed power is not a smooth mathematical function of true power, so the interval is drawn around true power here. The 95% confidence interval around the true power values is still useful because it shows how much observed power estimates can deviate from true power: 95% of observed power values are expected to fall within the area that is defined by the lower and upper bound of the confidence interval. The figure shows that most values are within the area. This confirms that sampling error in a meta-analysis of observed power is a function of the number of studies. The figure also shows that sampling error is greatest when power is 50%. In the tails of the distribution, range restriction produces more precise estimates more quickly.

YM figure9

With 50 studies, the maximum absolute discrepancy is 15 percentage points. This level of precision is sufficient to draw broad conclusions about the power of a set of studies. For example, any median observed power estimate below 65% is sufficient to reveal that a set of studies had less power than Cohen’s recommended level of 80% power. A value of 35% would strongly suggest that a set of studies was severely underpowered.

Conclusion

Yuan and Maxwell (2005) provided a detailed statistical examination of observed power. They concluded that observed power typically provides little to no useful information about the true power of a single study. The main reason for this conclusion was that sampling error in studies with low power is too large to estimate true power with sufficient precision. The only precise estimate of power can be obtained when sampling error is small and effect sizes are large. In this case, power is near the maximum value of 1 and observed power correctly estimates true power as being close to 1. Thus, observed power can be useful when it suggests that a study had high power.

Yuan and Maxwell (2005) also showed that observed power is systematically biased unless true power is 50%. The amount of bias is relatively small, and even without this systematic bias, the amount of random error is so large that observed power estimates based on a single study cannot be trusted.

Unfortunately, Yuan and Maxwell’s (2005) article has been misinterpreted as evidence that observed power calculations are inherently biased and useless. However, observed power can provide useful and unbiased information in a meta-analysis of several studies. First, a meta-analysis can provide unbiased estimates of power because the median value is an unbiased estimator of power. Second, aggregation across studies reduces random sampling error, just like aggregation across studies reduces sampling error in meta-analyses of effect sizes.

Implications

The demonstration that median observed power provides useful information about true power is important because observed power has become a valuable tool in the detection of publication bias and other biases that lead to inflated estimates of effect sizes. Starting with Sterling, Rosenbaum, and Weinkam's (1995) seminal article, observed power has been used by Ioannidis and Trikalinos (2007), Schimmack (2012), Francis (2012), Simonsohn (2014), and van Assen, van Aert, and Wicherts (2014) to draw inferences about a set of studies with the help of a posteriori power analysis. The methods differ in the way observed data are used to estimate power, but they all rely on the assumption that observed data provide useful information about the true power of a set of studies. This blog post shows that Yuan and Maxwell's (2005) critical examination of observed power does not undermine the validity of statistical approaches that rely on observed data to estimate power.

Future Directions

This blog post focussed on meta-analyses of exact replication studies that have the same population effect size and the same sample size (sampling error). It also assumed that the set of studies is representative. An important challenge for future research is to examine the statistical properties of observed power when power varies across studies (heterogeneity) and when publication bias and other biases are present. A major limitation of existing methods is that they assume a fixed population effect size (Ioannidis & Trikalinos, 2007; Francis, 2012; Simonsohn, 2014; van Assen, van Aert, & Wicherts, 2014). At present, the Incredibility Index (Schimmack, 2012) and the R-Index (Schimmack, 2014) have been proposed as methods for sets of studies that are biased and heterogeneous. An important goal for future research is to evaluate these methods in simulation studies with heterogeneous and biased sets of data.