All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

When Exact Replications Are Too Exact: The Lucky-Bounce-Test for Pairs of Exact Replication Studies

Imagine an NBA player has an 80% chance to make one free throw. What is the chance that he makes both free throws? The correct answer is 64% (80% * 80%).

Now consider the possibility that it is possible to distinguish between two types of free throws. Some free throws are good; they don’t touch the rim and make a swishing sound when they go through the net (all net). The other free throws bounce off the rim and go in (rattling in).

Given that a free throw by an NBA player with an 80% free throw percentage goes in, is it more likely to be all net or to rattle in? A perfect free throw is more likely, because a free throw that rattles in could easily have bounced the wrong way, which would lower the free throw percentage. To achieve an 80% free throw percentage, most free throws have to be close to perfect.

Let’s say the probability of hitting the rim and going in is 30%. With an 80% free throw average, this means that the majority of free throws are in the close-to-perfect category (20% misses, 30% rattle-in, 50% close-to-perfect).

What does this have to do with science? A lot!

The reason is that the outcome of a scientific study is a bit like throwing free throws. One factor that contributes to a successful study is skill (making correct predictions, avoiding experimenter errors, and conducting studies with high statistical power). However, another factor is random (a lucky or unlucky bounce).

The concept of statistical power is similar to an NBA player’s free throw percentage. A researcher who conducts studies with 80% statistical power is going to have an 80% success rate (that is, if all predictions are correct). In the remaining 20% of studies, a study will not produce a statistically significant result, which is equivalent to missing a free throw and not getting a point.

Many years ago, Jacob Cohen observed that researchers often conduct studies with relatively low power to produce a statistically significant result. Let’s just assume for now that a researcher conducts studies with 60% power. This means researchers would be like NBA players with a 60% free-throw average.

Now imagine that researchers have to demonstrate an effect not only once, but also a second time in an exact replication study. That is, researchers have to make two free throws in a row. With 60% power, the probability of getting two significant results in a row is only 36% (60% * 60%). Moreover, many of the free throws that are made rattle in rather than being all net. The percentages are about 40% misses, 30% rattling in, and 30% all net.

One major difference between NBA players and scientists is that NBA players have to demonstrate their abilities in front of large crowds and TV cameras, whereas scientists conduct their studies in private.

Imagine an NBA player could just go into a private room, throw two free throws, and then report back how many free throws he made, and the outcome of these free throws determines who wins game 7 of the playoff finals. Would you trust the player to tell the truth?

If you would not trust the NBA player, why would you trust scientists to report failed studies? You should not.

It can be demonstrated statistically that scientists are reporting more successes than the power of their studies would justify (Sterling et al., 1995; Schimmack, 2012). Amongst scientists this fact is well known, but the general public may not fully appreciate the fact that a pair of exact replication studies with significant results is often just a selection of studies that included failed studies that were not reported.

Fortunately, it is possible to use statistics to examine whether the results of a pair of studies are likely to be honest or whether failed studies were excluded. The reason is that an amateur is not only more likely to miss a free throw. An amateur is also less likely to make a perfect free throw.

Based on the theory of statistical power developed by Neyman and Pearson and popularized by Jacob Cohen, it is possible to make predictions about the relative frequency of p-values in the non-significant (failure), just significant (rattling in), and highly significant (all net) ranges.

As with made free throws, the distinction between lucky and clear successes is somewhat arbitrary because power is continuous. A study with a p-value of .0499 is very lucky because p = .051 would have been not significant (rattled in after three bounces on the rim). A study with p = .000001 is a clear success. Lower p-values are better, but where to draw the line?

As it turns out, Jacob Cohen’s recommendation to conduct studies with 80% power provides a useful criterion to distinguish lucky outcomes and clear successes.

Imagine a scientist conducts studies with 80% power. The distribution of observed test statistics (e.g., z-scores) shows that this researcher has a 20% chance to get a non-significant result, a 30% chance to get a lucky significant result (p-value between .050 and .005), and a 50% chance to get a clear significant result (p < .005). If the 20% failed studies are hidden, the percentages of results that rattled in versus results that are all net are 37% vs. 63%. However, if true power is just 20% (an amateur), 80% of studies fail, 15% rattle in, and 5% are clear successes. If the 80% failed studies are hidden, only 25% of the successful studies are all net and 75% rattle in.

One problem with using this test to draw conclusions about the outcome of a pair of exact replication studies is that true power is unknown. To avoid this problem, it is possible to compute the maximum probability of a rattling-in result. As it turns out, the optimal true power to maximize the percentage of lucky outcomes is 66% power. With true power of 66%, one would expect 34% misses (p > .05), 32% lucky successes (.050 < p < .005), and 34% clear successes (p < .005).

[Figure: Lucky-Bounce-Test]

For a pair of exact replication studies, this means that there is only a 10% chance (32% * 32%) to get two rattle-in successes in a row. In contrast, there is a 90% chance that misses were not reported or that an honest report of successful studies would have produced at least one all-net result (z > 2.8, p < .005).
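
These percentages can be reproduced with a few lines of R. The sketch below assumes a two-sided z-test in which true power fixes the mean of the observed z-scores; it is not necessarily the exact procedure behind the figure, but it recovers the reported values approximately.

# Probabilities of misses (p > .05), lucky successes (.005 < p < .05),
# and clear successes (p < .005) as a function of true power.
p.ranges <- function(power, alpha = .05, strict = .005) {
  z.crit   <- qnorm(1 - alpha / 2)              # 1.96 for p = .05, two-tailed
  z.strict <- qnorm(1 - strict / 2)             # 2.81 for p = .005, two-tailed
  ncp <- z.crit + qnorm(power)                  # mean of observed z-scores implied by power
  miss  <- pnorm(z.crit, mean = ncp)            # p > .05 (tiny lower rejection tail ignored)
  lucky <- pnorm(z.strict, mean = ncp) - miss   # .050 > p > .005
  clear <- 1 - pnorm(z.strict, mean = ncp)      # p < .005
  c(miss = miss, lucky = lucky, clear = clear)
}

round(p.ranges(.80), 2)          # roughly .20, .30, .50, as stated above
round(p.ranges(.20), 2)          # roughly .80, .15, .05

# true power that maximizes the probability of a lucky (just significant) result
grid  <- seq(.05, .95, by = .01)
lucky <- sapply(grid, function(x) p.ranges(x)["lucky"])
grid[which.max(lucky)]           # close to .66
round(max(lucky)^2, 2)           # about 1 in 10: two lucky results in a row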

Example: Unconscious Priming Influences Behavior

I used this test to examine a famous and controversial set of exact replication studies. In Bargh, Chen, and Burrows (1996), Dr. Bargh reported two exact replication studies (studies 2a and 2b) that showed an effect of a subtle priming manipulation on behavior. Undergraduate students were primed with words that are stereotypically associated with old age. The researchers then measured the walking speed of primed participants (n = 15) and participants in a control group (n = 15).

The two studies were not only exact replications of each other; they also produced very similar results. Most readers probably expected this outcome because similar studies should produce similar results, but this false belief ignores the influence of random factors that are not under the control of a researcher. We do not expect lotto winners to win the lottery again because it is an entirely random and unlikely event. Experiments are different because there could be a systematic effect that makes a replication more likely, but in studies with low power results should not replicate exactly because random sampling error influences results.

Study 1: t(28) = 2.86, p = .008 (two-tailed), z = 2.66, observed power = 76%
Study 2: t(28) = 2.16, p = .039 (two-tailed), z = 2.06, observed power = 54%

The median power of these two studies is 65%. However, even if median power were lower or higher, the maximum probability of obtaining two p-values in the range between .050 and .005 remains just 10%.

Although this study has been cited over 1,000 times, replication studies are rare.

One of the few published replication studies was reported by Cesario, Plaks, and Higgins (2006). Naïve readers might take the significant results in this replication study as evidence that the effect is real. However, this study produced yet another lucky success.

Study 3: t(62) = 2.41, p = .019, z = 2.35, observed power = 65%.
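
The conversion from a reported t-value to a two-tailed p-value, a z-score, and observed power can be checked with a short R sketch (using the normal approximation for observed power; small differences from the reported values are due to rounding):

# Convert a reported t-value into a two-tailed p-value, a z-score, and observed power.
obs.power <- function(t, df, alpha = .05) {
  p <- 2 * pt(abs(t), df, lower.tail = FALSE)   # two-tailed p-value
  z <- qnorm(1 - p / 2)                         # z-score with the same two-tailed p-value
  power <- pnorm(z - qnorm(1 - alpha / 2))      # observed power (normal approximation)
  round(c(p = p, z = z, obs.power = power), 3)
}

obs.power(2.86, 28)   # Study 1: p = .008, z = 2.66, observed power = 76% (as reported)
obs.power(2.16, 28)   # Study 2: p = .039, z = 2.06, observed power = 54%
obs.power(2.41, 62)   # Study 3: p = .019, z = 2.35, observed power = 65%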

The chance of obtaining three lucky successes in a row is only 3% (32% * 32% * 32%). Moreover, with a median observed power of 65% and a reported success rate of 100%, the success rate is inflated by 35%. This suggests that the true power of the reported studies is considerably lower than the observed power of 65% and that observed power is inflated because failed studies were not reported.

The R-Index corrects for inflation by subtracting the inflation rate from observed power (65% – 35%). This means the R-Index for this set of published studies is 30%.
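
Spelled out in R, the computation looks like this (a minimal sketch using the observed power values and the 100% success rate reported above):

# R-Index for the three published priming studies discussed above.
observed.power <- c(.76, .54, .65)              # observed power of the three studies
success.rate   <- 1.00                          # all three reported results were significant
median.power   <- median(observed.power)        # .65
inflation      <- success.rate - median.power   # .35
r.index        <- median.power - inflation      # .30
r.index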

This R-Index can be compared to several benchmarks.

An R-Index of 22% is consistent with the null-hypothesis being true when failed attempts are not reported.

An R-Index of 40% is consistent with 30% true power when failed attempts are not reported.

It is therefore not surprising that other researchers were not able to replicate Bargh’s original results, even though they increased statistical power by using larger samples (Pashler et al., 2011; Doyen et al., 2011).

In conclusion, it is unlikely that Dr. Bargh’s original studies were the only studies that his lab conducted. In an interview, Dr. Bargh revealed that the studies were conducted in 1990 and 1991 and that they conducted additional studies until the publication of the two studies in 1996. Dr. Bargh did not reveal how many studies they conducted over the span of 5 years and how many of these studies failed to produce significant evidence of priming. If Dr. Bargh himself conducted studies that failed, it would not be surprising that others also failed to replicate the published results. However, in a personal email, Dr. Bargh assured me that “we did not as skeptics might presume run many studies and only reported the significant ones. We ran it once, and then ran it again (exact replication) in order to make sure it was a real effect.”

With a 10% probability, it is possible that Dr. Bargh was indeed lucky to get two rattling-in findings in a row. However, his aim to demonstrate the robustness of an effect by trying to show it again in a second small study is misguided. The reason is that it is highly likely that the effect will not replicate or that the first study was already a lucky finding after some failed pilot studies. Underpowered studies cannot provide strong evidence for the presence of an effect, and conducting multiple underpowered studies reduces the credibility of successes because the probability that this outcome occurs even when an effect is present decreases with each study (Schimmack, 2012).

Moreover, even if Bargh was lucky to get two rattling-in results in a row, others will not be so lucky, and it is likely that many other researchers tried to replicate this sensational finding but failed to do so. Thus, publishing lucky results hurts science nearly as much as the failure to report failed studies by the original author.

In his response to a published failed replication study by Doyen, Dr. Bargh also failed to realize how lucky he was to obtain his original results. Rather than acknowledging that failures of replication are to be expected, Dr. Bargh criticized the replication study on methodological grounds. There would be a simple solution to test Dr. Bargh’s hypothesis that he is a better researcher and that his results are replicable when the study is properly conducted: he should demonstrate that he can replicate the result himself.

In an interview, Tom Bartlett asked Dr. Bargh why he didn’t conduct another replication study to demonstrate that the effect is real. Dr. Bargh’s response was that “he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.” The problem for Dr. Bargh is that there is no reason to believe his original results, either. Two rattling-in results alone do not constitute evidence for an effect, especially when this result could not be replicated in an independent study.

NBA players have to make free throws in front of a large audience for a free throw to count. If Dr. Bargh wants his findings to count, he should demonstrate his famous effect in an open replication study. To avoid embarrassment, it would be necessary to increase the power of the replication study because it is highly unlikely that even Dr. Bargh can continuously produce significant results with samples of N = 30 participants. Even if the effect is real, sampling error is simply too large to demonstrate the effect consistently.

Knowledge about statistical power is power. Knowledge about post-hoc power can be used to detect incredible results. Knowledge about a priori power can be used to produce credible results.

Swish!


The Association for Psychological Science Improves Success Rate from 95% to 100% by Dropping Hypothesis Testing: The Sample Mean is the Sample Mean, Type-I Error 0%

The editor of Psychological Science published an Editorial with the title “Business Not as Usual.” (see also Observer interview and new Submission Guidelines) The new submission guidelines recommend the following statistical approach.

Effective January 2014, Psychological Science recommends the use of the “new statistics”—effect sizes, confidence intervals, and meta-analysis—to avoid problems associated with null-hypothesis significance testing (NHST). Authors are encouraged to consult this Psychological Science tutorial by Geoff Cumming, which shows why estimation and meta-analysis are more informative than NHST and how they foster development of a cumulative, quantitative discipline. Cumming has also prepared a video workshop on the new statistics that can be found here.

The editorial is a response to the current crisis in psychology that many findings cannot be replicated and the discovery that numerous articles in Psychological Science show clear evidence of reporting biases that lead to inflated false-positive rates and effect sizes (Francis, 2013).

The editorial is titled “Business not as usual.”  So what is the radical response that will ensure increased replicability of results published in Psychological Science? One solution is to increase transparency and openness to discourage the use of deceptive research practices (e.g., not publishing undesirable results or selective reporting of dependent variables that showed desirable results). The other solution is to abandon null-hypothesis significance testing.

Problem of the Old Statistics: Researchers had to demonstrate that their empirical results could have occurred only with a 5% probability if there is no effect in the population.

Null-hypothesis testing has been the main method to relate theories to empirical data. An article typically first states a theory and then derives a theoretical prediction from it. The theoretical prediction is then used to design a study that can test it. The prediction is tested by computing the ratio of the effect size to sampling error (the signal-to-noise ratio). The next step is to determine the probability of obtaining the observed signal-to-noise ratio, or an even more extreme one, under the assumption that the true effect size is zero. If this probability is smaller than a criterion value, typically p < .05, the results are interpreted as evidence that the theoretical prediction is true. If the probability does not meet the criterion, the data are considered inconclusive.

However, non-significant results are irrelevant because Psychological Science is only interested in publishing research that supports innovative novel findings. Nobody wants to know that drinking fennel tea does not cure cancer, but everybody wants to know about a treatment that actually cures cancer. So, the main objective of statistical analyses was to provide empirical evidence for a predicted effect by demonstrating that an obtained result would occur only with a 5% probability if the hypothesis were false.

Solution to the problem of Significance Testing: Drop the Significance Criterion. Just report your sample mean and the 95% confidence interval around it.

[Figure: No Need For Null]

Eich claims that “researchers have recognized … essential problems with NHST in general, and with the dichotomous thinking (“significant” vs. “non-significant”) it engenders in particular.” It is true that statisticians have been arguing about the best way to test theoretical predictions with empirical data. In fact, they are still arguing. Thus, it is interesting to examine how Psychological Science found a solution to the elusive problem of statistical inference. The answer is to avoid statistical inferences altogether and to avoid dichotomous thinking. Does fennel tea cure cancer? Maybe, 95%CI d = -.4 to d = +4. No need to test for statistical significance. No need to worry about inadequate sample sizes. Just do a study and report your sample means with a confidence interval. It is that easy to fix the problems of psychological science.
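
For concreteness, here is a minimal sketch of what such a report amounts to for a hypothetical two-group study; the sample size and true effect are made up for illustration, and the standard error of d uses the common large-sample approximation:

# Hypothetical two-group study: report only the standardized mean difference and its 95% CI.
set.seed(1)
n <- 20                                    # per group (made-up sample size)
treatment <- rnorm(n, mean = .2)           # assumed true effect of d = .2
control   <- rnorm(n, mean = 0)
d  <- (mean(treatment) - mean(control)) / sqrt((var(treatment) + var(control)) / 2)
se <- sqrt(2 / n)                          # large-sample approximation of the SE of d
ci <- d + c(-1, 1) * qnorm(.975) * se      # 95% confidence interval
round(c(d = d, lower = ci[1], upper = ci[2]), 2)
# With n = 20 per group the interval spans about 1.2 standard deviations, so it is
# compatible with zero, small negative effects, and large positive effects alike.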

The problem is that every study produces a sample mean and a confidence interval. So, how do the editors of Psychological Science pick the 5% of submitted manuscripts that will be accepted for publication? Eich lists three criteria.

  1. What will the reader of this article learn about psychology that he or she did not know (or could not have known) before?

The effect of manipulation X on dependent variable Y is d = .2, 95%CI = -.2 to .6. We can conclude from this result that it is unlikely that the manipulation leads to a moderate decrease or a strong increase in the dependent variable Y.

  2. Why is that knowledge important for the field?

The finding that the experimental manipulation of Y in the laboratory is somewhat more likely to produce an increase than a decrease, but could also have no effect at all has important implications for public policy.

  3. How are the claims made in the article justified by the methods used?

The claims made in this article are supported by the use of Cumming’s New Statistics. Based on a precision analysis, the sample size was N = 100 (n = 50 per condition) to achieve a precision of .4 standard deviations. The study was preregistered and the data are publicly available with the code to analyze the data (SPSS: T-TEST GROUPS=x(1,2) /VARIABLES=y.).

If this sounds wrong to you and you are a member of APS, you may want to write to Eric Eich and ask for better guidelines that can be used to evaluate whether a sample mean or two or three or four sample means should be published in your top journal.

R-INDEX BULLETIN (RIB): Share the Results of your R-Index Analysis with the Scientific Community


R-Index Bulletin

The world of scientific publishing is changing rapidly and there is a growing need to share scientific information as quickly and as cheaply as possible.

Traditional journals with pre-publication peer-review are too slow and too focused on major ground-breaking discoveries.

Open-access journals can be expensive.

R-Index Bulletin offers a new opportunity to share results with the scientific community quickly and free of charge.

R-Index Bulletin also avoids the problems of pre-publication peer-review by moving to a post-publication peer-review process. Readers are welcome to comment on posted contributions and to post their own analyses. This process ensures that scientific disputes and their resolution are open and part of the scientific process.

For the time being, submissions can be uploaded as comments to this blog. In the future, R-Index Bulletin may develop into a free online journal.

A submission should contain a brief description of the research question (e.g., what is…


Power Analysis for Bayes-Factor: What is the Probability that a Study Produces an Informative Bayes-Factor?

In 1962, Jacob Cohen warned fellow psychologists about the problem of conducting studies with insufficient statistical power to demonstrate predicted effects. The problem is simple enough. An underpowered study has only a small chance to produce the correct result; that is, a statistically significant result when an effect is present.

Many researchers have ignored Cohen’s advice to conduct studies with at least 80% power, that is, an 80% probability to produce the correct result when an effect is present, because underpowered studies seemed like an acceptable gamble. Rather than conducting a single powerful study with 80% power, it seemed less risky to conduct three underpowered studies with 30% power. The chances of getting a significant result are similar (the power to get a significant result in at least 1 out of 3 studies with 30% power is 66%). Moreover, the use of smaller samples is even less problematic if a study tests multiple hypotheses. With 80% power to detect a single effect, a study with two hypotheses has a 96% probability that at least one of the two effects will produce a significant result. Three studies with two hypotheses each allow for six hypothesis tests. With 30% power per test, the probability of obtaining at least one significant result in six attempts is 88%. Smaller samples also provide additional opportunities to produce significant results by increasing sample sizes until a significant result is obtained (optional stopping) or by eliminating outliers, because these questionable practices have larger effects on the results in smaller samples. Thus, for a long time researchers did not feel a need to conduct adequately powered studies because there was no shortage of significant results to report (Schimmack, 2012).

Psychologists have ignored the negative consequences of relying on underpowered studies to support their conclusions. The problem is that the reported p-values are no longer valid. A significant result that was obtained by conducting three studies no longer has only a 5% chance of being a random event. By playing the sampling-error lottery three times, the probability of obtaining a significant result by chance alone is roughly 15%. By conducting three studies with two hypothesis tests each, the probability of obtaining a significant result by chance alone is roughly 30%. When researchers use questionable research practices, the probability of obtaining a significant result by chance can increase further. As a result, a significant result no longer provides strong statistical evidence that the result was not just a random event.
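
The arithmetic in the last two paragraphs follows from the formula 1 - (1 - p)^k for the probability of at least one significant result in k independent attempts; the exact family-wise error rates are slightly below the additive approximations used above:

# Probability of at least one significant result in k independent attempts.
at.least.one <- function(p, k) 1 - (1 - p)^k

round(at.least.one(.30, 3), 2)   # three studies with 30% power: .66
round(at.least.one(.80, 2), 2)   # one study, two hypotheses, 80% power each: .96
round(at.least.one(.30, 6), 2)   # three studies with two hypotheses each: .88

# The same formula gives the chance of at least one false positive when all
# null-hypotheses are true (exact values for the 15% and 30% figures above).
round(at.least.one(.05, 3), 2)   # .14
round(at.least.one(.05, 6), 2)   # .26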

It would be easy to distinguish real effects from type-I errors (significant results when the null-hypothesis is true) by conducting replication studies. Even underpowered studies with 30% power will replicate in about every third study. In contrast, when the null-hypothesis is true, type-I errors will replicate only in 1 out of 20 studies when the criterion is set to 5%. This is what a 5% criterion means: there is only a 5% chance (1 out of 20) to get a significant result when the null-hypothesis is true. However, this self-correcting mechanism failed because psychologists considered failed replication studies uninformative. The perverse logic was that failed replications are to be expected because studies have low power. After all, if a study has only 30% power, a non-significant result is more likely than a significant result. So, non-significant results in underpowered studies cannot be used to challenge a significant result in an underpowered study. By this perverse logic, even false hypotheses will accumulate only supporting evidence because only significant results are reported, no matter whether an effect is present or not.

The perverse consequences of abusing statistical significance tests became apparent when Bem (2011) published 10 studies that appeared to demonstrate that people can anticipate random future events and that practicing for an exam after writing an exam can increase grades. These claims were so implausible that few researchers were willing to accept Bem’s claims despite his presentation of 9 significant results in 10 studies. Although the probability that this event occurred by chance alone is less than 1 in a billion, few researchers felt compelled to abandon the null-hypothesis that studying for an exam today can increase performance on yesterday’s exam. In fact, most researchers knew all too well that these results could not be trusted because they were aware that published results are not an honest report of what happens in a lab. Thus, a much more plausible explanation for Bem’s incredible results was that he used questionable research practices to obtain significant results. Consistent with this hypothesis, closer inspection of Bem’s results shows statistical evidence that Bem used questionable research practices (Schimmack, 2012).

As the negative consequences of underpowered studies have become more apparent, interest in statistical power has increased. Computer programs make it easy to conduct power analysis for simple designs. However, so far power analysis has been limited to conventional statistical methods that use p-values and a criterion value to draw conclusions about the presence of an effect (Neyman-Pearson Significance Testing, NPST).

Some researchers have proposed Bayesian statistics as an alternative approach to hypothesis testing. As far as I know, these researchers have not provided tools for the planning of sample sizes. One reason is that Bayesian statistics can be used with optional stopping. That is, a study can be terminated early when a criterion value is reached. However, an optional stopping rule also needs a rule for when data collection will be terminated if the criterion value is not reached. It may sound appealing to be able to finish a study at any moment, but if this event is unlikely to occur in a reasonably sized sample, the study would produce an inconclusive result. Thus, even Bayesian statisticians may be interested in the effect of sample size on the ability to obtain a desired Bayes-Factor. For this reason, I wrote some R code to conduct power analysis for Bayes-Factors.

The code uses the BayesFactor package in R for the default Bayesian t-test (see also the blog post on the Replication-Index blog). The code is posted at the end of this post. Here I present results for typical sample sizes in the between-subject design for effect sizes ranging from 0 (the null-hypothesis is true) to Cohen’s d = .5 (a moderate effect). Larger effect sizes are not reported because large effects are relatively easy to detect.

The first table shows the percentage of studies that meet a specified criterion value based on 10,000 simulations of a between-subject design. For Bayes-Factors the criterion values are 3 and 10. For p-values the criterion values are .05, .01, and .001. For Bayes-Factors, a higher number provides stronger support for a hypothesis. For p-values, lower values provide stronger support for a hypothesis. For p-values, percentages correspond to the power of a study. Bayesian statistics has no equivalent concept, but percentages can be used in the same way. If a researcher aims to provide empirical support for a hypothesis with a Bayes-Factor greater than 3 or 10, the table gives the probability of obtaining the desired outcome (success) as a function of the effect size and sample size.

d     n     N    BF=3   BF=10   p=.05   p=.01   p=.001
.5    20    40    17      06      31      11      02
.4    20    40    12      03      22      07      01
.3    20    40    07      02      14      04      00
.2    20    40    04      01      09      02      00
.1    20    40    02      00      06      01      00
.0    20    40    33      00      95      99     100

For an effect size of zero, the interpretation of the results switches. Bayes-Factors of 1/3 or 1/10 are interpreted as evidence for the null-hypothesis. The table shows how often Bayes-Factors provide support for the null-hypothesis as a function of the effect size, which is zero, and sample size. For p-values, the percentage is 1 minus the criterion value. That is, when the effect is zero, the p-value will correctly show a non-significant result with a probability of 1 minus the criterion value, and it will falsely reject the null-hypothesis with the specified type-I error rate.

Typically, researchers do not interpret non-significant results as evidence for the null-hypothesis. However, it is possible to interpret non-significant results in this way, but it is important to take the type-II error rate into account. Practically, it makes little difference whether a non-significant result is not interpreted or whether it is taken as evidence for the null-hypothesis with a high type-II error probability. To illustrate this consider a study with N = 40 (n = 20 per group) and an effect size of d = .2 (a small effect). As there is a small effect, the null-hypothesis is false. However, the power to detect this effect in a small sample is very low. With p = .05 as the criterion, power is only 9%. As a result, there is a 91% probability to end up with a non-significant result even though the null-hypothesis is false. This probability is only slightly lower than the probability to get a non-significant result when the null-hypothesis is true (95%). Even if the effect size were d = .5, a moderate effect, power is only 31% and the type-II error rate is 69%. With type-II error rates of this magnitude, it makes practically no difference whether a null-hypothesis is accepted with a warning that the type-II error rate is high or whether the non-significant result is simply not interpreted because it provides insufficient information about the presence or absence of small to moderate effects.
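
These power values can be checked with base R's power.t.test, which computes exact noncentral-t power for a two-sided, two-sample test; the results are close to, though not identical with, the simulated percentages in the tables:

# Exact power of a two-sided, two-sample t-test with n = 20 per group (N = 40).
power.t.test(n = 20, delta = .2, sd = 1, sig.level = .05)$power   # about .09
power.t.test(n = 20, delta = .5, sd = 1, sig.level = .05)$power   # about .34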

The main observation in Table 1 is that small samples provide insufficient information to distinguish between the null-hypothesis and small to moderate effects. Small studies with N = 40 are only meaningful to demonstrate the presence of moderate to large effects, but they have insufficient power to show effects and insufficient power to show the absence of effects. Even when the null-hypothesis is true, a Bayes-Factor of 3 is reached only 33% of the time. A Bayes-Factor of 10 is never reached because the sample size is too small to provide such strong evidence for the null-hypothesis when the null-hypothesis is true. Even more problematic is that a Bayes-Factor of 3 is reached only 17% of the time when a moderate effect is present. Thus, the most likely outcome in small samples is an inconclusive result unless a strong effect is present. This means that Bayes-Factors in these studies have the same problem as p-values. They can only provide evidence that an effect is present when a strong effect is present, but they cannot provide sufficient evidence for the null-hypothesis when the null-hypothesis is true.

d     n     N    BF=3   BF=10   p=.05   p=.01   p=.001
.5    50   100    49      29      68      43      16
.4    50   100    30      15      49      24      07
.3    50   100    34      18      56      32      12
.2    50   100    07      02      16      05      01
.1    50   100    03      01      08      02      00
.0    50   100    68      00      95      99     100

In Table 2 the sample size has been increased to N = 100 participants (n = 50 per cell). This is already a large sample size by past standards in social psychology. Moreover, in several articles Wagenmakers has implemented a stopping rule that terminates data collection at this point. The table shows that a sample size of N = 100 in a between-subject design has modest power to demonstrate even moderate effect sizes of d = .5 with a Bayes-Factor of 3 as a criterion (49%). In comparison, a traditional p-value of .05 would provide 68% power.

The main argument for using Bayesian statistics is that it can also provide evidence for the null-hypothesis. With a criterion value of BF = 3, the default test correctly favors the null-hypothesis 68% of the time (see the last row of the table). However, the sample size is too small to produce Bayes-Factors greater than 10. In sum, the default Bayesian t-test with N = 100 can be used to demonstrate the presence of moderate to large effects, and with a criterion value of 3 it can be used to provide evidence for the null-hypothesis when the null-hypothesis is true. However, it cannot provide evidence for small to moderate effects.

The Neyman-Pearson approach to significance testing would reveal this fact in terms of the type-II error rates associated with non-significant results. Using the .05 criterion, a non-significant result would be interpreted as evidence for the null-hypothesis. This conclusion is correct in 95% of all tests when the null-hypothesis is actually true. This is higher than the 68% rate for a Bayes-Factor of 3. However, the type-II error rates associated with this inference when the null-hypothesis is false are 32% for d = .5, 51% for d = .4, 44% for d = .3, 84% for d = .2, and 92% for d = .1. If we consider an effect size of d = .2 as important enough to be detected (a small effect size according to Cohen), the type-II error rate could be as high as 84%.

In sum, a sample size of N = 100 in a between-subject design is still insufficient to test for the presence of a moderate effect size (d = .5) with a reasonable chance to find it (80% power). Moreover, a non-significant result is unlikely to occur for moderate to large effect sizes, but the sample size is insufficient to discriminate accurately between the null-hypothesis and small to moderate effects. A Bayes-Factor greater than 3 in favor of the null-hypothesis is most likely to occur when the null-hypothesis is true, but it can also occur when a small effect is present (Simonsohn, 2015).

The next table increases the total sample size to 200 for a between-subject design. The pattern doesn’t change qualitatively. So the discussion will be brief and focus on the power of a study with 200 participants to provide evidence for small to moderate effects and to distinguish small to moderate effects from the null-hypothesis.

d     n     N    BF=3   BF=10   p=.05   p=.01   p=.001
.5   100   200    83      67      94      82      58
.4   100   200    60      41      80      59      31
.3   100   200    16      06      31      13      03
.2   100   200    13      06      29      12      03
.1   100   200    04      01      11      03      00
.0   100   200    80      00      95      95      95

Using Cohen’s guideline of an 80% success rate (power), a study with N = 200 participants has sufficient power to show a moderate effect of d = .5 with p = .05, p = .01, and Bayes-Factor = 3 as criterion values. For d = .4, only the criterion value of p = .05 has sufficient power. For all smaller effects, the sample size is still too small to have 80% power. A sample of N = 200 also provides 80% power to provide evidence for the null-hypothesis with a Bayes-Factor of 3. Power for a Bayes-Factor of 10 is still 0 because this value cannot be reached with N = 200. Finally, with N = 200, the type-II error rate for d = .5 is just above .05 (1 – .94 = .06). Thus, it is justified to conclude from a non-significant result, with a 6% error rate, that the true effect size cannot be moderate to large (d >= .5). However, type-II error rates for smaller effect sizes are too high to test the null-hypothesis against these effect sizes.

d     n     N    BF=3   BF=10   p=.05   p=.01   p=.001
.5   200   400    99      97     100      99      95
.4   200   400    92      82      98      92      75
.3   200   400    64      46      85      65      36
.2   200   400    27      14      52      28      10
.1   200   400    05      02      17      06      01
.0   200   400    87      00      95      99      95

The next sample size doubles the number of participants. The reason is that sampling error decreases only with the square root of the sample size, so large increases in sample size are needed to further decrease sampling error. A sample size of N = 200 yields a standard error of 2 / sqrt(200) = .14 (14/100 of a standard deviation). A sample size of N = 400 is needed to reduce this to .10 (2 / sqrt(400) = 2 / 20 = .10; 1/10 of a standard deviation). This is the reason why it is so difficult to find small effects.
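
The standard errors mentioned in this section all follow from SE = 2 / sqrt(N) for a standardized mean difference in a balanced between-subject design:

# Standard error of a standardized mean difference in a balanced two-group design.
N <- c(100, 200, 400, 800)
round(2 / sqrt(N), 2)            # .20 .14 .10 .07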

Even with N = 400, power is only sufficient to show effect sizes of .3 or greater with p = .05, or effect sizes of d = .4 with p = .01 or Bayes-Factor 3. Only d = .5 can be expected to meet the criterion p = .001 more than 80% of the time. Power for Bayes-Factors to show evidence for the null-hypothesis also hardly changed. It increased from 80% to 87% with Bayes-Factor = 3 as criterion. The chance to get a Bayes-Factor of 10 is still 0 because the sample size is too small to produce such extreme values. Using Neyman-Pearson’s approach with a 5% type-II error rate as criterion, it is possible to interpret non-significant results as evidence that the true effect size cannot be .4 or larger. With a 1% criterion it is possible to say that a moderate to large effect would produce a significant result 99% of the time and the null-hypothesis would produce a non-significant result 99% of the time.

Doubling the sample size to N = 800 reduces sampling error from SE = .1 to SE = .07.

d     n     N    BF=3   BF=10   p=.05   p=.01   p=.001
.5   400   800   100     100     100     100     100
.4   400   800   100      99     100     100      99
.3   400   800    94      86      99      95      82
.2   400   800    54      38      81      60      32
.1   400   800    09      04      17      06      01
.0   400   800    91      52      95      95      95

A sample size of N = 800 is sufficient to have 80% power to detect a small effect according to Cohen’s classification of effect sizes (d = .2) with p = .05 as criterion. Power to demonstrate a small effect with Bayes-Factor = 3 as criterion is only 54%. Power to demonstrate evidence for the null-hypothesis with Bayes-Factor = 3 as criterion increased only slightly from 87% to 91%, but a sample size of N = 800 is sufficient to produce Bayes-Factors greater than 10 in favor of the null-hypothesis 52% of the time. Thus, researchers who aim for this criterion value need to plan their studies with N = 800. Smaller samples cannot produce these values with the default Bayesian t-test. Following Neyman-Pearson, a non-significant result can be interpreted as evidence that the true effect cannot be larger than d = .3, with a type-II error rate of 1%.

Conclusion

A common argument in favor of Bayes-Factors has been that Bayes-Factors can be used to test the null-hypothesis, whereas p-values can only reject the null-hypothesis. There are two problems with this claim. First, it confuses Null-Hypothesis Significance Testing (NHST) and Neyman-Pearson Significance Testing (NPST). NPST also allows researchers to accept the null-hypothesis. In fact, it makes it easier to accept the null-hypothesis because every non-significant result favors the null-hypothesis. Of course, this does not mean that all non-significant results show that the null-hypothesis is true. In NPST the error of falsely accepting the null-hypothesis depends on the amount of sampling error. The tables here make it possible to compare Bayes-Factors and NPST. No matter which statistical approach is being used, it is clear that meaningful evidence for the null-hypothesis requires rather large samples. The R code below can be used to compute power for different criterion values, effect sizes, and sample sizes. Hopefully, this will help researchers to better plan sample sizes and to better understand Bayes-Factors that favor the null-hypothesis.

########################################################################
###      R-Code for Power Analysis for Bayes-Factors and P-Values    ###
########################################################################

## setup
library(BayesFactor)   # load BayesFactor package (provides ttest.tstat)
rm(list = ls())        # clear memory

## set parameters
nsim = 10000           # number of simulations
es = .5                # population effect size (Cohen's d); example value, vary as needed
n = 50                 # sample size per group; example value, vary as needed
groups = 2             # 2 = between-subject design (the Bayes-Factor step assumes two groups)
rsc = sqrt(2)/2        # r-scale of the default Bayesian t-test prior
BF10_crit = 3          # criterion value for BF favoring the effect (> 1 = favor effect)
BF01_crit = 3          # criterion value for BF favoring the null (> 1 = favor null)
p_crit = .05           # criterion value for the two-tailed p-value (e.g., .05)

## computations
Z <- matrix(rnorm(groups*n*nsim, mean = 0, sd = 1), nsim, groups*n)  # create observations
Z[, 1:n] <- Z[, 1:n] + es                                            # add effect size to group 1

tt <- function(x) {                                  # compute t-statistic (t-test)
  oes <- mean(x[1:n])                                # mean of group 1
  if (groups == 2) oes <- oes - mean(x[(n+1):(2*n)]) # mean difference for 2 groups
  oes <- oes / sd(x[1:(n*groups)])                   # observed effect size
  t <- abs(oes) / (groups / sqrt(n*groups))          # t-value
  return(t)
}

t <- apply(Z, 1, tt)                                 # t-values for all simulations
df <- rep(n*groups - groups, nsim)                   # degrees of freedom
p2t <- (1 - pt(abs(t), df)) * 2                      # two-tailed p-values

getBF <- function(x) {                               # Bayes-Factor from t and df
  t <- x[1]
  df <- x[2]
  bf <- exp(ttest.tstat(t, (df+2)/2, (df+2)/2, rscale = rsc)$bf)  # assumes two equal groups
  return(bf)
}

input <- cbind(t, df)                                             # combine t and df values
BF10 <- apply(input, 1, getBF)                                    # BF10 for all simulations
powerBF10 = length(subset(BF10, BF10 > BF10_crit))/nsim*100       # % results supporting effect
powerBF01 = length(subset(BF10, BF10 < 1/BF01_crit))/nsim*100     # % results supporting null
powerP = length(subset(p2t, p2t < p_crit))/nsim*100               # % significant, p < p-criterion

## output of results
cat(
" Power to support effect with BF10 >", BF10_crit, ": ", powerBF10,
"\n",
"Power to support null with BF01 >", BF01_crit, " : ", powerBF01,
"\n",
"Power to show effect with p < ", p_crit, " : ", powerP,
"\n")

A Critical Review of Cumming’s (2014) New Statistics: Reselling Old Statistics as New Statistics

Cumming (2014) wrote an article “The New Statistics: Why and How” that was published in the prestigious journal Psychological Science.   On his website, Cumming uses this article to promote his book “Cumming, G. (2012). Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. New York: Routledge.”

The article clearly states the conflict of interest: “The author declared that he earns royalties on his book (Cumming, 2012) that is referred to in this article.” Readers are therefore warned that the article may at least inadvertently give an overly positive account of the new statistics and an overly negative account of the old statistics. After all, why would anybody buy a book about new statistics if the old statistics were working just fine?

This blog post critically examines Cumming’s claim that his “new statistics” can solve endemic problems in psychological research that have created a replication crisis and that the old statistics are the cause of this crisis.

Like many other statisticians who are using the current replication crisis as an opportunity to sell their statistical approach, Cumming blames null-hypothesis significance testing (NHST) for the low credibility of research articles in Psychological Science (Francis, 2013).

In a nutshell, null-hypothesis significance testing entails 5 steps. First, researchers conduct a study that yields an observed effect size. Second, the sampling error of the design is estimated. Third, the ratio of the observed effect size and sampling error (signal-to-noise ratio) is computed to create a test-statistic (t, F, chi-square). The test-statistic is then used to compute the probability of obtaining the observed test-statistic or a larger one under the assumption that the true effect size in the population is zero (there is no effect or systematic relationship). The last step is to compare the test statistic to a criterion value. If the probability (p-value) is less than a criterion value (typically 5%), the null-hypothesis is rejected and it is concluded that an effect was present.
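
A minimal sketch of these five steps in R, with made-up simulated data:

# The five steps of NHST for a two-group comparison, with simulated example data.
set.seed(123)
group1 <- rnorm(30, mean = .4)              # step 1: collect data and compute the
group2 <- rnorm(30, mean = 0)               #         observed effect size
d  <- (mean(group1) - mean(group2)) / sqrt((var(group1) + var(group2)) / 2)
se <- sqrt(1/30 + 1/30)                     # step 2: estimate the sampling error of d
t  <- d / se                                # step 3: signal-to-noise ratio (test statistic)
p  <- 2 * pt(-abs(t), df = 58)              # step 4: probability if the true effect is zero
p < .05                                     # step 5: compare to the criterion value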

Cumming (2014) claims that we need a new way to analyze data because there is “renewed recognition of the severe flaws of null-hypothesis significance testing (NHST)” (p. 7). His new statistical approach has “no place for NHST” (p. 7). His advice is to “whenever possible, avoid using statistical significance or p values” (p. 8).

So what is wrong with NHST?

The first argument against NHST is that Ioannidis (2005) wrote an influential article with the eye-catching title “Why most published research findings are false” and most research articles use NHST to draw inferences from the observed results. Thus, NHST seems to be a flawed method because it produces mostly false results. The problem with this argument is that Ioannidis (2005) did not provide empirical evidence that most research findings are false, nor is this a particularly credible claim for all areas of science that use NHST, including particle physics.

The second argument against NHST is that researchers can use questionable research practices to produce significant results. This is not really a criticism of NHST, because researchers under pressure to publish are motivated to meet any criteria that are used to select articles for publication. A simple solution to this problem would be to publish all submitted articles in a single journal. As a result, there would be no competition for limited publication space in more prestigious journals. However, better studies would be cited more often and researchers will present their results in ways that lead to more citations. It is also difficult to see how psychology can improve its credibility by lowering standards for publication. A better solution would be to ensure that researchers are honestly reporting their results and report credible evidence that can provide a solid empirical foundation for theories of human behavior.

Cumming agrees: “To ensure integrity of the literature, we must report all research conducted to a reasonable standard, and reporting must be full and accurate” (p. 9). If a researcher conducted five studies with only a 20% chance to get a significant result and honestly reported all five studies, p-values would provide meaningful evidence about the strength of the evidence; namely, most p-values would be non-significant and show that the evidence is weak. Moreover, post-hoc power analysis would reveal that the studies had indeed low power to test a theoretical prediction. Thus, I agree with Cumming that honesty and research integrity are important, but I see no reason to abandon NHST as a systematic way to draw inferences from a sample about the population just because researchers have failed to disclose non-significant results in the past.

Cumming then cites a chapter by Kline (2014) that “provided an excellent summary of the deep flaws in NHST and how we use it” (p. 11). Apparently, the summary is so excellent that readers are better off reading the actual chapter, because Cumming does not explain what these deep flaws are. He then observes that “very few defenses of NHST have been attempted” (p. 11). He does not list a single reference. Here is one by a statistician: “In defence of p-values” (Murtaugh, 2014). In a response, Gelman agrees that the problem is more with the way p-values are used than with the p-value and NHST per se.

Cumming then states a single problem of NHST, namely that it forces researchers to make a dichotomous decision. If the signal-to-noise ratio is above a criterion value, the null-hypothesis is rejected and it is concluded that an effect is present. If the signal-to-noise ratio is below the criterion value, the null-hypothesis is not rejected. If Cumming has a problem with decision making, it would be possible to simply report the signal-to-noise ratio or the effect size that was observed in a sample. For example, mortality in an experimental Ebola drug trial was 90% in the control condition and 80% in the experimental condition. As this is the only evidence, it is not necessary to compute sampling error, signal-to-noise ratios, or p-values. Given all of the available evidence, the drug seems to improve survival rates. But wait. Now a dichotomous decision is made based on the observed mean difference, and there is no information about the probability that the results in the drug trial generalize to the population. Maybe the finding was a chance finding and the drug actually increases mortality. Should we really make a life-and-death decision based on the fact that 8 out of 10 patients died in one condition and 9 out of 10 patients died in the other condition?

Even in a theoretical research context decisions have to be made. Editors need to decide whether they accept or reject a submitted manuscript and readers of published studies need to decide whether they want to incorporate new theoretical claims in their theories or whether they want to conduct follow-up studies that build on a published finding. It may not be helpful to have a fixed 5% criterion, but some objective information about the probability of drawing the right or wrong conclusions seems useful.

Based on this rather unconvincing critique of p-values, Cumming (2014) recommends that “the best policy is, whenever possible, not to use NHST at all” (p. 12).

So what is better than NHST?

Cumming then explains how his new statistics overcome the flaws of NHST. The solution is simple. What is astonishing about the new statistics is that they use the exact same components as NHST, namely the observed effect size and sampling error.

NHST uses the ratio of the effect size and sampling error. When the ratio reaches a value of 2, p-values reach the criterion value of .05 and are considered sufficient to reject the null-hypothesis.

The new statistical approach is to multiply the standard error by a factor of 2 and to add and subtract this value from the observed mean. The interval from the lower value to the higher value is called a confidence interval. The factor of 2 was chosen to obtain a 95% confidence interval. However, drawing a confidence interval alone is not sufficient to draw conclusions from the data. Whether we describe the results in terms of a ratio, .5/.2 = 2.5, or in terms of a 95%CI = .5 +/- .4 (CI = .1 to .9), is not a qualitative difference. These are simply different ways to provide information about the effect size and sampling error. Moreover, it is arbitrary to multiply the standard error by a factor of 2. It would also be possible to multiply it by a factor of 1, 3, or 5. A factor of 2 is used to obtain a 95% confidence interval rather than a 20%, 50%, 80%, or 99% confidence interval. A 95% confidence interval is as arbitrary as a p-value of .05.
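
The equivalence of the two descriptions can be shown in a few lines of R, using the example values d = .5 and SE = .2 from above (the exact 95% interval uses 1.96 rather than 2 standard errors):

# A 95% confidence interval excludes zero exactly when the signal-to-noise ratio
# exceeds the critical value of about 2 (1.96).
d  <- .5                                   # observed effect size (example from the text)
se <- .2                                   # standard error (example from the text)
ci <- d + c(-1, 1) * qnorm(.975) * se      # 95% CI: about .11 to .89
z  <- d / se                               # signal-to-noise ratio: 2.5
p  <- 2 * pnorm(-abs(z))                   # two-tailed p-value: about .01
c(ci.excludes.zero = ci[1] > 0 | ci[2] < 0, significant = p < .05)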

So, how can a p-value be fundamentally wrong and how can a confidence interval be the solution to all problems if they provide the same information about effect size and sampling error? In particular how do confidence intervals solve the main problem of making inferences from an observed mean in a sample about the mean in a population?

To sell confidence intervals, Cumming uses a seductive example.

“I suggest that, once freed from the requirement to report p values, we may appreciate how simple, natural, and informative it is to report that “support for Proposition X is 53%, with a 95% CI of [51, 55],” and then interpret those point and interval estimates in practical terms” (p 14).

Support for proposition X is a rather unusual dependent variable in psychology. However, let us assume that Cumming refers to an opinion poll among psychologists about whether NHST should be abandoned. The response format is a simple yes/no format. The average in the sample is 53%. The null-hypothesis is 50%. The observed mean of 53% in the sample shows more responses in favor of the proposition. To compute a significance test or a confidence interval, we need to know the standard error. The confidence interval ranges from 51% to 55%. As the 95% confidence interval is defined by the observed mean plus/minus two standard errors, it is easy to see that the standard error is SE = (53-51)/2 = 1% or .01. The formula for the standard error in a one-sample test with a dichotomous dependent variable is sqrt(p * (1-p) / n). Solving for n yields a sample size of N = 2,491. This is not surprising because public opinion polls often use large samples to predict election outcomes, because small samples would not be informative. Thus, Cumming’s example shows how easy it is to draw inferences from confidence intervals when sample sizes are large and confidence intervals are tight. However, it is unrealistic to assume that psychologists can and will conduct every study with samples of N = 1,000. Thus, the real question is how useful confidence intervals are in a typical research context, when researchers do not have sufficient resources to collect data from hundreds of participants for a single hypothesis test.
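
Working backwards from Cumming's example in R, following the two-standard-error approximation used in the text (with the exact 1.96 multiplier the implied sample size is slightly smaller):

# Recover the sample size behind "53%, with a 95% CI of [51, 55]".
p  <- .53
se <- (.55 - .51) / 4            # the 95% CI spans about four standard errors: SE = .01
n  <- p * (1 - p) / se^2         # solve SE = sqrt(p * (1 - p) / n) for n
c(se = se, n = round(n))         # n is about 2,491, as stated above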

For example, sampling error for a between-subject design with N = 100 (n = 50 per cell) is SE = 2 / sqrt(100) = .2. Thus, the lower and upper limit of the 95%CI are 4/10 of a standard deviation away from the observed mean and the full width of the confidence interval covers 8/10th of a standard deviation. If the true effect size is small to moderate (d = .3) and a researcher happens to obtain the true effect size in a sample, the confidence interval would range from d = -.1 to d = .7. Does this result support the presence of a positive effect in the population? Should this finding be published? Should this finding be reported in newspaper articles as evidence for a positive effect? To answer this question, it is necessary to have a decision criterion.

One way to answer this question is to compute the signal-to-noise ratio, .3/.2 = 1.5 and to compute the probability that the positive effect in the sample could have occurred just by chance, t(98) = .3/.2 = 1.5, p = .15 (two-tailed). Given this probability, we might want to see stronger evidence. Moreover, a researcher is unlikely to be happy with this result. Evidently, it would have been better to conduct a study that could have provided stronger evidence for the predicted effect, say a confidence interval of d = .25 to .35, but that would have required a sample size of N = 6,500 participants.

A wide confidence interval can also suggest that more evidence is needed, but the important question is how much more evidence is needed and how narrow a confidence interval should be before it can give confidence in a result. NHST provides a simple answer to this question. The evidence should be strong enough to reject the null-hypothesis with a specified error rate. Cumming’s new statistics provides no answer to the important question. The new statistics is descriptive, whereas NHST is an inferential statistic. As long as researchers merely want to describe their data, they can report their results in several ways, including reporting of confidence intervals, but when they want to draw conclusions from their data to support theoretical claims, it is necessary to specify what information constitutes sufficient empirical evidence.

One solution to this dilemma is to use confidence intervals to test the null-hypothesis. If the 95% confidence interval does not include 0, the ratio of effect size to sampling error is greater than about 2 and the p-value is less than .05. This is the main reason why many statistics programs report 95%CIs rather than 33%CIs or 66%CIs. However, the use of 95% confidence intervals to test significance is hardly a new statistical approach that justifies the proclamation of a new statistics that will save empirical scientists from NHST. It is NHST! Not surprisingly, Cumming states that “this is my least preferred way to interpret a confidence interval” (p. 17).
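A small illustration of this equivalence (the effect size and standard error are hypothetical values chosen for the example):

d  <- .45; se <- .2                       # hypothetical effect size and standard error
c(d - 1.96 * se, d + 1.96 * se)           # 95% CI: about .06 to .84, excludes zero
2 * (1 - pnorm(d / se))                   # two-tailed p-value: ~ .02, below .05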

However, he does not explain how researchers should interpret a 95% confidence interval that does include zero. Instead, he thinks it is not necessary to make a decision. “We should not lapse back into dichotomous thinking by attaching any particular importance to whether a value of interest lies just inside or just outside our CI.”

Does an experimental treatment for Ebola work? CI = -.3 to .8. Should we try it, or should we do nothing and keep running studies forever? The benefit of avoiding decisions is that one can never make a mistake. The cost is that one can also never claim that an empirical claim is supported by evidence. Anybody who is worried about dichotomous thinking might ponder the fact that modern information processing is built on the simple dichotomy of 0/1 bits of information and that it is common practice to decide the fate of undergraduate students by scoring multiple-choice tests in terms of true or false answers.

In my opinion, the solution to the credibility crisis in psychology is not to move away from dichotomous thinking, but to obtain better data that provide more conclusive evidence about theoretical predictions, and a simple way to do so is to reduce sampling error. As sampling error decreases, confidence intervals get narrower and are less likely to include zero when an effect is present, and the signal-to-noise ratio increases so that p-values get smaller and smaller when an effect is present. Thus, less sampling error also means fewer decision errors.

The question is how small sampling error should be to reduce decision errors, and at what point resources are being wasted because the signal-to-noise ratio is already clear enough to make a decision.

Power Analysis

Cumming does not distinguish between Fisher’s and Neyman-Pearson’s use of p-values. The main difference is that Fisher advocated the use of p-values without strict criterion values for significance testing. This approach treats p-values, just like confidence intervals, as continuous statistics that do not imply an inference. A p-value of .03 is significant with a criterion value of .05, but it is not significant with a criterion value of .01.

Neyman and Pearson introduced the concept of a fixed criterion value to draw conclusions from observed data. A criterion value of p = .05 has a clear interpretation: testing 1,000 true null-hypotheses is expected to produce about 50 significant results (type-I errors). A lower error rate can be achieved by lowering the criterion value (p < .01 or p < .001).
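A quick simulation illustrates this error rate; the setup (1,000 two-sample t-tests of a true null-hypothesis with n = 20 per group) is my own illustration, not an example from Cumming’s article:

set.seed(1)
p <- replicate(1000, t.test(rnorm(20), rnorm(20))$p.value)  # 1,000 tests of a true null
sum(p < .05)                                                # roughly 50 type-I errors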

Importantly, Neyman and Pearson also considered the complementary problem that the p-value may fail to reach the critical value when an effect is actually present. They called the probability of this outcome the type-II error. Unfortunately, social scientists have ignored this aspect of Neyman-Pearson Significance Testing (NPST). Researchers can reduce the risk of type-II errors by reducing sampling error, because a reduction of sampling error increases the signal-to-noise ratio.

For example, the following p-values were obtained from simulating studies with 95% power. The graph only shows p-values greater than .001 to make the distribution of the remaining p-values more visible. As a result, 62.5% of the simulated p-values are not shown because they are below .001. The histogram of p-values has been popularized by Simonsohn et al. (2013) as a p-curve. The p-curve shows that p-values are heavily skewed towards low values. Thus, the studies provide consistent evidence that an effect is present, even though p-values can vary dramatically from one study (p = .0001) to the next (p = .02). The variability of p-values is not a problem for NPST as long as the p-values lead to the same conclusion, because the magnitude of a p-value is not important in Neyman-Pearson hypothesis testing.

[Figure: histogram (p-curve) of p-values from simulated studies with 95% power]
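For readers who want to reproduce the 95% power simulation, here is a minimal sketch analogous to the R code for the 20% power case given below (the display cutoff of p < .001 follows the description above):

set.seed(1)
power <- .95
z <- rnorm(2500, qnorm(.975) + qnorm(power))   # test statistics from studies with 95% power
p <- (1 - pnorm(z)) * 2                        # two-tailed p-values
mean(p < .001)                                 # ~ .62: share of p-values below the cutoff
hist(p[p > .001], breaks = 100)                # remaining p-values pile up at low values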

The next graph shows p-values for studies with 20% power. P-values vary just as much, but now the variation covers both sides of the significance criterion, p = .05. As a result, the evidence is often inconclusive and 80% of studies fail to reject the false null-hypothesis.

[Figure: histogram (p-curve) of p-values from simulated studies with 20% power]

R-Code
# simulate p-values for 2,500 two-tailed z-tests with 20% power (alpha = .05)
seed <- nchar("Cumming'sDancingP-Values")   # nchar (string length) instead of length, so the seed is not simply 1
set.seed(seed)
power <- .20
low_limit <- .000
up_limit  <- .10
# non-centrality for 20% power: critical z plus qnorm(power)
z <- rnorm(2500, qnorm(.975) + qnorm(power), 1)
p <- (1 - pnorm(z)) * 2
hist(p, breaks = 1000, freq = FALSE, ylim = c(0, 100), xlim = c(low_limit, up_limit))
abline(v = .05, col = "red")
# proportion of p-values below the lower display limit (cut off from the histogram)
percent_below_lower_limit <- length(subset(p, p < low_limit)) / length(p)
percent_below_lower_limit
If a study is designed to test a qualitative prediction (e.g., an experimental manipulation leads to an increase on an observed measure), power analysis can be used to plan the study so that it has a high probability of providing evidence for the hypothesis if the hypothesis is true. It does not matter whether the hypothesis is tested with p-values or with confidence intervals by showing that the confidence interval does not include zero.

Thus, power analysis seems useful even for the new statistics. However, Cumming is “ambivalent about statistical power” (p. 23). First, he argues that it has “no place when we use the new statistics” (p. 23), presumably because the new statistics never makes dichotomous decisions.

Cumming’s next argument against power is that power is a function of the type-I error criterion. If the type-I error probability is set to 5% and power is only 33% (e.g., d = .5, between-group design, N = 40), it is possible to increase power by increasing the type-I error probability. If the type-I error rate is set to 50%, power is 80%. Cumming thinks that this is an argument against power as a statistical concept, but raising alpha to 50% is equivalent to reducing the width of the confidence interval by computing a 50% confidence interval rather than a 95% confidence interval. Moreover, researchers who adjust alpha to 50% are essentially saying that the null-hypothesis would produce a significant result in every other study. If an editor finds this acceptable and wants to publish the results, neither power analysis nor the reported results are problematic. It is true that there would be a good chance to get a significant result when a moderate effect is present (d = .5, 80% probability) and when no effect is present (d = 0, 50% probability). Power analysis provides accurate information about the type-I and type-II error rates. In contrast, the new statistics provides no information about error rates in decision making because it is merely descriptive and does not make decisions.
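The trade-off can be verified with a simple normal-approximation power calculation (the helper function is mine; exact values based on the t-distribution differ slightly):

power_at_alpha <- function(alpha, d = .5, N = 40) {   # between-subject design, n = N/2 per cell
  se <- 2 / sqrt(N)
  1 - pnorm(qnorm(1 - alpha / 2) - d / se)
}
power_at_alpha(.05)   # ~ .35 (close to the 33% in the text, which uses the exact t-distribution)
power_at_alpha(.50)   # ~ .80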

Cumming then points out that “power calculations have traditionally been expected [by granting agencies], but these can be fudged” (p. 23). The problem with fudging a power analysis is that the requested grant money may be sufficient to conduct the study, but insufficient to produce a significant result. For example, a researcher may optimistically expect a strong effect, d = .80, when the true effect size is only small, d = .20. The researcher conducts a study with N = 52 participants to achieve 80% power. In reality, the study has only 11% power, and the researcher is likely to end up with a non-significant result. In the new-statistics world this is apparently not a problem because the researcher can report the results with a wide confidence interval that includes zero, but it is not clear why a granting agency should fund studies that cannot even provide information about the direction of an effect in the population.
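Using the same normal approximation, the planned and the actual power in this example can be computed as follows:

se <- 2 / sqrt(52)                     # standard error of d for N = 52 (n = 26 per cell)
1 - pnorm(qnorm(.975) - .8 / se)       # ~ .80: power assumed at the planning stage (d = .8)
1 - pnorm(qnorm(.975) - .2 / se)       # ~ .11: actual power if the true effect is d = .2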

Cumming then notes that “one problem is that we never know true power, the probability that our experiment will yield a statistically significant result, because we do not know the true effect size; that is why we are doing the experiment!” (p. 24). The exclamation mark suggests that this is the final dagger in the coffin of power analysis: power analysis is useless because it makes assumptions about effect sizes when we can just run an experiment and observe the effect size. It is that easy in the world of the new statistics. The problem is that we do not know the true effect size after an experiment either. We never know the true effect size because we can never determine a population parameter exactly, just like we can never prove the null-hypothesis. It is only possible to estimate population parameters. However, before we estimate a population parameter, we may simply want to know whether an effect exists at all. Power analysis helps in planning studies so that the sample mean shows the same sign as the population mean with a specified error rate.

Determining Sample Sizes in the New Statistics

Although Cumming does not find power analysis useful, he gives some information about sample sizes. Studies should be planned to have a specified level of precision. Cumming gives an example for a between-subject design with n = 50 per cell (N = 100). He chose to present confidence intervals for unstandardized coefficients. In this case, there is no fixed value for the width of the confidence interval because the sampling variance influences the standard error. However, for standardized coefficients like Cohen’s d, the standard error is approximately constant, while sampling variance merely produces variation in the standardized coefficients. The standard error is simply 2 / sqrt(N), which equals SE = .2 for N = 100. This value needs to be multiplied by 2 to get the confidence interval, so the 95%CI = d +/- .4. Thus, it is known before the study is conducted that the confidence interval will span 8/10 of a standard deviation and that an observed effect size of d > .4 is needed to exclude 0 from the confidence interval and to state with 95% confidence that the observed effect size would not have occurred if the true effect size were 0 or in the opposite direction.
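Because the standard error of d depends only on N in this design, the expected precision is known before any data are collected. A small sketch (sample sizes other than N = 100 are chosen for illustration):

ci_half_width <- function(N) 2 * (2 / sqrt(N))   # 95% CI: d +/- ci_half_width(N)
ci_half_width(100)     # .40: the CI spans 8/10 of a standard deviation
ci_half_width(400)     # .20
ci_half_width(1600)    # .10: d +/- 1/10 of a standard deviation requires N = 1,600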

The problem is that Cumming provides no guidelines about the level of precision that a researcher should achieve. Is 8/10 of a standard deviation precise enough? Should researchers aim for 1/10 of a standard deviation? Thus, when he suggests that funding agencies should focus on precision, it is not clear what criterion should be used to fund research.

One obvious criterion would be to ensure that precision is sufficient to exclude zero, so that the results can be used to state that the direction of the observed effect is the same as the direction of the effect in the population a researcher wants to generalize to. However, as soon as effect sizes are used in planning the precision of a study, precision planning is equivalent to power analysis. Thus, the main novel aspect of the new statistics is to ignore effect sizes in the planning of studies, without providing guidelines about desirable levels of precision. Researchers should be aware that N = 100 in a between-subject design gives a confidence interval that spans 8/10 of a standard deviation. Is that precise enough?

Problem of Questionable Research Practices, Publication Bias, and Multiple Testing

A major limitation of any statistical method is the assumption that random sampling error is the only source of error. However, the current replication crisis has demonstrated that reported results are also systematically biased. A major challenge for any statistical approach, old or new, is to deal effectively with systematically biased data.

It is impossible to detect bias in a single study. However, when more than one study is available, it becomes possible to examine whether the reported data are consistent with the statistical assumption that each sample is independent and that the results in each sample are a function only of the true effect size and random sampling error; in other words, that there is no systematic error biasing the results. Numerous statistical methods have been developed to examine whether data are biased or not.

Cumming (2014) does not mention a single method for detecting bias (funnel plots, Egger regression, the Test of Excessive Significance, the Incredibility Index, P-Curve, the Test of Insufficient Variance, the Replicability-Index, P-Uniform). He merely mentions a visual inspection of forest plots and suggests that “if for example, a set of studies is distinctly too homogeneous – it shows distinctly less bouncing around than we would expect from sampling variability… we can suspect selection or distortion of some kind” (p. 23). However, he provides no criteria that explain how the variability of observed effect sizes should be compared against the predicted variability and how the presence of bias influences the interpretation of a meta-analysis. Instead, he concludes that “even so [biases may exist], meta-analysis can give the best estimates justified by research to date, as well as the best guidance for practitioners” (p. 23). By this logic, the new statistics would suggest that extrasensory perception is real because a meta-analysis of Bem’s (2011) infamous Journal of Personality and Social Psychology article shows an effect with a tight confidence interval that does not include zero. In contrast, other researchers have demonstrated with old statistical tools, and with the help of post-hoc power, that Bem’s results are not credible (Francis, 2012; Schimmack, 2012).

Research Integrity

Cumming also advocates research integrity. His first point is that psychological science should “promote research integrity: (a) a public research literature that is complete and trustworthy and (b) ethical practice, including full and accurate reporting of research” (p. 8). However, his own article falls short of this ideal. It does not provide a complete, balanced, and objective account of the statistical literature. Rather, Cumming (2014) cherry-picks references that support his claims and omits references that are inconvenient for them. Below, I give one clear example of bias in his literature review.

He cites Ioannidis’s 2005 paper to argue that p-values and NHST are flawed and should be abandoned. However, he does not cite Ioannidis and Trikalinos (2007). This article introduces a statistical approach that can detect bias in meta-analyses by comparing the success rate (the percentage of significant results) to the observed power of the studies. Because power determines the success rate in an honest set of studies, a success rate that exceeds the observed power reveals publication bias. Cumming not only fails to mention this article; he goes on to warn readers to “beware of any power statement that does not state an ES; do not use post hoc power.” Without further elaboration, this would imply that readers should ignore evidence for bias obtained with the Test of Excessive Significance because it relies on post-hoc power. To support this claim, he cites Hoenig and Heisey (2001), claiming that “post hoc power can often take almost any value, so it is likely to be misleading” (p. 24). This statement is misleading because post-hoc power is no different from any other statistic that is influenced by sampling error. In fact, Hoenig and Heisey (2001) show that post-hoc power in a single study is monotonically related to the p-value. Their main point is that post-hoc power provides no information beyond the p-value. However, like p-values, post-hoc power becomes more informative the higher it is. A study with 99% post-hoc power is likely to be a high-powered study, just like extremely low p-values, p < .0001, are unlikely to be obtained in low-powered studies or in studies in which the null-hypothesis is true. So, post-hoc power is informative when it is high. Cumming (2014) further ignores that the variability of post-hoc power estimates decreases in a meta-analysis of post-hoc power and that post-hoc power has been used successfully to reveal bias in published articles (Francis, 2012; Schimmack, 2012). Thus, his statement that researchers should ignore post-hoc power analyses is not supported by an unbiased review of the literature, and his article does not provide a complete and trustworthy account of the public research literature.
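To make the logic of the Test of Excessive Significance concrete, here is a rough sketch; the power values are hypothetical, and the binomial test based on mean power is a simplification of Ioannidis and Trikalinos’s procedure:

observed_power <- c(.55, .60, .50, .65, .58)   # hypothetical post-hoc power estimates
k         <- length(observed_power)
successes <- 5                                 # all five studies reported p < .05
sum(observed_power)                            # ~ 2.9 significant results expected
1 - pbinom(successes - 1, k, mean(observed_power))  # ~ .06: 5 out of 5 is improbable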

Conclusion

I cannot recommend Cumming’s new statistics. I routinely report confidence intervals in my empirical articles, but I do not consider them a new statistical tool. In my opinion, the root cause of the credibility crisis is that researchers conduct underpowered studies that have a low chance of producing the predicted effect and then use questionable research practices to boost power and to hide non-significant results that could not be salvaged. A simple solution to this problem is to conduct more powerful studies that can produce significant results when the predicted effect exists. I do not claim that this is a new insight. Rather, Jacob Cohen spent much of his career trying to educate psychologists about the importance of statistical power.

Here is what Jacob Cohen had to say about the new statistics in 1994, using time travel to comment on Cumming’s article 20 years later.

“Everyone knows” that confidence intervals contain all the information to be found in significance tests and much more. They not only reveal the status of the trivial nil hypothesis but also about the status of non-nil null hypotheses and thus help remind researchers about the possible operation of the crud factor. Yet they are rarely to be found in the literature. I suspect that the main reason they are not reported is that they are so embarrassingly large! But their sheer size should move us toward improving our measurement by seeking to reduce the unreliable and invalid part of the variance in our measures (as Student himself recommended almost a century ago). Also, their width provides us with the analogue of power analysis in significance testing—larger sample sizes reduce the size of confidence intervals as they increase the statistical power of NHST” (p. 1002).

If you are looking for a book on statistics, I recommend Cohen’s old statistics over Cumming’s new statistics, p < .05.

Conflict of Interest: I do not have a book to sell (yet), but I strongly believe that power analysis is an important tool for all scientists who have to deal with uncontrollable variance in their data. Therefore, I am strongly opposed to Cumming’s push for a new statistics that offers researchers no guidelines for optimizing the use of their resources to obtain credible evidence for effects that actually exist and no guidelines for how science can correct false-positive results.

The Test of Insufficient Variance (TIVA): A New Tool for the Detection of Questionable Research Practices

I have finally corrected and updated the introduction to the Test of Insufficient Variance, showing bias in Bem (2011) and Vohs (2006).

Replicability-Index

It has been known for decades that published results tend to be biased (Sterling, 1959). For most of the past decades, this inconvenient truth has been ignored. In recent years, there have been many suggestions and initiatives to increase the replicability of reported scientific findings (Asendorpf et al., 2013). One approach is to examine published research results for evidence of questionable research practices (see Schimmack, 2014, for a discussion of existing tests). This blog post introduces a new test of bias in reported research findings, namely the Test of Insufficient Variance (TIVA).

TIVA is applicable to any set of studies that used null-hypothesis significance testing to conclude that the empirical data support an empirical relationship and that reported a significance test (p-values).

Rosenthal (1978) developed a method to combine results of several independent studies by converting p-values into z-scores. This conversion uses the well-known fact…
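As a rough illustration of the idea behind TIVA (the p-values are hypothetical; see the original post for the actual procedure), p-values are converted to z-scores and the variance of these z-scores is compared to the variance of 1 expected from sampling error alone:

p <- c(.049, .031, .040, .022, .045)   # hypothetical p-values from five studies
z <- qnorm(1 - p / 2)                  # two-tailed p-values converted to z-scores
v <- var(z)                            # ~ .02, far less than the expected variance of 1
pchisq(v * (length(p) - 1), df = length(p) - 1)   # ~ .0006: insufficient variance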
