
Are You Planning a 10-Study Article? You May Want to Read This First

Here is my advice for researchers who are planning to write a 10-study article.  Don’t do it.

And here is my reason why.

Schimmack (2012) pointed out the problem of conducting multiple studies to test a set of related hypotheses in a single article (a.k.a. multiple-study articles). The problem is that even a single study in psychology tends to have only modest power to produce empirical support for a correct hypothesis (p < .05, two-tailed). This probability, called statistical power, is estimated to be 50% to 60% on average. When researchers conduct multiple hypothesis tests, the probability of obtaining significant results in all of them decreases exponentially. With 50% power, the probability that all tests produce a significant result halves with each additional study (.500, .250, .125, .063, etc.).

Schimmack (2012) used the term total power for the probability that a set of related hypothesis tests produces only significant results. Few researchers who plan multiple-study articles consider total power in the planning of their studies, and multiple-study articles do not explain how researchers dealt with the likely outcome of a non-significant result. The most common practice is to simply ignore non-significant results and to report only the studies that produced significant results. The problem with this approach is that the reported results overstate the empirical support for a theory, reported effect sizes are inflated, and researchers who want to build on these published findings are likely to end up with a surprising failure to replicate the original findings. A failed replication is surprising because the authors of the original article appeared to be able to obtain significant results in all studies. However, the reported success rate is deceptive and does not reveal the actual probability of a successful replication.

A number of statistical methods (TIVA, R-Index, P-Curve) have been developed to provide a more realistic impression of the strength and credibility of published results in multiple-study articles. In this post, I used these tests to examine the evidence in a 10-study article in Psychological Science by Adam A. Galinski (Columbia Business School, Columbia University). I chose this article because it reports more studies than any other article in Psychological Science.

All 10 studies reported statistically significant results in support of the authors’ theoretical predictions.  An a priori power analysis suggests that authors who aim to present evidence for a theory in 10 studies need 98% power (.80 raised to the power of 1/10) in each study to have an 80% probability of obtaining significant results in all of them.
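For readers who want to check this arithmetic, here is a minimal R sketch of the total-power calculation; it simply restates the formula above and is not tied to any particular article.

# Per-study power needed so that all k studies produce significant results
# with a desired total power.
total_power <- .80
k <- 10
total_power^(1 / k)   # ~.978, i.e., roughly 98% power needed in every single study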

Each study reported several statistical results. I focused on the first focal hypothesis test in each study to obtain statistical results for the examination of bias and evidential value. The p-values for each statistical test were converted into z-scores by taking the standard normal quantile of 1 − p/2; a short R sketch of these conversions follows the table.

Study N statistics p z obs.power
1 53 t(51)=2.71 0.009 2.61 0.74
2 61 t(59)=2.12 0.038 2.07 0.54
3 73 t(71)=2.78 0.007 2.7 0.77
4 33 t(31)=3.33 0.002 3.05 0.86
5 144 t(142)=2.04 0.043 2.02 0.52
6 83 t(79)=2.55 0.013 2.49 0.7
7 74 t(72)=2.24 0.028 2.19 0.59
8 235 t(233)=2.46 0.015 2.44 0.68
9 205 t(199)=3.85 0 3.78 0.97
10 109 t(104)=2.60 0.011 2.55 0.72
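The p, z, and observed-power columns can be reproduced from the reported t-values with a few lines of R. This is a hedged sketch that assumes two-tailed p-values and alpha = .05; small discrepancies with the table are due to rounding.

# Convert the reported t-values into two-tailed p-values, z-scores, and
# observed power (the power implied by the observed z-score at alpha = .05).
t_vals <- c(2.71, 2.12, 2.78, 3.33, 2.04, 2.55, 2.24, 2.46, 3.85, 2.60)
dfs    <- c(51, 59, 71, 31, 142, 79, 72, 233, 199, 104)
p <- 2 * pt(abs(t_vals), dfs, lower.tail = FALSE)   # two-tailed p-values
z <- qnorm(1 - p / 2)                               # standard normal quantile of 1 - p/2
obs_power <- pnorm(z - qnorm(.975))                 # observed power, ignoring the opposite tail
round(data.frame(p, z, obs_power), 2)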

 

TIVA

The Test of Insufficient Variance (TIVA) was used to examine whether the variation in z-scores is consistent with the amount of sampling error that is expected for a set of independent studies (Var = 1). The variance in z-scores is smaller than one would expect from a set of 10 independent studies, Var(z) = .27. The probability that this reduction in variance occurred just by chance is p = .02.
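The TIVA result can be verified directly from the z-scores in the table. The following R sketch treats (k − 1) times the variance of the z-scores as a chi-square statistic with k − 1 degrees of freedom and computes the left-tailed probability of observing such a small variance.

# Test of Insufficient Variance (TIVA) for the ten z-scores above.
z <- c(2.61, 2.07, 2.70, 3.05, 2.02, 2.49, 2.19, 2.44, 3.78, 2.55)
k <- length(z)
var_z <- var(z)                # ~.27, well below the expected value of 1
chi_sq <- (k - 1) * var_z      # chi-square statistic with k - 1 degrees of freedom
pchisq(chi_sq, df = k - 1)     # left-tailed p ~.02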

Thus, there is evidence that the perfect 10 for 10 rate of significant results was obtained by means of dishonest reporting practices. Either failed studies were not reported or significant results were obtained with undisclosed research methods. For example, given the wide variation in sample sizes, optional stopping may have been used to obtain significant results. Consistent with this hypothesis, there is a strong correlation between sampling error (se = 2/sqrt[N]) and effect size (Cohen’s d = t * se) across the 10 studies, r(10) = .88.
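The reported correlation between sampling error and effect size can be recomputed from the table, using the approximations given above (se = 2/sqrt(N), d = t * se); this is only a rough check based on the rounded table values.

# Correlation between sampling error and effect size across the ten studies.
N      <- c(53, 61, 73, 33, 144, 83, 74, 235, 205, 109)
t_vals <- c(2.71, 2.12, 2.78, 3.33, 2.04, 2.55, 2.24, 2.46, 3.85, 2.60)
se <- 2 / sqrt(N)     # approximate sampling error of Cohen's d
d  <- t_vals * se     # approximate effect size (Cohen's d)
cor(se, d)            # ~.88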

R-INDEX

The median observed power for the 10 studies is 71%. Not a single study had the 98% observed power that would be needed for 80% total power.   Moreover, the 71% estimate is itself an inflated estimate of power because the success rate (100%) exceeds observed power (71%). After subtracting the inflation rate (100 – 71 = 29), the R-Index is 43%.
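The R-Index calculation is simple enough to show in a few lines of R. The sketch below uses the rounded z-scores from the table, so the result can differ from the reported 43% by a rounding margin.

# R-Index: median observed power minus the inflation rate.
z <- c(2.61, 2.07, 2.70, 3.05, 2.02, 2.49, 2.19, 2.44, 3.78, 2.55)
obs_power <- pnorm(z - qnorm(.975))            # observed power of each study
success_rate <- 1.00                           # 10 out of 10 reported tests were significant
inflation <- success_rate - median(obs_power)  # how much the success rate exceeds power
median(obs_power) - inflation                  # R-Index, ~.42-.43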

An R-Index of 43% is below 50%, suggesting that the true power of the studies is below 50%. Researchers who conduct an exact replication study are therefore more likely to end up with a replication failure than with a successful replication, despite the apparent ability of the original authors to obtain significant results in all reported studies.

P-CURVE

A p-curve analysis shows that the results have evidential value, p = .02, using the conventional criterion of p < .05. That is, it is unlikely that these 10 significant results were obtained without a real effect in at least one of the ten studies. However, excluding the most highly powered test in Study 9 renders the p-curve results inconclusive, p = .11; that is, the hypothesis that the remaining 9 results were obtained without a real effect cannot be rejected at the conventional level of significance (p < .05).
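For readers who want to see how such a p-curve result can be computed, here is a hedged sketch of the right-skew (evidential value) test in the Stouffer form used by later versions of the p-curve app; it is not necessarily the exact analysis behind the numbers above. Each significant p-value is rescaled to pp = p/.05, converted to a z-score, and the z-scores are combined. The p-value for Study 9 is listed as 0 in the table and is recovered here from its z-score (2 * (1 − pnorm(3.78)) is roughly .00016).

# Stouffer-style right-skew test on pp-values (full p-curve).
p_vals <- c(.009, .038, .007, .002, .043, .013, .028, .015, .00016, .011)
pp   <- p_vals / .05                           # p-values conditional on significance
z_pp <- qnorm(pp)                              # probit-transformed pp-values
pnorm(sum(z_pp) / sqrt(length(z_pp)))          # ~.02 with all ten studies
pnorm(sum(z_pp[-9]) / sqrt(length(z_pp) - 1))  # ~.11 after excluding Study 9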

These results show that the empirical evidence in this article is weak despite the impressive number of studies. The reason is that the absolute number of significant results is not an indicator of the strength of evidence, and the reported rate of significant results is not an indicator of the strength of evidence when non-significant results are left out.

CONCLUSION

The statistical examination of this 10-study article reveals that the reported results are less robust than the 100% success rate suggests and that the reported results are unlikely to provide a complete account of the research program that generated the reported findings. Most likely, the researchers used optional stopping to increase their chances of obtaining significant results.

It is important to note that optional stopping is not necessarily a bad or questionable research practice. It is only problematic when its use is not disclosed. The reason is that optional stopping leads to biased effect size estimates and increases the type-I error probability, which invalidates the claim that results were significant at the nominal level that limits the type-I error rate to 5%.

The results also highlight that the researchers were too ambitious in their goal to produce significant results in 10 studies. Even though their sample sizes are sometimes larger than the typical sample size in Psychological Science (N ~ 80), much larger samples would have been needed to produce significant results in all 10 studies.

It is also important to note that the article was published in 2013, when it was still common practice to exclude studies that failed to produce supporting evidence and to present results without full disclosure of the research methods that were used to produce them. Thus, the authors did not violate the ethical standards of scientific integrity of their time.

However, publication standards are changing. When journals require full disclosure of data and methods, researchers need to change the way they plan their studies.  There are several options for researchers to change their research practices.

First, they can reduce the number of studies so that each study has a larger sample size and higher power to produce significant results. Authors who wish to report results from multiple studies need to take total power into account.  Eighty percent power in a single study is insufficient to produce significant results in multiple studies; because total power equals per-study power raised to the power of k (the number of studies), the power of each study needs to be raised to total power^(1/k).
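To see what this means in terms of sample sizes, here is a hedged illustration using the pwr package and a hypothetical effect of d = 0.4 in a between-subjects design; the numbers are examples, not the sample sizes of any particular study.

# Per-group sample size for 80% power in a single study versus the ~96%
# per-study power needed for 80% total power across five studies.
library(pwr)
per_study_power <- .80^(1 / 5)                                   # ~.956
pwr.t.test(d = 0.4, power = .80, sig.level = .05)$n              # ~99 per group
pwr.t.test(d = 0.4, power = per_study_power, sig.level = .05)$n  # ~170 per group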

Second, researchers can increase power by reducing the standard for statistical significance in a single study.  For example, it may be sufficient to claim support for a theory if 5 studies produced significant results with alpha = .20 (a 20% type-I error rate per study) because the combined type-I error rate decreases with the number of studies (total alpha = alpha ^ k).  Researchers can also conduct a meta-analysis of their individual studies to examine the total evidence across studies.
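A small sketch with hypothetical numbers illustrates both options: the joint type-I error rate when every study must pass a lenient alpha, and a Stouffer-style meta-analysis that combines the evidence across studies into a single test.

# Combined type-I error rate with a lenient per-study alpha.
alpha <- .20
k <- 5
alpha^k                                       # ~.0003 if all five studies must be significant
# Stouffer's method: combine the individual z-scores into one meta-analytic test.
z <- c(1.4, 0.9, 1.7, 1.1, 1.5)               # hypothetical per-study z-scores
pnorm(sum(z) / sqrt(k), lower.tail = FALSE)   # one combined p-value across studies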

Third, researchers can specify a priori how many non-significant results they are willing to obtain and report. For example, researchers who plan 5 studies with 80% power can state that they expect one non-significant result.  An honest set of results will typically produce a variance of z-scores in accordance with sampling theory (Var(z) = 1), median observed power would be about 80%, and there would be no inflation (the expected success rate of 80% minus the expected median observed power of 80% equals 0).  Thus, the R-Index would be 80 – 0 = 80.
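A quick simulation shows what such an honest set of results looks like in the long run, assuming a true power of 80% at alpha = .05 (two-tailed); the numbers below are simulated, not taken from any article.

# Simulate a large number of honestly reported studies with 80% true power.
set.seed(123)
ncp <- qnorm(.975) + qnorm(.80)        # expected z-score for 80% power (~2.80)
z   <- rnorm(1e5, mean = ncp, sd = 1)  # z-scores of simulated studies
var(z)                                 # ~1, as sampling theory predicts
mean(z > qnorm(.975))                  # success rate ~.80
median(pnorm(z - qnorm(.975)))         # median observed power ~.80, so no inflation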

In conclusion, there are many ways to obtain and report empirical results. There is only one way that is no longer an option, namely selective reporting of results that support theoretical predictions.  Statistical tests like TIVA, the R-Index, and P-Curve can reveal these practices and undermine the apparent value of articles that report many, and only, significant results.  As a result, the incentive structure is changing (again*) and researchers need to think hard about the amount of resources they really need to produce empirical results in multiple studies.

Footnotes

* The multiple-study article is a unique phenomenon that emerged in experimental psychology in the 1990s. It was supposed to provide more solid evidence and to protect against type-I errors in single-study articles that presented exploratory results as if they confirmed theoretical predictions (HARKing).  However, dishonest reporting practices made it possible to produce impressive results without increased rigor.  At the same time, the allure of multiple-study articles crowded out research that took time or required extensive resources to conduct even a single study.  As a result, multiple-study articles often report studies that are quick (they take less than 1 hour to complete) and cost little (Mturk participants are paid less than $1) or nothing (undergraduate students receive course credit).  Without real benefits and with detrimental effects on the quality of empirical studies, I expect a decline in the number of studies per article and an increase in the quality of individual studies.


Dr. R Expresses Concerns about Results in Latest Psychological Science Article by Yaacov Trope and colleagues

This morning, a tweet by Jeff Rouder suggested taking a closer look at an online-first article published in Psychological Science.

When the Spatial and Ideological Collide: Metaphorical Conflict Shapes Social Perception

http://pss.sagepub.com/content/early/2016/02/01/0956797615624029.abstract

 

[Figure: Powergraph for Yaacov Trope]

The senior author of the article is Yaacov Trope from New York University. The powergraph for Yaacov Trope suggests that the average significant result reported in his articles is based on a study with 52% power in the years 2000-2012 and 43% power in the recent years 2013-2015. The difference is probably not reliable, but the results show no evidence that Yaacov Trope has changed his research practices in response to the criticism of psychological research practices over the past five years.

[Figure: Powergraph for Yaacov Trope (second graph)]

The average of 50% power for statistically significant results would suggest that every other test of a theoretical prediction produces a non-significant result. If, however, articles typically report that the results confirmed a statistical prediction, it is clear that dishonest reporting practices (excluding non-significant results or using undisclosed statistical methods like optional stopping) were used to present results that confirm theoretical predictions.

Moreover, the 50% estimate is an average. Power varies as a function of the strength of evidence, and power for just-significant results is lower than 50%. The range of z-scores from 2 to 2.6 approximately covers p-values from .05 to .01 (just-significant results). Average power for p-values in this range can be estimated by examining the contributions of the red (less than 20% power), black (50% power), and green (85% power) densities. In both graphs the density in this area is fully covered by the red and black lines, which implies that power is a mixture of 20% and 50%, that is, less than 50%. In the more reliable powergraph on the left, the red line (less than 20% power) covers a large portion of the area under the curve, suggesting that power for p-values between .05 and .01 is less than 33%.

The powergraph suggests that statistically significant results are only obtained with the help of random sampling error, reported effect sizes are inflated, and the probability of a false positive result is high, because in underpowered studies the ratio of true positives to false positives is low.

In the article, Trope and colleagues report four studies. Casual inspection would suggest that the authors conducted a rigorous program of research. They had relatively large samples (Ns = 239 to 410) and reported a priori power analyses suggesting that they had 80% power to detect the predicted effects.

However, closer inspection with modern statistical methods that examine the robustness of results in multiple-study articles shows that the reported results cannot be interpreted at face value. To maintain statistical independence, I picked the first focal hypothesis test from each of the four studies.

Study N statistic p z obs.power
1 239 t(237)=2.06 0.04 2.05 0.54
2 391 t(389)=2.33 0.02 2.33 0.64
3 410 t(407)=2.13 0.03 2.17 0.58
4 327 t(325)=2.59 0.01 2.58 0.73

 

TIVA

TIVA examines whether a set of statistical results is consistent with the expected amount of sampling error. When test statistics are converted into z-scores, sampling error should produce a variance of 1. However, the observed variance of the four z-scores is Var(z) = .05. Even with just four observations, a left-tailed chi-square test shows that this reduction in variance would rarely occur by chance, p = .02. This finding is consistent with the powergraph, which shows reduced variance in z-scores because non-significant results that are predicted by the power analysis are not reported or because significant results were obtained by violating sampling assumptions (e.g., undisclosed optional stopping).

R-INDEX

The table also shows that median observed power is only 61%, indicating that the a priori power analyses systematically overestimated power because they used effect sizes that were larger than the reported effect sizes. Moreover, the success rate in the four studies is 100%. When the success rate is higher than median observed power, actual power is even lower than observed power. To correct for this inflation in observed power, the R-Index subtracts the amount of inflation (100 – 61 = 39) from observed power. The R-Index is 61 – 39 = 22. Simulation studies show that an R-Index of 22 is obtained when the null-hypothesis is true (the predicted effect does not exist) and only significant results are being reported.

As it takes, on average, 20 studies to get one significant result by chance when the null-hypothesis is true, this model would imply that Trope and colleagues conducted another 4 * 20 – 4 = 76 studies with an average of about 342 participants per study (a total of 25,973 participants) to obtain the significant results in their article. This is very unlikely. It is much more likely that Trope et al. used optional stopping to produce significant results.
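The size of the implied file drawer is easy to verify; the following R sketch just restates the arithmetic in the paragraph above.

# Implied file drawer if all four significant results were chance findings.
k_published <- 4
alpha <- .05
studies_needed <- k_published / alpha            # ~80 studies expected for 4 chance hits
extra_studies  <- studies_needed - k_published   # ~76 unreported studies
extra_studies * mean(c(239, 391, 410, 327))      # ~26,000 additional participants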

Although the R-Index cannot reveal how the reported results were obtained, it does strongly suggest that these reported results will not be replicable. That is, other researchers who conduct the same studies with the same sample sizes are unlikely to obtain significant results, although Trope and colleagues reported getting significant results 4 out of 4 times.

P-Curve

TIVA and the R-Index show that the reported results cannot be taken at face value and that the reported effect sizes are inflated. These tests do not examine whether the data provide useful empirical evidence. P-Curve examines whether the data provide evidence against the null-hypothesis after taking into account that the results are biased. P-Curve shows that the results in this article do not contain evidential value (p = .69); that is, after correcting for bias, the results do not reject the null-hypothesis at the conventional p < .05 level.

Conclusion

Statisticians have warned psychologists for decades that only reporting significant results that support theoretical predictions is not science (Sterling, 1959). However, generations of psychologists have been trained to conduct research by looking for and reporting significant results that they can explain. In the past five years, a growing number of psychologists have realized the damage of this pseudo-scientific method for advancing understanding of human behavior.

It is unfortunate that many well-established researchers have been unable to change the way they conduct research and that the very same established researchers, in their roles as reviewers and editors, continue to let this type of research be published. It is even more unfortunate that these well-established researchers do not recognize the harm they are causing to younger researchers who end up with publications that tarnish their reputation.

After five years of discussion about questionable research practices, ignorance is no longer an excuse for engaging in these practices. If optional stopping was used, it has to be declared in the description of the sampling strategy. An article in a top journal is no longer a sure ticket to an academic job if a statistical analysis reveals that the results are biased and do not contain evidential value.

Nobody benefits from empirical publications without evidential value. Why is it so hard to stop this nonsense?

A Scathing Review of “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science”

J Pers Soc Psychol. 2015 Feb;108(2):275-97. doi: 10.1037/pspi0000007.
Best research practices in psychology: Illustrating epistemological and pragmatic considerations with the case of relationship science.
Finkel EJ, Eastwick PW, Reis HT.  

[link to free pdf]

The article “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science” examines how social psychologists should respond to the crisis of confidence in the wake of the scandals that rocked social psychology in 2011 (i.e., the Stapel debacle and the Bem bust).
The article is written by prolific relationship researchers, Finkel, Eastwick, and Reis (FER), and is directed primarily at relationship researchers, but their article also has implications for social psychology in general. In this blog post, I critically examine FER’s recommendations for “best research practices.”

THE PROBLEM

FER and I are in general agreement about the problem. The goal of empirical science is to obtain objective evidence that can be used to test theoretical predictions. If the evidence supports a theoretical prediction, a theory that made this prediction gets to live another day. If the evidence does not support the prediction, the theory is being challenged and may need to be revised. The problem is that scientists are not disinterested observers of empirical phenomena. Rather, they often have a vested interest in providing empirical support for a theory. Moreover, scientists have no obligation to report all of their data or statistical analyses. As a result, the incentive structure encourages self-serving selection of supportive evidence. While data fabrication is a punishable academic offense, dishonest reporting practices have been and are still being tolerated.

The 2011 scandals led to numerous calls to curb dishonest reporting practices and to encourage or enforce honest reporting of all relevant materials and results. FER use the term “evidential value movement” to refer to researchers who have proposed changes to research practices in social psychology.

FER credit the evidential value movement with changes in research practices such as (a) reporting how sample sizes were determined to have adequate power to demonstrate predicted effects, (b) avoiding the use of dishonest research practices that inflate the strength of evidence and effect sizes, and (c) encouraging publications of replication studies independent of the outcome (i.e., a study may actually fail to provide support for a hypothesis).

FER propose that these changes are not necessarily to the benefit of social psychology. To make their point, they introduce Neyman-Pearson’s distinction between type-I errors (a.k.a. false positives) and type-II errors (a.k.a. false negatives). A type-I error occurs when a researcher concludes that an effect exists when it does not (e.g., a cold remedy shows a statistically significant result in a clinical trial, but it has no real effect). A type-II error occurs when an effect exists, but a study fails to show a statistically significant result (e.g., a cold remedy does reduce cold symptoms, but a clinical trial fails to show a statistically significant result).
By convention, the type-I error rate in social psychology is set at 5%. This means that, in the long run, no more than 5% of independent statistical tests produce false-positive results, and this maximum of 5% is only reached if all studies tested false hypotheses (i.e., they predicted an effect when no effect exists). As the number of true predictions increases, the actual rate of false-positive results decreases. If all hypotheses are true (the null-hypothesis that there is no effect is always false), the false-positive rate is 0 because it is impossible to make a type-I error. A maximum of 5% false-positive results has assured generations of social psychologists that most published results are likely to be true.

Unlike the type-I error probability, which is set by convention, the type-II error probability is unknown because it depends on the unknown size of an effect. However, meta-analyses of actual studies can be used to estimate the typical type-II error probability in social psychology. In a seminal article, Cohen (1962) estimated that the type-II error rate is 50% for studies with a medium effect size. Power for studies of larger effects is higher and power for studies of smaller effects is lower. Actual power depends on the distribution of small, medium, and large effects, but 50% is a reasonable overall estimate. Cohen (1962) also proposed that a type-II error rate of 50% is unacceptably high and suggested that researchers plan studies to reduce the type-II error rate to 20%. The probability of avoiding a type-II error is called power (Power = 1 – Prob. of a Type-II Error), and Cohen suggested that psychologists plan studies with 80% power to detect effects that actually exist.

WHAT ARE THE TYPE-I and TYPE-II ERROR RATES IN PSYCHOLOGY?

Assuming that researchers follow Cohen’s recommendation (a questionable assumption), FER write that “the field has, in principle, been willing to accept false positives 5% of the time and false negatives 20% of the time.” They then state in parentheses that the “de facto false-positive and false-negative rates almost certainly have been higher than these nominal levels”.

In this parenthetical remark, FER hide the real problem that created the evidential value movement. The main point of the evidential value movement is that a type-I error probability of 5% does not tell us much about the false-positive rate (how many false-positive results are being published) when dishonest reporting practices are allowed (Sterling, 1959).

For example, if a researcher conducts 10 tests of a hypothesis and only one test obtains a significant result and only the significant result is published, the probability of a false-positive result increased from 5% to 50%. Moreover, readers would be appropriately skeptical about a discovery that is matched by 9 failures to discover the same effect. In contrast, if readers only see the significant result, it seems as if the actual success rate is 100% rather than 10%. When only significant results are being reported, the 5% criterion no longer sets an upper limit and the real rate of false positive results could be 100% (Sterling, 1959).

The main goal of the evidential value movement is to curb dishonest reporting practices. A major theme in the evidential value movement is that editors and reviewers should be more tolerant of non-significant results, especially in multiple-study articles that contain several tests of a theory (Schimmack, 2012). For example, in a multiple-study paper with five studies and 80% power, one of the five studies is expected to produce a type-II error if the effect exists in all five studies. If power is only 50%, 2 or 3 studies should fail to provide statistically significant support for the hypothesis on their own.

Traditionally, authors excluded these studies from their multi-study articles so that all reported studies provided support for their hypothesis. To reduce this dishonest reporting practice, editors should focus on the total evidence and allow for non-significant results in one or two studies. If four out of five studies produce a significant result, there is strong evidence for a theory, and the evidence is more credible if all five studies are reported honestly.

Surprisingly, FER write that this change in editorial policy will “not necessarily alter the ratio of false positive to false negative errors” (p. ). This statement makes no sense because reporting non-significant results that were previously hidden in file-drawers would reduce the percentage of type-I errors (relative to all published results) and increase the percentage of type-II errors that are being reported (because many non-significant results in underpowered studies are type-II errors). Thus, more honest reporting of results would increase the percentage of reported type-II errors, and FER are confusing readers if they suggest that this is not the case.
Even more problematic is FER’s second scenario. In this scenario, researchers continue to conduct studies with low power (50%) and submit manuscripts with multiple studies in which half the studies show statistically significant results and the other half do not, and editors reject these articles because they do not provide strong support for the hypothesis in all studies. FER anticipate that we would “see a marked decline in journal acceptance rates”. However, FER fail to mention a simple solution to this problem. Researchers could (and should) combine the resources that were needed to produce five studies with 50% power to conduct one study that has a high probability of being successful (Schimmack, 2012). As a result, both the type-I error rate and the type-II error rate would decrease. The type-I error rate would decrease because fewer tests are being conducted (e.g., conducting 10 studies to get 5 significant results doubles the probability that a significant result was obtained even if no effect exists). The type-II error rate would decrease because researchers have more power to show the predicted effect without the use of dishonest research practices.

Alternatively, researchers can continue to conduct and report multiple underpowered studies, but abandon the elusive goal of finding significant results in each study. Instead, they could ignore significance tests of individual studies and conduct inferential statistical tests in a meta-analysis of all studies (Schimmack, 2012). The consequences for type-I and type-II error rates are the same as if researchers had conducted a single, more powerful study. Both approaches reduce type-I and type-II error rates because they reduce the number of statistical tests.

Based on their flawed reasoning, FER come to the wrong conclusion when they state “our point here is not that heightened stringency regarding false-positive rates is bad, but rather that it will almost certainly increase false-negative rates, which renders it less than an unmitigated scientific good.”

As demonstrated above, this statement is false because a reduction in statistical tests and an increase in power of each individual tests reduces the risk of type-I error rates and decreases the probability of making a type-II error (i.e., a false negative result).

WHAT IS AN ERROR BALANCED APPROACH?

Because FER start from a false premise, the recommendations for best practices that are based on this premise are questionable.  In fact, it is not even clear what their recommendations are when they introduce their error-balanced approach, which rests on three principles.

PRINCIPLE 1

The first principle is that both false positives and false negatives undermine the superordinate goals of science.

This principle is hardly controversial. It is problematic if a study shows that a drug is effective when the drug is actually not effective and it is problematic if an underpowered study fails to show that a drug is actually effective. FER fail to mention a long list of psychologists, including Jacob Cohen, who have tried to change the indifferent attitude of psychologists to non-significant results and the persistent practice of conducting underpowered studies that provide ample opportunity for multiple statistical tests so that at least one statistically significant result will emerge that can be used for a publication.

As noted earlier, the type-I error probability for a single statistical test is set at a maximum of 5%, but estimates of the type-II error probability are around 50%, a ten-fold difference. Cohen and others have advocated increasing power to 80%, which would reduce the type-II error risk to 20%. This would still imply that type-I errors are considered more harmful than type-II errors by a ratio of 1:4 (5% vs. 20%).
Yet, FER do not recommend increasing statistical power, which would imply that the type-II error rate remains at 50%. The only other way to balance the two error rates would be to increase the type-I error rate. For example, one could increase the type-I error rate to 20%. As power increases when the significance criterion increases (becomes more liberal), this approach would also decrease the risk of type-II errors. The type-II error rate decreases when alpha is raised because results that were not significant are now significant. The risk is that more of these significant results are false-positives.  In a between-subject design with alpha = 5% (type-I error probability) and 50% power, power increases to 76% if alpha is raised to 20% and the two error probabilities are roughly matched (20% vs. 24%).
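This trade-off can be illustrated with the pwr package, assuming a hypothetical between-subjects design with d = 0.4 and 49 participants per group; these inputs are chosen only to reproduce the approximate 50% and 76% figures in the text.

# Power of a two-sample t-test at alpha = .05 versus alpha = .20.
library(pwr)
pwr.t.test(d = 0.4, n = 49, sig.level = .05)$power   # ~.50
pwr.t.test(d = 0.4, n = 49, sig.level = .20)$power   # ~.75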

In sum, although I agree with FER that type-I and type-II errors are important, FER fail to mention how researchers should balance error rates and ignore the fact that the most urgent course of action is to increase power of individual studies.

PRINCIPLE 2

FER’s second principle is that neither type of error is “uniformly a greater threat to validity than the other type.”

Again, this is not controversial. In the early days of AIDS research, researchers and patients were willing to take greater risks in the hope that some medicine might work, even if the probability of a false positive result in a clinical trial was high. When it comes to saving money in the supply of drinking water, a false negative result (failing to detect that the cheaper water is less healthy than the more expensive water) is costly (of course, it is worse if it is well known that the cheaper water is toxic and politicians poison a population with it anyway).

A simple solution to this problem is to set the criterion value for an effect based on the implications of a type-I or a type-II error. However, in basic research no immediate actions have to be taken. The most common conclusion of a scientific article is that further research is needed. Moreover, researchers themselves can often conduct further research by running a follow-up study with more power. Therefore, it is understandable that the research community has been reluctant to increase the criterion for statistical significance from 5% to 20%.

An interesting exception might be a multiple study article where a 5% criterion for each study makes it very difficult to obtain significant results in each study (Schimmack, 2012). One could adopt a more lenient 20% criterion for individual studies. A two study paper would already have only a 4% probability to produce a type-I error if both studies yielded a significant result (.20 * .20 = .04).

In sum, FER’s second principle about type-I and type-II errors is not controversial, but FER do not explain how the importance of type-I and type-II errors should influence the way researchers conduct their research and report their result. Most important, they do not explain why it would be problematic to report all results honestly.

PRINCIPLE 3

FER’s third principle is that “any serious consideration of optimal scientific practice must contend with both types of error simultaneously.”

I have a hard time distinguishing between Principle 1 and Principle 3. Type-I and type-II errors are both a problem, and the problem of type-II errors in underpowered studies has been emphasized in a large literature on power with Jacob Cohen as the leading figure. FER seem to be unaware of this literature or have another reason not to cite it, which reflects poorly on their scholarship. The simple solution to this problem has been outlined by Cohen: conduct fewer statistical tests with higher statistical power. FER have nothing to add to this simple statistical truth. A researcher who spends his whole life collecting data and at the end of his career conducts a single statistical test that produces a significant result with p < .0001 is likely to have made a real discovery and has a low probability of reporting a false positive. In contrast, a researcher who publishes 100 statistical tests a year based on studies with low power will produce many false-negative results and many false-positive results.

This simple statistical truth implies that researchers have to make a choice. Do they want to invest their time and resources in many underpowered studies with many false positive and false negative results or do they want to invest their time and resources in a few high powered studies with few false positive and few false negative results?
Cohen advocated a slow and reliable approach when he said “less is more except for sample size.” FER fail to state where they stand because they started with the false premise that researchers can only trade one type of error against the other, overlooking that researchers can reduce both types of errors by conducting carefully planned studies with adequate power.

WHAT ABOUT HONESTY?

The most glaring omission in FER’s article is the lack of a discussion of dishonest reporting practices. Dishonest research practices are also called questionable research practices or p-hacking. Dishonest research practices make it difficult to distinguish researchers who conduct carefully planned studies with high power from those who conduct many underpowered studies. If these researchers reported all of their results honestly, it would be easy to tell the two types of researchers apart. However, dishonest research practices allow researchers with underpowered studies to hide their non-significant results. As a result, the published record shows mostly significant results for both types of researchers, but this published record does not provide relevant information about the actual type-I and type-II errors being committed by the two researchers. The researcher with few, high-powered studies has fewer unpublished non-significant results and a lower rate of published false-positive results. The researcher with many underpowered studies has a large file-drawer filled with non-significant results that contains many false-negative results (discoveries that could have been made but were not made because the resources were spread too thin) and a higher rate of false-positive results in the published record.

The problem is that a system that tolerates dishonest reporting of results benefits researchers with many underpowered studies because they can publish more (true or false) discoveries and the number of (true or false) discoveries is used to reward researchers with positions, raises, awards, and grant money.

The main purpose of open science is to curb dishonest reporting practices. Preregistration makes it difficult to present an unexpected significant result as if it had been predicted by a theory that was invented after the results were known. Sharing of data sets makes it possible to check whether alternative analyses would have produced non-significant results. And rules about disclosing all measures make it difficult to report only the measures that produced a desired outcome. The common theme of all of these initiatives is to increase honesty. Honest reporting of all the evidence (good or bad) is assumed to be a guiding principle of science, but it is not being enforced: reporting only 3 studies with significant results when 15 studies were conducted is not considered a violation of scientific integrity.

What has been changing in the past years is a growing awareness that dishonest reporting practices are harmful. Of course, FER do not make a positive case for dishonest reporting practices, and it would have been difficult to do so. However, they do present questionable arguments against recommendations that would curb questionable research practices and encourage honest reporting of results, based on the false argument that more honesty would increase the risk of type-II errors.

This argument is flawed because honest reporting of all results would provide an incentive for researchers to conduct more powerful studies that provide real support for a theory that can be reported honestly. Requirements to report all results honestly would also benefit researchers who conduct carefully planned studies with high power, which would reduce type-I and type-II error rates in the published literature. One might think everybody wins, but that is not the case. The losers in this new game would be researchers who have benefited from dishonest reporting practices.

CONCLUSION

FER’s article misrepresents the aims and consequences of the evidential value movement and fails to address the fundamental problem of allowing researchers to pick and choose the results that they want to report. The consequences of tolerating dishonest reporting practices became visible in the scandals that rocked social psychology in 2011: the Stapel debacle and the Bem bust. Social psychology has been called a sloppy science. If social psychology wants to (re)gain respect from other psychologists, scientists, and the general public, it is essential that social psychologists enforce a code of conduct that requires honest reporting of results.

It is telling that FER’s article appeared in the Interpersonal Relations and Group Processes section of the Journal of Personality and Social Psychology.  In the 2015 rankings of 106 psychology journals, JPSP:IRGP can be found near the bottom, at rank 99.  If relationship researchers take FER’s article as an excuse to resist changes in reporting practices, researchers may look toward other sciences (e.g., sociology) or other journals to learn about social relationships.

FER also fail to mention that new statistical developments have made it possible to distinguish researchers who conduct high-powered studies from those who use low-powered studies and report only significant results. These tools predict failures of replication in actual replication studies. As a result, the incentive structure is gradually changing, and it is becoming more rewarding to conduct carefully planned studies that can actually produce predicted results; in other words, to be a scientist.

FINAL WORDS

It is 2016, five years after the 2011 scandals that started the evidential value movement.  I did not expect to see so much change in such a short time. The movement is gaining momentum and researchers in 2016 have to make a choice. They can be part of the solution or they can remain part of the problem.

VERY FINAL WORDS

Some psychologists do not like the idea that the new world of social media allows me to write a blog that has not been peer-reviewed.  I think that social media have liberated science and encourage real debate.  I can only imagine what would have happened if I had submitted this blog as a manuscript to JPSP:IRGP for peer-review.  I am happy to respond to comments by FER or other researchers and I am happy to correct any mistakes that I have made in the characterization of FER’s article or in my arguments about power and error rates.  Comments can be posted anonymously.

Keep your Distance from Questionable Results

Expression of Concern

http://pss.sagepub.com/content/19/3/302.abstract
doi: 10.1111/j.1467-9280.2008.02084.x

Lawrence E. Williams and John A. Bargh

Williams and Bargh (2008) published the article “Keeping One’s Distance: The Influence of Spatial Distance Cues on Affect and Evaluation” in Psychological Science (doi: 10.1111/j.1467-9280.2008.02084.x)

As of August, 2015, the article has been cited 98 times in Web of Science.

The article reports four studies that appear to support the claim that priming individuals with the concept of spatial distance produced “greater enjoyment of media depicting embarrassment (Study 1), less emotional distress from violent media (Study 2), lower estimates of the number of calories in unhealthy food (Study 3), and weaker reports of emotional attachments to family members and hometowns (Study 4)”

However, a closer examination of the evidence suggests that the results of these studies were obtained with the help of questionable research methods that inflate effect sizes and the strength of evidence against the null-hypothesis (priming has no effect).

The critical test in the four studies was an Analysis of Variance that compared three experimental conditions.

The critical tests were:
F(2,67) = 3.14, p = .049, z = 1.96
F(2,39) = 4.37, p = .019, z = 2.34
F(2,56) = 3.36, p = .042, z = 2.03
F(2,81) = 4.97, p = .009, z = 2.60

The p-values can be converted into z-scores by taking the standard normal quantile of 1 − p/2. The z-scores of independent statistical tests should follow a normal distribution and have a variance of 1. Insufficient variation in z-scores suggests that the results of the four studies were influenced by questionable research practices.

The variance of z-scores is Var(z) = 0.08. A chi-square test against the expected variance of 1 is significant, Chi-Square(df = 3) = .26, left-tailed p = .033.
The article reports 100% significant results, but median observed power is only 59%. With an inflation of 41%, the Replicability-Index is 59-41 = 18.
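These numbers can be recomputed from the four z-scores with a short R sketch (TIVA first, then the R-Index); small deviations are due to rounding.

# TIVA and R-Index for the four critical tests.
z <- c(1.96, 2.34, 2.03, 2.60)
k <- length(z)
chi_sq <- (k - 1) * var(z)                    # ~.26
pchisq(chi_sq, df = k - 1)                    # left-tailed p ~.03
obs_power <- pnorm(z - qnorm(.975))           # observed power of each test
median(obs_power)                             # ~.59
median(obs_power) - (1 - median(obs_power))   # R-Index ~.18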

An R-Index of 18 is lower than the R-Index of 22, which would be obtained if the null-hypothesis were true and only significant results are reported. Thus, after correcting for inflation, the data provide no support for the alleged effect.

It is therefore not surprising that multiple replication attempts have failed to replicate the reported results. http://www.psychfiledrawer.org/chart.php?target_article=2

In conclusion, there is no credible empirical support for the theoretical claims in Williams and Bargh (2008) and the article should not be quoted as providing evidence for these claims.

 

Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, (21)7, p.1014-1020

This post was submitted as a comment to the R-Index Bulletin, but I think posting in the comment section of a blog reduces visibility. Therefore, I am reposting this contribution as a post. It is a good demonstration that article-based metrics can predict replication failures. Please consider submitting similar analyses to the R-Index Bulletin, or send me an email to post your findings anonymously or with author credit.

=================================================================

Too good to be true: A reanalysis of Damisch, Stoberock, and Mussweiler (2010). Keep Your Fingers Crossed! How Superstition Improves Performance. Psychological Science, (21)7, p.1014-1020

Preliminary note:
Test statistics of the t-tests on p.1016 (t(48) = 2.0, p < .05 and t(48) = 2.36, p < .03) were excluded from the following analyses as they served just as manipulation checks. The t-test reported on p.1017 (t(39) = 3.07, p < .01) was also excluded because mean differences in self-efficacy represent a mere exploratory analysis.

One statistical test reported a significant finding with F(2, 48) = 3.16, p < .05. However, computing the p-value with R gives a p-value of 0.051, which is above the criterion value of .05. For this analysis, the critical p-value was set to p = .055 to be consistent with the interpretation of the test as significant evidence in favor of the authors’ hypothesis.

R-Index analysis:
Success rate = 1
Mean observed power = 0.5659
Median observed power = 0.537
Inflation rate = 0.4341
R-Index = 0.1319

Note that, according to http://www.r-index.org/uploads/3/5/6/7/3567479/r-index_manual.pdf (p.7):
“An R-Index of 22% is consistent with a set of studies in which the null-hypothesis is true and a researcher reported only significant results”.

Furthermore, the test of insufficient variance (TIVA) was conducted.
Note that variances of z-values < 1 suggest bias. The chi2 test tests the H0 that variance = 1.
Results:
Variance = 0.1562
Chi^2(7) = 1.094; p = .007

Thus, the insufficient variance in z-scores of .156 suggests that it is extremely likely that the reported results overestimate the population effect and replicability of the reported studies.

It should be noted that the present analysis is consistent with earlier claims that these results are too good to be true based on Francis’s Test of Excessive Significance (Francis et al., 2014; http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0114255).

Finally, the study results were analyzed using p-curve (http://p-curve.com/):

Statistical Inference on p-curve:
Studies contain evidential value:
chisq(16) = 10.745; p = .825
Note that a significant p-value indicates that the p-curve is right-skewed, which indicates evidential value.

Studies lack evidential value:
chisq(16) = 36.16; p = .003
Note that a significant p-value indicates that the p-curve is flatter than one would expect if studies were powered at 33%, which indicates that the results have no evidential value.

Studies lack evidential value and were intensely p-hacked:
chisq(16) = 26.811; p = .044
Note that a significant p-value indicates that the p-curve is left-skewed, which indicates p-hacking/selective reporting.

All bias tests suggest that the reported results are biased. Consistent with these statistical results, a replication study failed to reproduce the original findings (see https://osf.io/fsadm/).

Because all studies were conducted by the same team of researchers the bias cannot be attributed to publication bias. Thus, it appears probable that questionable research practices were used to produce the observed significant results. A possible explanation might be that the authors ran multiple studies and reported just those that produced significant results.

In conclusion, researchers should be suspicious about the power of superstition or at least keep their fingers crossed when they attempt to replicate the reported findings.

A Revised Introduction to the R-Index

A draft of this manuscript was posted in December 2014 as a pdf file on http://www.r-index.org.   I have received several emails about the draft.  This revised manuscript does not include a comparison of different bias tests.  The main aim is to provide an introduction to the R-Index and to correct some misconceptions about the R-Index that have become apparent over the past year.

Please cite this post as:  Schimmack, U. (2016). The Replicability-Index: Quantifying Statistical Research Integrity.  https://wordpress.com/post/replication-index.wordpress.com/920

Author’s Note. I would like to thank Gregory Francis, Julia McNeil, Amy Muise, Michelle Martel, Elizabeth Page-Gould, Geoffrey MacDonald, Brent Donnellan, David Funder, Michael Inzlicht, and the Social-Personality Research Interest Group at the University of Toronto for valuable discussions, suggestions, and encouragement.

Abstract

Researchers are competing for positions, grant money, and status. In this competition, researchers can gain an unfair advantage by using questionable research practices (QRPs) that inflate effect sizes and increase the chances of obtaining stunning and statistically significant results. To ensure fair competition that benefits the greater good, it is necessary to detect and discourage the use of QRPs. To this aim, I introduce a doping test for science: the replicability index (R-Index). The R-Index is a quantitative measure of research integrity that can be used to evaluate the statistical replicability of a set of studies (e.g., the publications in a journal or an individual researcher’s publications).  A comparison of the R-Index for the Journal of Abnormal and Social Psychology in 1960 and the Attitudes and Social Cognition section of the Journal of Personality and Social Psychology in 2011 shows an increase in the use of QRPs. Like doping tests in sports, the availability of a scientific doping test should deter researchers from engaging in practices that advance their careers at the expense of everybody else. Demonstrating replicability should become an important criterion of research excellence that can be used by funding agencies and other stakeholders to allocate resources to research that advances science.
Keywords: Power, Publication Bias, Significance, Credibility, Sample Size, Questionable Research Methods, Replicability, Statistical Methods

INTRODUCTION

It has been known for decades that published results are likely to be biased in favor of authors’ theoretical inclinations (Sterling, 1959). The strongest scientific evidence for publication bias stems from a comparison of the rate of significant results in psychological journals with the statistical power of published studies. Statistical power is the long-run probability of obtaining a significant result when the null-hypothesis is false (Cohen, 1988). The typical statistical power of psychological studies has been estimated to be around 60% (Sterling, Rosenbaum, & Weinkam, 1995). However, the rate of significant results in psychological journals is over 90% (Sterling, 1959; Sterling et al., 1995). The discrepancy between these estimates reveals that published studies are biased: some findings may simply be false positive results, whereas other studies report inflated effect size estimates.

It has been overlooked that estimates of statistical power are also inflated by the use of questionable research methods. Thus, the commonly reported estimate that typical power in psychological studies is 60% is itself an inflated estimate of true power (Schimmack, 2012). If the actual power is less than 50%, a typical study in psychology has a larger probability of failing (producing a false negative result) than of succeeding (rejecting a false null-hypothesis). Conducting such low-powered studies is extremely wasteful. Moreover, few researchers have the resources to discard 50% of their empirical output. As a result, the incentive to use questionable research practices that inflate effect sizes is strong.

Not surprisingly, the use of questionable research practices is common (John et al., 2012). More than 50% of anonymous respondents admitted to selectively reporting dependent variables, dropping experimental conditions, or not reporting studies that did not support theoretical predictions. The widespread use of QRPs undermines the fundamental assumption of science that scientific theories have been subjected to rigorous empirical tests. In violation of this assumption, QRPs allow researchers to find empirical support for hypotheses even when these hypotheses are false.

The most dramatic example was Bem’s (2011) infamous evidence for time-reversed causality (e.g., studying after a test can improve test performance). Although Bem reported nine successful studies, subsequent studies failed to replicate this finding and raised concerns about the integrity of Bem’s studies (Schimmack, 2012). One possible explanation for false positive results is that a desirable outcome occurred by chance and a researcher mistook this fluke finding for evidence that a prediction was true. However, a fluke finding is unlikely to repeat itself in a series of studies. Statistically, it is highly improbable that Bem’s results are simple type-I errors because the chance of obtaining 9 out of 10 type-I errors with a probability of .05 is less than 1 in 53 billion (1 / 53,610,771,049). This probability is much smaller than the probability of winning the lottery (1 / 14 million). It is also unlikely that Bem simply failed to report studies with non-significant results, because he would have needed about 180 studies (9 x 20) to obtain 9 significant results, given that a type-I error rate of 5% implies that a significant result occurs, on average, once in every 20 studies. With sample sizes of about 100 participants in the reported studies, this would imply that Bem tested 18,000 participants. It is therefore reasonable to conclude that Bem used questionable research methods to produce his implausible and improbable results.
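The binomial probability cited above is easy to check in R; the line below computes the probability of exactly 9 significant results in 10 attempts when each attempt has only a 5% chance of producing a significant result.

# Probability of 9 type-I errors in 10 attempts at alpha = .05.
dbinom(9, size = 10, prob = .05)   # ~1.9e-11, on the order of 1 in 50 billion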

Although the publication of Bem’s article in a flagship journal of psychology was a major embarrassment for psychologists, it provided an opportunity to highlight fundamental problems in the way psychologists produce and publish empirical results. There have been many valuable suggestions and initiatives to increase the integrity of psychological science (e.g., Asendorpf et al., 2012). In this manuscript, I propose another solution to the problem of QRPs: I suggest that scientific organizations ban the use of questionable research practices, just like sports organizations ban the use of performance-enhancing substances. At present, scientific organizations only ban and punish outright manipulation of original data. However, excessive use of QRPs can produce fake results without fake data. As the ultimate products of an empirical science are the results of statistical analyses, it does not matter whether fake results were obtained with fake data or with questionable statistical analyses.  The use of QRPs therefore violates the ethical code of science that a researcher should base conclusions on an objective and unbiased analysis of empirical data. Dropping studies or dependent variables that do not support a hypothesis violates this code of scientific integrity.

Unfortunately, the world of professional sports also shows that doping bans are ineffective unless they are enforced by regular doping tests. Thus, a ban of questionable research practices needs to be accompanied by objective tests that can reveal the use of questionable research practices. The main purpose of this article is to introduce a statistical test that reveals the use of questionable research practices that can be used to enforce a ban of such practices. This quantitative index of research integrity can be used by readers, editors, and funding agencies to ensure that only rigorous empirical studies are published or funded.

The Replicability-index

The R-Index is based on power theory (Cohen, 1988). Statistical power is defined as the long-run probability of obtaining statistically significant results in a series of studies (see Schimmack, 2016, for more details). A study with 50% power is expected to produce 50 significant and 50 non-significant results in 100 attempts. In the short run, the actual number of significant results can underestimate or overestimate the true power of a study, but in an unbiased set of studies, the long-run percentage of significant results provides an unbiased estimate of average power (see Schimmack, 2016, for details on the meta-analysis of power). Importantly, in smaller sets of studies underestimation is as likely as overestimation. However, Sterling (1959) was the first to observe that scientific journals report more significant results than the actual power of the studies justifies. In other words, a simple count of significant results provides an inflated estimate of power.

A simple count of the percentage of significant results in journals would suggest that psychological studies have over 90% statistical power to reject the null-hypothesis. However, studies of typical power in psychology, based on sample sizes and a moderate effect size, suggest that the typical power of statistical tests in psychology is around 60% (Sedlmeier & Gigerenzer, 1989; see also Schimmack, 2016).

The discrepancy between these two estimates of power reveals a systematic bias, because the estimates should converge in the long run, and the discrepancy can be tested for statistical significance. Schimmack (2012) developed the incredibility index to examine whether a set of studies reported too many significant results. For example, the probability that 10 studies with 60% power produce at least 9 significant results (9 significant and 1 non-significant, or 10 significant) is p = .046 (binomial probability). The incredibility index uses 1 – p, so that higher values indicate that the reported success rate is incredible because there should have been more non-significant results. In this example, the incredibility index is 1 – .046 = .954. Such a result suggests that the reported results were selected to provide stronger evidence for a hypothesis than the full set of results would have provided; in other words, questionable research practices were used to produce the reported results.
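The binomial calculation in this example is easy to verify (a minimal sketch using scipy.stats):

```python
from scipy.stats import binom

# Probability that 10 studies with 60% power produce at least 9 significant results.
p = binom.sf(8, 10, 0.60)     # P(X >= 9)
print(round(p, 3))            # ~0.046
print(round(1 - p, 3))        # incredibility index ~0.954
```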

Some critics have argued that the incredibility index is flawed because it relies on observed effect sizes to estimate power. These power estimates are called observed power or post-hoc power, and statisticians have warned against the computation of observed power (Hoenig & Heisey, 2001). However, this objection misses the point because Hoenig and Heisey (2001) only examined the usefulness of computing observed power for a single statistical test (Schimmack, 2016). The problem with an observed power estimate for a single test is that the confidence interval around the estimate is so wide that it often covers nearly the full range of possible values, from the alpha criterion of significance to 1 (Schimmack, 2015). Such an estimate is not fundamentally flawed, but it is uninformative. In a meta-analysis of power estimates, however, sampling error decreases, the confidence interval around the power estimate shrinks, and the power estimate becomes more accurate and useful. Thus, a meta-analysis of studies can be used to estimate power and to compare the success rate (percentage of significant results) to the power estimate.
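For a single two-tailed test, observed power can be computed from the reported p-value; this is the quantity whose wide confidence interval makes it uninformative for a single study. A minimal sketch of the conversion:

```python
from scipy.stats import norm

def observed_power(p_value, alpha=0.05):
    """Observed (post-hoc) power for a two-tailed test, based on the z-score
    implied by the reported p-value."""
    z = norm.isf(p_value / 2)                  # z-score corresponding to the two-tailed p-value
    return norm.sf(norm.isf(alpha / 2) - z)    # probability of exceeding the critical value again

print(round(observed_power(0.05), 2))    # a just-significant result (p = .05) -> 0.50
print(round(observed_power(0.005), 2))   # a stronger result -> higher observed power (~0.80)
```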

The incredibility index computes a power estimate for each study and then averages these power estimates to obtain an estimate of average observed power. A binomial probability test is then used to compute the probability that a set of reported results contains at least as many significant results as were reported, that is, too few non-significant results.
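Putting the two steps together, the incredibility index for a set of reported results can be sketched as follows (the p-values are hypothetical and serve only as an illustration):

```python
import numpy as np
from scipy.stats import binom, norm

def observed_power(p_values, alpha=0.05):
    # Observed power implied by two-tailed p-values (vectorized version of the previous sketch).
    z = norm.isf(np.asarray(p_values) / 2)
    return norm.sf(norm.isf(alpha / 2) - z)

def incredibility_index(p_values, alpha=0.05):
    p_values = np.asarray(p_values)
    k_sig = int(np.sum(p_values < alpha))                 # number of significant results
    mean_power = float(np.mean(observed_power(p_values)))  # average observed power
    # Probability of at least k_sig significant results given the average observed power.
    prob = binom.sf(k_sig - 1, p_values.size, mean_power)
    return 1 - prob

# Hypothetical example: ten just-significant p-values look too good to be true.
p_vals = [0.04, 0.03, 0.045, 0.02, 0.04, 0.01, 0.035, 0.04, 0.03, 0.025]
print(round(incredibility_index(p_vals), 3))   # close to 1: too many significant results
```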

The R-Index builds on the incredibility index. One problem of the incredibility index is that probabilities provide no information about effect sizes. An incredibility index of 99% can be obtained with 10 studies that produced 10 significant results with an average observed power of 60%, or with 100 studies that produced 100% significant results with an average observed power of 95%. Evidently, an average observed power of 95% is very high, and the fact that one would expect only 95 significant results while 100 significant results were reported suggests only a small bias. In contrast, the discrepancy between 60% observed power and 100% reported successes is large. That the same incredibility index can be obtained for different amounts of bias is nothing special: probabilities are always a function of the magnitude of an effect (here, the discrepancy) and the amount of sampling error, which decreases with the number of studies. For this reason, it is important to complement probabilities with an effect size measure. For the incredibility index, the effect size is the difference between the success rate and the observed power estimate; in this example, the effect sizes are 100 – 60 = 40 vs. 100 – 95 = 5 percentage points. This effect size is called the inflation rate, because selective reporting inflates the success rate above observed power.
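Both scenarios can be checked numerically (a minimal sketch; the power values and study counts are those from the example above):

```python
from scipy.stats import binom

# Scenario A: 10 studies, average observed power 60%, 10 significant results.
ic_a = 1 - binom.sf(9, 10, 0.60)       # 1 - P(all 10 significant)
inflation_a = 100 - 60

# Scenario B: 100 studies, average observed power 95%, 100 significant results.
ic_b = 1 - binom.sf(99, 100, 0.95)     # 1 - P(all 100 significant)
inflation_b = 100 - 95

print(round(ic_a, 3), round(ic_b, 3))  # both ~0.994: the same incredibility index...
print(inflation_a, inflation_b)        # ...but very different inflation rates (40 vs. 5)
```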

In large sets of studies (e.g., an entire volume of a journal), the incredibility index is of little use because it will merely confirm the well-known presence of publication bias and QRPs, and its p-value depends on the number of tests in a journal. A journal with more articles and statistical tests would produce a more extreme value even if its studies, on average, have more power and are less biased. The inflation rate therefore provides a better measure of the integrity of reported results in a journal.

Another problem of the incredibility index is that observed power is not normally or symmetrically distributed. As a result, the average of observed power estimates is a biased estimate of average true power (Yuan & Maxwell, 2005; Schimmack, 2015). For example, when true power is close to the upper limit of 100%, observed power is more likely to underestimate than to overestimate true power. To overcome this problem, the R-Index uses the median to estimate true power. The median is unbiased because in each study the observed effect size is equally likely to underestimate or overestimate the true effect size, and therefore each power estimate is equally likely to fall below or above true power. Although the amounts of underestimation and overestimation are not symmetric, the direction of the error is equally likely in both directions. Simulations confirm that the median provides an unbiased estimate of true power even when power is high.
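The difference between the mean and the median of observed power can be illustrated with a short simulation (a sketch that assumes a simple two-tailed z-test; it is not the exact simulation reported by Yuan and Maxwell or Schimmack):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
z_crit = norm.isf(0.025)                 # 1.96 for alpha = .05, two-tailed

true_power = 0.90                        # high true power, where asymmetry matters
ncp = z_crit - norm.isf(true_power)      # noncentrality that yields this power

z_obs = rng.normal(ncp, 1, 100_000)      # observed z-scores in an unbiased set of studies
obs_power = norm.sf(z_crit - z_obs)      # observed power of each study

print(round(np.mean(obs_power), 3))      # ~0.82: the mean underestimates true power of .90
print(round(np.median(obs_power), 3))    # ~0.90: the median recovers true power
```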

Thus, the formula for the inflation in a set of studies is

Inflation = Percentage of Significant Results – Median Observed Power

Median observed power is an unbiased estimate of true power in an unbiased set of studies. However, if the set of studies is distorted by publication bias, median observed power is inflated. The index is still able to detect publication bias because the success rate increases faster than median observed power. For example, if true power is 50% but only significant results are reported (100% success rate), median observed power increases only to 75% (Schimmack, 2015).
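The 75% figure can be reproduced by simulating publication bias directly (a sketch under the same simple z-test assumptions as above; the selection step keeps only significant results):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
z_crit = norm.isf(0.025)

def median_observed_power_after_selection(true_power, n_studies=1_000_000):
    ncp = z_crit - norm.isf(true_power)        # noncentrality for the chosen true power
    z = rng.normal(ncp, 1, n_studies)
    z_sig = z[z > z_crit]                      # publication bias: only significant results survive
    return np.median(norm.sf(z_crit - z_sig))  # median observed power of the surviving studies

print(round(median_observed_power_after_selection(0.50), 2))  # ~0.75
print(round(median_observed_power_after_selection(0.30), 2))  # ~0.70 (cf. the next paragraph)
```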

The amount of inflation is inversely related to the actual power of a set of studies. When a set of studies includes only significant results (100% success rate), inflation is necessarily greater than 0 because power is never 100%. However, a median observed power of 95% implies only a small amount of inflation (5 percentage points), and actual power is close to the median observed power (about 94%). In contrast, a median observed power of 70% with a 100% success rate implies a large amount of bias, and true power is only about 30%. As a result, the true power of a set of studies increases with median observed power and decreases with the amount of inflation. The R-Index combines these two indicators by subtracting the inflation rate from median observed power.

R-Index = Median Observed Power – Inflation

As Inflation = Success Rate – Median Observed Power, the R-Index can also be expressed as a function of Success Rate and Median Observed Power

R-Index = Median Observed Power – (Success Rate – Median Observed Power)

or

R-Index = 2 * Median Observed Power – Success Rate

The R-Index can range from 0 to 100%. A value of 0 would require a median observed power of 50% with a success rate of 100%. However, this outcome should not occur with real data because significant results have a minimum observed power of 50% (a result that is just significant at p = .05 has 50% observed power). A median observed power of exactly 50% would require that at least half of the studies produced results exactly at the significance criterion, whereas sampling error should produce variation in observed power estimates; a fixed value or restricted variance is itself an indicator of bias (Schimmack, 2015). A more realistic lower limit for the R-Index is 22%. This value is obtained when the null-hypothesis is true (the population effect size is zero) and only significant results are reported (success rate = 100%). In this case, median observed power is 61%, the inflation rate is 39%, and the R-Index is 61 – 39 = 22. The maximum of 100% would be obtained if all studies had virtually 100% power and the success rate is 100%.
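The 22% lower bound can be derived in a few lines (a sketch that uses the fact that, under the null hypothesis, significant two-tailed p-values are uniformly distributed between 0 and .05, so the median significant p-value is .025):

```python
from scipy.stats import norm

alpha = 0.05
z_crit = norm.isf(alpha / 2)            # 1.96

median_sig_p = alpha / 2                # median significant p-value under H0 is .025
z_med = norm.isf(median_sig_p / 2)      # z-score of the median significant result (~2.24)
mop = norm.sf(z_crit - z_med)           # median observed power (~0.61)

success_rate = 1.0                      # only significant results are reported
inflation = success_rate - mop          # ~0.39
r_index = mop - inflation               # ~0.22

print(round(mop, 2), round(inflation, 2), round(r_index, 2))  # 0.61 0.39 0.22
```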

It is important to note that the R-Index is not an estimate of power. It is monotonically related to power, but an R-Index of 22% does not imply that a set of studies has 22% power. As noted earlier, an R-Index of 22% is obtained when the null-hypothesis is true, which produces only 5% significant results if the significance criterion is 5%. When power is less than 50%, the R-Index is conservative in the sense that its values are higher than true power; when power is greater than 50%, its values are lower than true power. However, for comparisons of journals, authors, and so on, rankings based on the R-Index will reflect rankings in terms of true power. Moreover, an R-Index below 50% implies that true power is less than 50%, which can be considered inadequate for most research questions.

Example 1:  Bem’s Feeling the Future

The first example uses Bem’s (2011) article to demonstrate the usefulness of computing an R-Index.


N d Obs.Pow Success
100 0.25 0.79 1
150 0.2 0.78 1
100 0.26 0.82 1
100 0.23 0.73 1
100 0.22 0.70 1
150 0.15 0.57 1
150 0.14 0.52 1
200 0.09 0.35 0
100 0.19 0.59 1
50 0.42 0.88 1

The median observed power is 71%, and the success rate is 90%. Accordingly, the inflation rate is 90 – 71 = 19%, and the R-Index is 71 – 19 = 52. An R-Index of 52 is higher than the 22% that is expected from a set of studies with no real effect and pure publication bias. However, it is not clear how other questionable research practices influence the R-Index, so the R-Index should not be used to infer that an effect is present merely because it exceeds 22%. The R-Index does suggest that Bem’s studies did not have the 80% power that he assumed in the planning of his studies. It also suggests that the nominal median effect size of d = .21 is inflated and that future studies should expect a smaller effect size. These predictions were confirmed in a set of replication studies (Galak et al., 2013). In short, an R-Index of about 50% raises concerns about the robustness of empirical results and shows that impressive success rates of 90% or more do not necessarily provide strong evidence for the existence of an effect.
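For transparency, the R-Index for this example can be recomputed from the observed power column of the table (a sketch; the small difference from the reported value of 52 reflects rounding median observed power to 71% before subtracting):

```python
import numpy as np

obs_power = [0.79, 0.78, 0.82, 0.73, 0.70, 0.57, 0.52, 0.35, 0.59, 0.88]
success   = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]   # 1 = significant result reported

mop = np.median(obs_power)            # 0.715, reported as 71%
success_rate = np.mean(success)       # 0.90
inflation = success_rate - mop        # ~0.185
r_index = 2 * mop - success_rate      # ~0.53 (52 in the text after rounding)

print(f"MOP={mop:.3f}, success={success_rate:.2f}, "
      f"inflation={inflation:.3f}, R-Index={r_index:.3f}")
```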

Example 2: The Many Labs Project

In the wake of the replicability crisis, researchers affiliated with the Open Science Framework have started to examine the replicability of psychological research with replication studies that reproduce the original studies as closely as possible. The first results emerged from the Many Labs project, in which an international team of researchers replicated 13 psychological studies in several laboratories. The main finding was that 10 of the 13 studies were successfully replicated across labs, a success rate of 77%. I computed the R-Index for the original studies. One study provided insufficient information to compute observed power, leaving 12 studies to be analyzed. The success rate for the original studies was 100% (one study had a marginally significant effect, p < .10, two-tailed). Median observed power was 86%. The inflation rate is 100 – 86 = 14, and the R-Index is 86 – 14 = 72. An R-Index of 72 thus suggests that a set of studies has a high probability of replicating; of course, a higher R-Index would be even better.

It is important to note that success in the Many Labs project was defined as a significant result in a meta-analysis across all labs with over 3,000 participants. The success rate would be lower if replication success were defined as a significant result in an exact replication study with the same statistical power (sample size) as the original study. Nevertheless, many of the results were replicated even with smaller sample sizes because the original studies examined large effects, had large samples, or both.

Conclusion

It has been widely recognized that questionable research practices threaten the foundations of science. This manuscript introduces the R-Index as a statistical tool to assess the replicability of published results. Results are replicable if the original studies had sufficient power to produce significant results. A study with 80% power is likely to produce a significant result in 80% of all attempts without the need for questionable research practices. In contrast, a study with 20% power produces significant results, and thereby inflated effect size estimates, in only 20% of all attempts; in all other cases, researchers have to hide failed attempts in file drawers or use questionable statistical practices to obtain significance. The R-Index reveals the presence of questionable research practices when observed power is lower than the rate of significant results. The R-Index has two components. The first is median observed power: studies with high power are more likely to replicate, so the R-Index increases with observed power. The second is the discrepancy between the percentage of significant results and observed power: the greater this discrepancy, the more questionable research practices have contributed to the reported successes and the more observed power overestimates true power.

References

Bem, D. J. (2011). Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407-425. doi: 10.1037/a0021524

Galak, J., LeBoeuf, R. A., Nelson, L. D., & Simmons, J. P. (2013). Correcting the past: Failures to replicate psi. Journal of Personality and Social Psychology.

Hoenig, J. M., & Heisey, D. M. (2001). The abuse of power: The pervasive fallacy of power calculations for data analysis. American Statistician, 55(1), 19-24. doi: 10.1198/000313001300339897

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth telling. Psychological Science, 23(5), 524-532. doi: 10.1177/0956797611430953

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551-566. doi: 10.1037/a0029487

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309-316.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance: Or vice versa. Journal of the American Statistical Association, 54(285), 30-34. doi: 10.2307/2282137

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. American Statistician, 49(1), 108-112.

2015 Replicability Ranking of 100+ Psychology Journals

Replicability rankings of psychology journals differ from traditional rankings based on impact factors (citation rates) and other measures of popularity and prestige. Replicability rankings use the test statistics in the results sections of empirical articles to estimate the average power of statistical tests in a journal. Higher average power means that the results published in a journal have a higher probability of producing a significant result in an exact replication study and a lower probability of being false positives.
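As a rough illustration of the underlying computation (a simplified sketch, not the full powergraph estimation, which also models the file drawer; the reported result t(58) = 2.50 is hypothetical):

```python
from scipy.stats import t, norm

def observed_power_from_t(t_value, df, alpha=0.05):
    """Convert a reported t-statistic into a two-tailed p-value, an equivalent
    z-score, and an observed power estimate (simplified sketch)."""
    p = 2 * t.sf(abs(t_value), df)             # two-tailed p-value
    z = norm.isf(p / 2)                        # equivalent z-score
    power = norm.sf(norm.isf(alpha / 2) - z)   # observed power
    return p, z, power

p, z, power = observed_power_from_t(2.50, 58)  # hypothetical reported result
print(round(p, 3), round(z, 2), round(power, 2))  # roughly 0.015, 2.4, 0.68
```

Averaging such power estimates across all significant tests in a journal gives the journal-level estimates reported in the table below.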

The rankings are based on statistically significant results only (p < .05, two-tailed) because only statistically significant results can be used to interpret a result as evidence for an effect and against the null-hypothesis.  Published non-significant results are useful for meta-analysis and follow-up studies, but they provide insufficient information to draw statistical inferences.

The average power across the 106 psychology journals used for this ranking is 70%. This means that exact replications of a representative sample of significant results from these journals would be expected to produce significant results in about 70% of cases. The 2015 rankings show variability across journals, with average power estimates ranging from 84% to 54%. A factor analysis of the annual estimates for 2010-2015 showed that random year-to-year variability accounts for about two thirds of the variance, while about one third is explained by stable differences between journals.

The journal names are linked to figures that show the powergraphs of each journal for the years 2010-2014 and for 2015. The figures provide additional information about the number of tests used, confidence intervals around the average estimate, and power estimates that take the file drawer into account by estimating power for non-significant results even when these are not reported.

Rank   Journal 2010/14 2015
1   Social Indicators Research   81   84
2   Journal of Happiness Studies   81   83
3   Journal of Comparative Psychology   72   83
4   International Journal of Psychology   80   81
5   Journal of Cross-Cultural Psychology   78   81
6   Child Psychiatry and Human Development   75   81
7   Psychonomic Bulletin and Review   72   80
8   Journal of Personality   72   79
9   Journal of Vocational Behavior   79   78
10   British Journal of Developmental Psychology   75   78
11   Journal of Counseling Psychology   72   78
12   Cognitive Development   69   78
13   JPSP: Personality Processes and Individual Differences   65   78
14   Journal of Research in Personality   75   77
15   Depression & Anxiety   74   77
16   Asian Journal of Social Psychology   73   77
17   Personnel Psychology   78   76
18   Personality and Individual Differences   74   76
19   Personal Relationships   70   76
20   Cognitive Science   77   75
21   Memory and Cognition   73   75
22   Early Human Development   71   75
23   Journal of Sexual Medicine   76   74
24   Journal of Applied Social Psychology   74   74
25   Journal of Experimental Psychology: Learning, Memory & Cognition   74   74
26   Journal of Youth and Adolescence   72   74
27   Social Psychology   71   74
28   Journal of Experimental Psychology: Human Perception and Performance   74   73
29   Cognition and Emotion   72   73
30   Journal of Affective Disorders   71   73
31   Attention, Perception and Psychophysics   71   73
32   Evolution & Human Behavior   68   73
33   Developmental Science   68   73
34   Schizophrenia Research   66   73
35   Archives of Sexual Behavior   76   72
36   Pain   74   72
37    Acta Psychologica   72   72
38   Cognition   72   72
39   Journal of Experimental Child Psychology   72   72
40   Aggressive Behavior   72   72
41   Journal of Social Psychology   72   72
42   Behaviour Research and Therapy   70   72
43   Frontiers in Psychology   70   72
44   Journal of Autism and Developmental Disorders   70   72
45   Child Development   69   72
46   Epilepsy & Behavior   75   71
47   Journal of Child and Family Studies   72   71
48   Psychology of Music   71   71
49   Psychology and Aging   71   71
50   Journal of Memory and Language   69   71
51   Journal of Experimental Psychology: General   69   71
52   Psychotherapy   78   70
53   Developmental Psychology   71   70
54   Behavior Therapy   69   70
55   Judgment and Decision Making   68   70
56   Behavioral Brain Research   68   70
57   Social Psychology and Personality Science   62   70
58   Political Psychology   75   69
59   Cognitive Psychology   74   69
60   Organizational Behavior and Human Decision Processes   69   69
61   Appetite   69   69
62   Motivation and Emotion   69   69
63   Sex Roles   68   69
64   Journal of Experimental Psychology: Applied   68   69
65   Journal of Applied Psychology   67   69
66   Behavioral Neuroscience   67   69
67   Psychological Science   67   68
68   Emotion   67   68
69   Developmental Psychobiology   66   68
70   European Journal of Social Psychology   65   68
71   Biological Psychology   65   68
72   British Journal of Social Psychology   64   68
73   JPSP: Attitudes & Social Cognition   62   68
74   Animal Behavior   69   67
75   Psychophysiology   67   67
76   Journal of Child Psychology and Psychiatry and Allied Disciplines   66   67
77   Journal of Research on Adolescence   75   66
78   Journal of Educational Psychology   74   66
79   Clinical Psychological Science   69   66
80   Consciousness and Cognition   69   66
81   The Journal of Positive Psychology   65   66
82   Hormones & Behavior   64   66
83   Journal of Clinical Child and Adolescent Psychology   62   66
84   Journal of Gerontology: Series B   72   65
85   Psychological Medicine   66   65
86   Personality and Social Psychology Bulletin   64   64
87   Infancy   61   64
88   Memory   75   63
89   Law and Human Behavior   70   63
90   Group Processes & Intergroup Relations   70   63
91   Journal of Social and Personal Relationships   69   63
92   Cortex   67   63
93   Journal of Abnormal Psychology   64   63
94   Journal of Consumer Psychology   60   63
95   Psychology of Violence   71   62
96   Psychoneuroendocrinology   63   62
97   Health Psychology   68   61
98   Journal of Experimental Social Psychology   59   61
99   JPSP: Interpersonal Relationships and Group Processes   60   60
100   Social Cognition   65   59
101   Journal of Consulting and Clinical Psychology   63   58
102   European Journal of Personality   72   57
103   Journal of Family Psychology   60   57
104   Social Development   75   55
105   Annals of Behavioral Medicine   65   54
106   Self and Identity   63   54