All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Replicability Ranking of Psychology Departments

Evaluations of individual researchers, departments, and universities are common and arguably necessary as science is becoming bigger. Existing rankings are based to a large extent on peer-evaluations. A university is ranked highly if peers at other universities perceive it to produce a steady stream of high-quality research. At present the most widely used objective measures rely on the quantity of research output and on the number of citations. These quantitative indicators of research quality are also heavily influenced by peers because peer-review controls what gets published, especially in journals with high rejection rates, and peers decide what research they cite in their own work. The social mechanisms that regulate peer-approval are unavoidable in a collective enterprise like science that does not have a simple objective measure of quality (e.g., customer satisfaction ratings, or accident rates of cars). Unfortunately, it is well known that social judgments are subject to many biases due to conformity pressure, self-serving biases, confirmation bias, motivated biases, etc. Therefore, it is desirable to complement peer-evaluations with objective indicators of research quality.

Some aspects of research quality are easier to measure than others. Replicability rankings focus on one aspect of research quality that can be measured objectively, namely the replicability of a published significant result. In many scientific disciplines such as psychology, a successful study reports a statistically significant result. A statistically significant result is used to minimize the risk of publishing evidence for an effect that does not exist (or even goes in the opposite direction). For example, a psychological study that shows the effectiveness of a treatment for depression is only useful if it reveals a real effect that can also be observed in other studies and in real patients when the treatment is used in practice.

In a science that produces thousands of results a year, it is inevitable that some of the published results are fluke findings (even Toyotas break down sometimes). To minimize the risk of false results entering the literature, psychology, like many other sciences, adopted a 5% error rate. By using 5% as the criterion, psychologists ensured that no more than 5% of results are fluke findings. With thousands of results published each year, this still means that more than 50 false results enter the literature each year. However, this is acceptable because a single study does not have immediate consequences. Only if these results are replicated in other studies do findings become the foundation of theories and influence practical decisions in therapy or in other applications of psychological findings (at work, in schools, or in policy). Thus, to outside observers it may appear safe to trust published results in psychology and to report about these findings in newspaper articles, popular books, or textbooks.

Unfortunately, it would be a mistake to interpret a significant result in a psychology journal as evidence that the result is probably true. The reason is that the published success rate in journals has nothing to do with the actual success rate in psychological laboratories. All insiders know that it is common practice to report only results that support a researcher’s theory. While outsiders may think of scientists as neutral observers (judges), insiders play the game of lobbyists, advertisers, and self-promoters. The game is to advance one’s theory, publish more than others, get more citations than others, and win more grant money than others. Honest reporting of failed studies does not advance this agenda. As a result, the fact that psychological studies report nearly exclusively success stories (Sterling, 1959; Sterling et al., 1995) tells outside observers nothing about the replicability of a published finding, and the true rate of fluke findings could be 100%.

This problem has been known for over 50 years (Cohen, 1962; Sterling, 1959). So it would be wrong to call the selective reporting of successful studies an acute crisis. However, what changed is that some psychologists have started to criticize the widely accepted practice of selective reporting of successful studies (Asendorpf et al., 2012; Francis, 2012; Simonsohn et al., 2011; Schimmack, 2012; Wagenmakers et al., 2011). Over the past five years, psychologists, particularly social psychologists, have been engaged in heated arguments over the so-called “replication crisis.”

One group argues that selective publishing of successful studies occurred, but without real consequences on the trustworthiness of published results. The other group argues that published results cannot be trusted unless they have been successfully replicated. The problem is that neither group has objective information about the replicability of published results.  That is, there is no reliable estimate of the percentage of studies that would produce a significant result again, if a representative sample of significant results published in psychology journals were replicated.

Evidently, it is not possible to conduct exact replication studies of all studies that have been published in the past 50 years. Fortunately, it is not necessary to conduct exact replication studies to obtain an objective estimate of replicability. The reason is that replicability of exact replication studies is a function of the statistical power of studies (Sterling et al., 1995). Without selective reporting of results, a 95% success rate is an estimate of the statistical power of the studies that achieved this success rate. Vice versa, a set of studies with average power of 50% is expected to produce a success rate of 50% (Sterling, et al., 1995).

Although selection bias renders success rates uninformative, the actual statistical results provide valuable information that can be used to estimate the unbiased statistical power of published results. Although selection bias inflates effect sizes and power, Brunner and Schimmack (forthcoming) developed and validated a method that can correct for selection bias. This method makes it possible to estimate the replicability of published significant results on the basis of the original reported results. This statistical method was used to estimate the replicability of research published by psychology departments in the years from 2010 to 2015 (see Methodology for details).

The averages for the 2010-2012 period (M = 59) and the 2013-2015 period (M = 61) show only a small difference, indicating that psychologists have not changed their research practices in accordance with the recommendations to improve replicability made in 2011 (Simonsohn et al., 2011). For most of the departments the confidence intervals for the two periods overlap (see attached powergraphs). Thus, the more reliable average across all years is used for the rankings, but the information for the two time periods is presented as well.

There are no obvious predictors of variability across departments. Private universities are at the top (#1, #2, #8), the middle (#24, #26), and at the bottom (#44, #47). European universities can also be found at the top (#4, #5), middle (#25) and bottom (#46, #51). So are Canadian universities (#9, #15, #16, #18, #19, #50).

There is no consensus on an optimal level of replicability. Cohen recommended that researchers should plan studies with 80% power to detect real effects. If 50% of studies tested real effects with 80% power and the other 50% tested a null-hypothesis (no effect = 2.5% probability to replicate a false result again), the estimated power for significant results would be 78%. The effect on average power is so small because most of the false predictions produce a non-significant result. As a result, only a few studies with low replication probability dilute the average power estimate. Thus, a value greater than 70 can be considered broadly in accordance with Cohen’s recommendations.
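To make the 78% figure concrete, here is a minimal sketch of the underlying mixture calculation in Python. The numbers are taken from the example above; counting a same-direction false positive as a 2.5% replication probability is the stated assumption, and the variable names are mine.

```python
# Average power of the *significant* results in a 50/50 mixture of
# real effects (80% power) and true nulls (2.5% same-direction false positives).
p_real, p_null = 0.5, 0.5          # proportions of real effects and true nulls
pow_real, pow_null = 0.80, 0.025   # probability of a significant same-direction result

# Each significant result "carries" its own replication probability, so weight the
# two replication probabilities by how often each type of study reaches significance.
p_sig = p_real * pow_real + p_null * pow_null
avg_power_of_sig = (p_real * pow_real * pow_real +
                    p_null * pow_null * pow_null) / p_sig
print(round(avg_power_of_sig, 2))  # ~0.78
```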

It is important to point out that these estimates are optimistic estimates of the success rate in actual replications of theoretically important effects. For a representative set of 100 studies (OSC, Science, 2015), Brunner and Schimmack’s statistical approach predicted a success rate of 54%, but the success rate in actual replication studies was only 37%. One reason for this discrepancy could be that the statistical approach assumes that the replication studies are exact, whereas actual replications always differ in some ways from the original studies, and this uncontrollable variability in experimental conditions poses an additional challenge for the replicability of psychological results. Before further validation research has been completed, the estimates can only be used as a rough estimate of replicability. However, the absolute accuracy of the estimates is not relevant for the relative comparison of psychology departments.

And now, without further ado, the first objective rankings of 51 psychology departments based on the replicability of published significant results. More departments will be added to these rankings as the results become available.

Rank University 2010-2015 2010-2012 2013-2015
1 U Penn 72 69 75
2 Cornell U 70 67 72
3 Purdue U 69 69 69
4 Tilburg U 69 71 66
5 Humboldt U Berlin 67 68 66
6 Carnegie Mellon 67 67 67
7 Princeton U 66 65 67
8 York U 66 63 68
9 Brown U 66 71 60
10 U Geneva 66 71 60
11 Northwestern U 65 66 63
12 U Cambridge 65 66 63
13 U Washington 65 70 59
14 Carleton U 65 68 61
15 Queen’s U 63 57 69
16 U Texas – Austin 63 63 63
17 U Toronto 63 65 61
18 McGill U 63 72 54
19 U Virginia 63 61 64
20 U Queensland 63 66 59
21 Vanderbilt U 63 61 64
22 Michigan State U 62 57 67
23 Harvard U 62 64 60
24 U Amsterdam 62 63 60
25 Stanford U 62 65 58
26 UC Davis 62 57 66
27 UCLA 61 61 61
28 U Michigan 61 63 59
29 Ghent U 61 58 63
30 U Waterloo 61 65 56
31 U Kentucky 59 58 60
32 Penn State U 59 63 55
33 Radboud U 59 60 57
34 U Western Ontario 58 66 50
35 U North Carolina Chapel Hill 58 58 58
36 Boston University 58 66 50
37 U Mass Amherst 58 52 64
38 U British Columbia 57 57 57
39 The University of Hong Kong 57 57 57
40 Arizona State U 57 57 57
41 U Missouri 57 55 59
42 Florida State U 56 63 49
43 New York U 55 55 54
44 Dartmouth College 55 68 41
45 U Heidelberg 54 48 60
46 Yale U 54 54 54
47 Ohio State U 53 58 47
48 Wake Forest U 51 53 49
49 Dalhousie U 50 45 55
50 U Oslo 49 54 44
51 U Kansas 45 45 44

 


Reported Success Rates, Actual Success Rates, and Publication Bias In Psychology: Honoring Sterling et al. (1995)

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). “Publication Decisions Revisited: The Effect of the Outcome of Statistical Tests on the Decision to Publish and Vice Versa”

When I discovered Sterling et al.’s (1995) article, it changed my life forever. I always had the suspicion that some articles reported results that are too good to be true. I also had my fair share of experiences where I tried to replicate an important finding to build on it, but only found out that I couldn’t replicate the original finding. Already skeptical by nature, I became increasingly uncertain what findings I could actually believe. Discovering the article by Sterling et al. (1995) helped me to develop statistical tests that make it possible to distinguish credible from incredible (a.k.a., not trustworthy) results (Schimmack, 2012), the R-Index (Schimmack, 2014), and Powergraphs (Schimmack, 2015).

In this post, I give a brief summary of the main points made in Sterling et al. (1995) and then show how my method builds on their work to provide an estimate of the actual success rate in psychological laboratories.

Research studies from 11 major journals demonstrate the existence of biases that favor studies that observe effects that, on statistical evaluation, have a low probability of erroneously rejecting the so-called null hypothesis (Ho). This practice makes the probability of erroneously rejecting Ho different for the reader than for the investigator. It introduces two biases in the interpretation of the scientific literature: one due to multiple repetition of studies with false hypothesis, and one due to failure to publish smaller and less significant outcomes of tests of a true hypothesis (Sterling et al., 1995, p. 108).

The main point of the article was to demonstrate that published results are biased. Several decades earlier, Sterling (1959) observed that psychology journals nearly exclusively publish support for theoretical predictions. The 1995 article showed that nothing had changed. It also showed that medical journals were more willing to publish non-significant results. The authors pointed out that publication bias has two negative consequences on scientific progress. First, false results that were published cannot be corrected because non-significant results are not published. However, when a false effect produces another false positive it can be published and it appears as if the effect was successfully replicated. As a result, false results can accumulate and science cannot self-correct and weed out false positives. Thus, a science that publishes only significant results is like a gambler who only remembers winning nights. It appears successful, but it is bankrupting itself in the process.

The second problem is that non-significant results do not necessarily mean that an effect does not exist. It is also possible that the study had insufficient statistical power to rule out chance as an explanation. If these non-significant results were published, they could be used by future researchers to conduct more powerful studies or to conduct meta-analyses. A meta-analysis uses the evidence from many small studies to combine the information into evidence from one large study, which makes it possible to detect small effects. However, if publication bias is present, a meta-analysis will always conclude that an effect is present because significant results are more likely to be included.
Both problems are important, but for psychology the first problem appeared to be a bigger problem because it published nearly exclusively significant results. There was no mechanism for psychology to correct itself until a recent movement started to question the credibility of published results in psychology.

The authors examined how often authors reported that a critical hypothesis test confirmed a theoretical prediction. The Table shows the results for the years 1986-87 and for the year 1958. The 1958 results are based on Sterling (1959).

 

Journal 1987 1958
Journal of Experimental Psychology 93% 99%
Comparative & Physiological Psychology 97% 97%
Consulting & Clinical Psychology 98% 95%
Personality and Social Psychology 96% 97%

 

The authors use the term “proportion of studies rejecting H0” to refer to the percentages of studies with significant results. I call it the success rate. A researcher who plans a study to confirm a theoretical prediction has a success when the study produces a significant result. When the result is not significant, the researcher cannot claim support for the hypothesis and the study is a failure. Failure does not mean that the study was not useful and that the result should not be published. It just means that the study does not provide sufficient support for a prediction.

Sterling et al. (1995) distinguish between the proportion of published studies rejecting H0 and the proportion of all conducted studies rejecting H0. I use the terms reported success rate and actual success rate. Without publication bias, the reported success rate and the actual success rate are the same. However, when publication bias or reporting bias is present, the reported success rate exceeds the actual success rate. A gambler might win on 45% of his trips to the casino, but he may tell his friends that he wins 90% of the time. This discrepancy reveals a reporting bias. Similarly, a researcher may have a success rate of 40% of studies (or statistical analyses if multiple analyses are conducted with one data set), but the published studies show a 95% success rate. The difference shows the effect of reporting bias.

A reported success rate of 95% in psychology journals seems high, but it does not automatically imply that there is publication bias. To make claims about the presence of publication bias it is necessary to find out what the actual success rate of psychological researchers is. When researchers run an analysis in their statistics software and look up the p-value for a result that will be used in a publication, how often is this p-value below the critical .05 value? How often do researchers go “Yeah” and “Got it” versus “S***” and move on to another significance test? [I’ve been there; I have done it.]

Sterling et al. (1995) provide a formula that can be used to predict the actual success rate of a researcher. Actually, Sterling et al. (1995) predicted the failure rate, but the formula can be easily modified to predict the success rate. I first present the original formula for the prediction of failure, but I spare readers the Greek notation.

FR = proportion of studies accepting H0
%H0 = proportion of studies where H0 is true
B = average type-II error probability (type-II error = non-significant result when H0 is false)
C = Criterion value for significance (typically p < .05, two-tailed, also called alpha)

FR = %H0 * (1 – C) + (1 – %H0) * B

The corresponding formula for the success rate is

SR = %H0 * C + (1 – %H0) * (1 – B)

In this equation, (1 – B) is the average probability to obtain a significant effect when an effect is present, which is known as statistical power (P). Substituting (1 – B) with P gives the formula

SR = %H0 * C + (1 – %H0) * P

This formula says that the success rate is a function of the criterion for significance (C), the proportion of studies where the null-hypothesis is true (%H0) and the average statistical power of studies when an effect is present.
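As a simple illustration, the success-rate formula can be written as a short function; this is just a direct transcription of the equation above, with variable names of my own choosing.

```python
def success_rate(prop_h0_true, alpha, avg_power):
    """Expected proportion of significant results (Sterling et al., 1995).

    prop_h0_true : proportion of tested hypotheses for which H0 is true (%H0)
    alpha        : significance criterion C (e.g., .05)
    avg_power    : average power P of the studies in which an effect exists
    """
    return prop_h0_true * alpha + (1 - prop_h0_true) * avg_power

print(success_rate(0.0, 0.05, 0.60))  # 0.6: when H0 is never true, SR equals average power
print(success_rate(0.5, 0.05, 0.80))  # 0.425: half true nulls drag the success rate down
```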

The problem with this formula is that the proportion of true null-effects is unknown or even unknowable. It is unknowable because the null-hypothesis is a point prediction of an effect size and even the smallest deviation from this point prediction invalidates H0 and H1 is true. H0 is true if the effect is exactly zero, but H1 is true if the effect is 0.00000000000000000000000000000001. And even if it were possible to demonstrate that the effect is smaller than 0.00000000000000000000000000000001, it is possible that the effect is 0.000000000000000000000000000000000000000001 and the null-hypothesis would still be false.

Fortunately, it is not necessary to know the proportion of true null-hypotheses to use Sterling et al.’s (1995) formula. Sterling et al. (1995) make the generous assumption that H0 is always false. Researchers may be wrong about the direction of a predicted effect, but a two-tailed significance test helps to correct this false prediction by showing a significant result in the opposite direction (a one-tailed test would not be able to do this). Thus, H0 is only true when the effect size is exactly zero, and it has been proposed that this is very unlikely. Eating a jelly bean a day may not have a noticeable effect on life expectancy, but can we be sure a priori that the effect is exactly 0? Maybe it extends or shortens life-expectancy by 5 seconds. This would not matter to lovers of jelly beans, but the null-hypothesis that the effect is zero would be false.

Even if the null-hypothesis is true in some cases, it is irrelevant because the assumption that it is always false is the best case scenario for a researcher, which makes the use of the formula conservative. The actual success rate can only be lower than the estimated success rate based on the assumption that all null-hypotheses are false. With this assumption, the formula is reduced to

SR = P

This means that the actual success rate is a function of the average power of studies. The formula also implies that an unbiased sample of studies provides an estimate of the average power of studies.

P = SR

Sterling et al. (1995) contemplate what a 95% success rate would mean if no publication bias were present.

“If we take this formula at face value, it suggests that only studies with high power are performed and that the investigators formulate only true hypothesis.”

In other words, a 95% success rate in the journals can only occur if the null-hypothesis is always false and researchers conduct studies with 95% power, or if the null-hypothesis is true in 5% of the studies and true power is 100%.

Most readers are likely to agree with Sterling et al. (1995) that “common experience tells us that such is unlikely.” Thus, publication bias is most likely to contribute to the high success rate. The really interesting question is whether it is possible to (a) estimate the actual success rate and (b) estimate the extent of publication bias.

Sterling et al. (1995) use post-hoc power analysis to obtain some estimate of the actual success rate.

“Now alpha is usually .05 or less, and beta while unknown and variable, is frequently .15-.75 (Hedges 1984). For example, if alpha = .05 and we take B = .2 as a conservative estimate, then the proportion of studies that should accept Ho is .95-.75 percent. Thus even if the null hypothesis is always false, we would expect about 20% of published studies to be unable to reject Ho.”

To translate: even with a conservative estimate that the type-II error rate is 20% (i.e., average power is 80%), 20% of published studies should report a non-significant result. Thus, the 95% reported success rate is inflated by publication bias by at least 15 percentage points.

One limitation of Sterling et al.’s (1995) article is that they do not provide a more precise estimate of the actual success rate.

There are essentially three methods to obtain estimates of the actual success rate. One could conduct a survey of researchers and ask them to report how often they obtain significant results in statistical analyses that are conducted for the purpose of publication. Nobody has tried to use this approach. I only heard some informal rumor that a psychologist compared his success rate to batting averages in baseball and was proud of a 33% success rate (a 33% batting average is a good average for hitting a small ball that comes at you at over 80mph).

The second approach would be to take a representative sample of theoretically relevant statistical tests (i.e., excluding statistical tests of manipulation checks or covariates) from published articles and to replicate these studies as closely as possible. The success rate in the replication studies provides an estimate of the actual success rate in psychology because the replication studies do not suffer from publication bias.

This approach was taken by the Open Science Collaboration (2015), although with a slight modification. The replication studies tried to replicate the original studies as closely as possible, but sample sizes differed from the original studies. As sample size has an influence on power, the success rate in the replication studies is not directly comparable to the actual success rate of the original studies. However, sample sizes were often increased and they were usually only decreased if the original study appeared to have very high power. As a result, the average power of the replication studies was higher than the average power of the original studies and the result can be considered an optimistic estimate of the actual success rate.

The study produced a success rate of 35% (95%CI = 25% to 45%). The study also showed different success rates for cognitive psychology (SR = 50%, 95%CI = 35% to 65%) and social psychology (SR = 25%, 95%CI = 14% to 36%).

The actual replication approach has a number of strengths and weaknesses. The strength is that actual replications not only estimate the actual success rate in original studies, but also test how robust these results are when an experiment is repeated. A replication study is never an exact replication of the original study, but a study that reproduces the core aspects of the original study should be able to reproduce the same result. A weakness of actual replication studies is that they may have failed to reproduce core aspects of the original experiment. Thus, it is possible to attribute non-significant results to problems with the replication study. If 20% of the replication studies suffered from this problem, the actual success rate in psychology would increase from 36% to 45%. The problem with this adjustment is that it is arbitrary, because it is impossible to know whether a replication study successfully reproduced the core aspects of an original experiment or not. Using significant results as the criterion would lead to a circular argument: a replication was successful only if it produced a significant result, which would lead to the absurd implication that only the 36% of replications that produced significant results were good replications and that the actual success rate is back to 100%.

A third approach is to use the results of the original article to estimate the actual success rate. The advantage of this method is that it uses the very same results that were used to report a 97% success rate. Thus, no mistakes in data collection can explain discrepancies.  The problem is to find a statistical method to correct for publication bias.  There have been a number of attempts to correct for publication bias in meta-analyses of a set of studies (see Schimmack, 2014, for a review). However, a shared limitation of these methods is that they assume that all studies have the same power or effect size. This assumption is obviously violated when the set of studies spans different designs and disciplines.

Brunner and Schimmack (forthcoming) developed a method that can estimate the average power for a heterogeneous set of results while controlling for publication bias. The method first transforms each reported statistical result into an absolute z-score. This z-score represents the strength of evidence against the null-hypothesis. If non-significant results are reported, they are excluded from the analysis because publication bias makes the reporting of non-significant results unreliable. In the OSF-reproducibility project nearly all reported results were significant and two were marginally significant; therefore this step is largely irrelevant here. The next step is to reproduce the observed distribution of significant z-scores as a function of several non-centrality parameters and weights. The non-centrality parameters are then converted into power, and the weighted average of power is the estimate of the actual average power of studies.
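To illustrate just the first step (the conversion of reported results into absolute z-scores, not the mixture-model estimation itself), here is a small sketch in Python. The helper functions are my own and are not the authors’ code.

```python
from scipy import stats

def t_to_z(t, df):
    """Convert a reported t-statistic into an absolute z-score via its two-tailed p-value."""
    p = 2 * stats.t.sf(abs(t), df)   # two-tailed p-value of the t-test
    return stats.norm.isf(p / 2)     # z-score with the same two-tailed p-value

def p_to_z(p):
    """Convert a reported two-tailed p-value into an absolute z-score."""
    return stats.norm.isf(p / 2)

print(round(p_to_z(0.05), 2))        # 1.96: a just-significant result
print(round(t_to_z(2.0, 1000), 2))   # ~2.0: with large df, t and z coincide
```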

In this method the distinction between true null hypothesis and true effects is irrelevant because true null effects cannot be distinguished from studies with very low power to detect small effects. As a result, the success rate is equivalent to the average power estimate. The figure below shows the distribution of z-scores for the replicated studies (on the right side).

Powergraph for OSF-Reproducibility-Project

The estimated actual success rate is 54%. A 95% confidence interval is obtained by running 500 bootstrap analyses. The 95%CI ranges from 37% to 67%. This confidence interval overlaps with the confidence interval for the success rate in the replication studies of the reproducibility project. Thus, the two methods produce convergent evidence that the actual success rate in psychological laboratories is somewhere between 30% and 60%. This estimate is also consistent with post-hoc power analyses for moderate effect sizes (Cohen, 1962).

It is important to note that this success rate only applies to statistical tests that are included in a publication when these tests produce a significant result. The selection of significant results also favors studies that actually had higher power and larger effect sizes, but researchers do not know a priori how much power their study has because the effect size is unknown. Thus, the power of all studies that are being conducted is even lower than the power estimated for the studies that produced significant results and were published. The powergraph analysis also estimates power for all studies, including the estimated file-drawer of non-significant results. The estimate is 30% with a 95%CI ranging from 10% to 57%.

The figure on the left addresses another problem of actual replications. A sample of 100 studies is a small sample and may not be representative because researchers focused on studies that are easy to replicate. The statistical analysis of original results does not have this problem. The figure on the left side used all statistical tests that were reported in the three target journals in 2008; the year that was used to sample studies for the reproducibility project.

Average power is in the same ballpark, but it is 10 percentage points higher than for the sample of replication studies, and the confidence interval does not overlap with the 95%CI for the success rate in the actual replication studies. There are two explanations for this discrepancy. One explanation is that the power of tests of critical conditions is lower than the power of all statistical tests, which can include manipulation checks or covariates. Another explanation could be that the sample of reproduced studies was not representative. Future research may help to explain the discrepancy.

Despite some inconsistencies, these results show that different methods can provide broadly converging evidence about the actual success rate in psychological laboratories. In stark contrast to reported success rates over 90%, the actual success rate is much lower and likely to be less than 60%. Moreover, this average glosses over differences in actual success rates in cognitive and social psychology. The success rate in social psychology is likely to be less than 50%.

CONCLUSION

Reported success rates in journals provide no information about the actual success rate when researchers conduct studies because publication bias dramatically inflates reported success rates. Sterling et al. (1995) showed that the actual success rate is equivalent to the power of studies when the null-hypothesis is always false. As a result, the success rate in an unbiased set of studies is an estimate of average power, and average power after correcting for publication bias is an estimate of the actual success rate before publication bias. The OSF-reproducibility project obtained an actual success rate of 36%. A bias-corrected estimate of the average power of the original studies produced an estimate of 54%. Given the small sample size of 100 studies, the confidence intervals overlap, and both methods provide converging evidence that the actual success rate in psychology laboratories is much lower than the reported success rate. The ability to estimate actual success rates from published results makes it possible to measure reporting bias, which may help to reduce it.

 

 

Are You Planning a 10-Study Article? You May Want to Read This First

Here is my advice for researchers who are planning to write a 10-study article.  Don’t do it.

And here is my reason why.

Schimmack (2012) pointed out the problem of conducting multiple studies to test a set of related hypotheses in a single article (a.k.a. multiple-study articles). The problem is that even a single study in psychology tends to have modest power to produce empirical support for a correct hypothesis (p < .05, two-tailed). This probability, called statistical power, is estimated to be 50% to 60% on average. When researchers conduct multiple hypothesis tests, the probability of obtaining a significant result in every test decreases exponentially. For 50% power, the probability that all tests provide a significant result halves with each study (.500, .250, .125, .063, etc.).

Schimmack (2012) used the term total power for the probability that a set of related hypothesis tests all produce significant results. Few researchers who plan multiple-study articles consider total power in the planning of their studies, and multiple-study articles do not explain how researchers deal with the likely outcome of a non-significant result. The most common practice is to simply ignore non-significant results and to report only results of studies that produced significant results. The problem with this approach is that the reported results overstate the empirical support for a theory, reported effect sizes are inflated, and researchers who want to build on these published findings are likely to end up with a surprising failure to replicate the original findings. A failed replication is surprising because the authors of the original article appeared to be able to obtain significant results in all studies. However, the reported success rate is deceptive and does not reveal the actual probability of a successful replication.

A number of statistical methods (TIVA, R-Index, P-Curve) have been developed to provide a more realistic impression of the strength and credibility of published results in multiple-study articles. In this post, I used these tests to examine the evidence in a 10-study article in Psychological Science by Adam Galinsky (Columbia Business School, Columbia University). I used this article because it is the article with the most studies in Psychological Science.

All 10 studies reported statistically significant results in support of the authors’ theoretical predictions. An a priori power analysis suggests that authors who aim to present evidence for a theory in 10 studies need 98% power (.80 raised to the power of 1/10) in each study to have an 80% probability of obtaining significant results in all studies.

Each study reported several statistical results. I focused on the first focal hypothesis test of each study to obtain statistical results for the examination of bias and evidential value. The p-value for each statistical test was converted into a z-score, z = inverse.normal(1 – p/2); the computations are sketched in code below the table.

Study N statistic p z obs.power
1 53 t(51)=2.71 0.009 2.61 0.74
2 61 t(59)=2.12 0.038 2.07 0.54
3 73 t(71)=2.78 0.007 2.7 0.77
4 33 t(31)=3.33 0.002 3.05 0.86
5 144 t(142)=2.04 0.043 2.02 0.52
6 83 t(79)=2.55 0.013 2.49 0.7
7 74 t(72)=2.24 0.028 2.19 0.59
8 235 t(233)=2.46 0.015 2.44 0.68
9 205 t(199)=3.85 <.001 3.78 0.97
10 109 t(104)=2.60 0.011 2.55 0.72
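For readers who want to verify the table, the p, z, and observed-power columns can be recomputed from the reported t-values and degrees of freedom. The sketch below is my own code, not the original analysis script; observed power is computed for a two-tailed z-test with alpha = .05.

```python
from scipy import stats

def row_stats(t, df, alpha=0.05):
    """Recompute p, absolute z, and observed power from a reported t-test."""
    p = 2 * stats.t.sf(abs(t), df)         # two-tailed p-value
    z = stats.norm.isf(p / 2)              # absolute z-score
    z_crit = stats.norm.isf(alpha / 2)     # 1.96 for alpha = .05
    obs_power = stats.norm.sf(z_crit - z)  # power of a z-test if the observed z were the true effect
    return round(p, 3), round(z, 2), round(obs_power, 2)

print(row_stats(2.71, 51))  # approximately (0.009, 2.61, 0.74), the Study 1 row
# The remaining rows can be checked the same way, up to rounding.
```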

 

TIVA

The Test of Insufficient Variance was used to examine whether the variation in z-scores is consistent with the amount of sampling error that is expected for a set of independent studies (Var = 1). The variance in z-scores is less than one would expect from a set of 10 independent studies, Var(z) = .27. The probability that this reduction of variance occurred just by chance is p = .02.

Thus, there is evidence that the perfect 10 for 10 rate of significant results was obtained by means of dishonest reporting practices. Either failed studies were not reported or significant results were obtained with undisclosed research methods. For example, given the wide variation in sample sizes, optional stopping may have been used to obtain significant results. Consistent with this hypothesis, there is a strong correlation between sampling error (se = 2/sqrt[N]) and effect size (Cohen’s d = t * se) across the 10 studies, r(10) = .88.
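A minimal TIVA calculation for the ten z-scores in the table, using a left-tailed chi-square test of the observed variance against the expected variance of 1 (my own sketch; it reproduces the values reported above up to rounding):

```python
import numpy as np
from scipy import stats

z = np.array([2.61, 2.07, 2.70, 3.05, 2.02, 2.49, 2.19, 2.44, 3.78, 2.55])

var_z = z.var(ddof=1)                    # sample variance of the z-scores, ~0.27
k = len(z)
chi2 = (k - 1) * var_z / 1.0             # expected variance under pure sampling error is 1
p_left = stats.chi2.cdf(chi2, df=k - 1)  # left tail: is the variance suspiciously small?
print(round(var_z, 2), round(p_left, 2)) # ~0.27, ~0.02
```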

R-INDEX

The median observed power for the 10 studies is 71%. Not a single study had the 98% observed power that is needed for 80% total power. Moreover, the 71% estimate is an inflated estimate of power because the success rate (100%) exceeds observed power (71%). After correcting for the inflation rate (100 – 71 = 29), the R-Index is 42%.

An R-Index of 42% is below 50%, suggesting that the true power of the studies is below 50% and that researchers who conduct an exact replication study are more likely to end up with a failure of replication than with a successful replication, despite the apparent ability of the original authors to obtain significant results in all reported studies.
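The R-Index computation itself is a one-liner; here is a sketch using the rounded observed-power values from the table (my own code; with unrounded inputs the result may shift by a point):

```python
import numpy as np

obs_power = np.array([0.74, 0.54, 0.77, 0.86, 0.52, 0.70, 0.59, 0.68, 0.97, 0.72])
success_rate = 1.0                       # 10 out of 10 focal tests were significant

median_power = np.median(obs_power)      # 0.71
inflation = success_rate - median_power  # 0.29
r_index = median_power - inflation       # R-Index = median observed power - inflation
print(round(median_power, 2), round(r_index, 2))  # 0.71, 0.42
```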

P-CURVE

A p-curve analysis shows that the results have evidential value, p = .02, using the conventional criterion of p < .05. That is, it is unlikely that these 10 significant results were obtained without a real effect in at least one of the ten studies. However, excluding the most high-powered test in Study 9 renders the p-curve results inconclusive, p = .11; that is, the hypothesis that the remaining 9 results were obtained without a real effect cannot be rejected at the conventional level of significance (p < .05).
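One way to implement the continuous right-skew test behind these numbers is to convert each significant p-value into a pp-value (p divided by .05) and combine the probit-transformed pp-values with Stouffer’s method. The sketch below is my own and may differ in details (e.g., sign conventions) from the published p-curve app, but it reproduces the two p-values reported above (≈ .02 and ≈ .11).

```python
import numpy as np
from scipy import stats

# Two-tailed p-values of the ten focal tests (Study 9 entered via its z-score of 3.78).
p = np.array([0.009, 0.038, 0.007, 0.002, 0.043, 0.013, 0.028, 0.015,
              2 * stats.norm.sf(3.78), 0.011])

def right_skew_test(p, alpha=0.05):
    """Under H0 (no effect), significant p-values are uniform on (0, alpha)."""
    pp = p / alpha                          # p-values conditional on being significant
    z = stats.norm.isf(pp)                  # probit transform; small pp -> large z
    stouffer_z = z.sum() / np.sqrt(len(p))  # Stouffer's method across the k tests
    return stats.norm.sf(stouffer_z)        # one-tailed p-value for evidential value

print(round(right_skew_test(p), 2))                # ~.02 with all ten tests
print(round(right_skew_test(np.delete(p, 8)), 2))  # ~.11 without the high-powered Study 9
```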

These results show that the empirical evidence in this article is weak despite the impressive number of studies. The reason is that neither the absolute number of significant results nor the reported rate of significant results is an indicator of the strength of evidence when non-significant results are not reported.

CONCLUSION

The statistical examination of this 10-study article reveals that the reported results are less robust than the 100% success rate suggests and that the reported results are unlikely to provide a complete account of the research program that generated the reported findings. Most likely, the researchers used optional stopping to increase their chances of obtaining significant results.

It is important to note that optional stopping is not necessarily a bad or questionable research practice. It is only problematic when the use of optional stopping is not disclosed. The reason is that optional stopping leads to biased effect size estimates and increases the type-I error probability, which invalidates the claim that results were significant at the nominal level that limits type-I error rates to 5%.

The results also highlight that the researchers were too ambitious in their goal to produce significant results in 10 studies. Even though their sample sizes are sometimes larger than the typical sample size in Psychological Science (N ~ 80), much larger samples would have been needed to produce significant results in all 10 studies.

It is also important to note that the article was published in 2013, when it was still common practice to exclude studies that failed to produce supporting evidence and to present results without full disclosure of the research methods used to produce them. Thus, the authors did not violate the ethical standards of scientific integrity at that time.

However, publication standards are changing. When journals require full disclosure of data and methods, researchers need to change the way they plan their studies.  There are several options for researchers to change their research practices.

First, they can reduce the number of studies so that each study has a larger sample size and higher power to produce significant results. Authors who wish to report results from multiple studies need to take total power into account. Eighty percent power in a single study is insufficient to produce significant results in multiple studies, and the power of each study needs to be adjusted accordingly (Total Power = Power ^ k, so the power needed in each study is Total Power ^ (1/k); ^ = raised to the power of, k = number of studies).
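A quick worked example of the total-power arithmetic (a sketch; the 10-study case matches the 98% figure used earlier in this post):

```python
# Total power = probability that all k studies in a set produce significant results.
# To reach a desired total power, each individual study needs power = total_power ** (1 / k).
k = 10
total_power_goal = 0.80
per_study_power = total_power_goal ** (1 / k)
print(round(per_study_power, 2))  # ~0.98: each of 10 studies needs about 98% power
print(round(0.98 ** k, 2))        # ~0.82: check that 98% per study gives roughly 80% total power
```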

Second, researchers can increase power by reducing the standard for statistical significance in a single study.  For example, it may be sufficient to claim support for a theory if 5 studies produced significant results with alpha = .20 (a 20% type-I error rate per study) because the combined type-I error rate decreases with the number of studies (total alpha = alpha ^ k).  Researchers can also conduct a meta-analysis of their individual studies to examine the total evidence across studies.

Third, researchers can specify a priori how many non-significant results they are willing to obtain and report. For example, researchers who plan 5 studies with 80% power can state that they expect one non-significant result. An honest set of results will typically produce a variance in accordance with sampling theory (var(z) = 1), median observed power would be 80%, and there would be no inflation (expected success rate of 80% – expected median power of 80% = 0). Thus, the R-Index would be 80 – 0 = 80.

In conclusion, there are many ways to obtain and report empirical results. There is only one way that is no longer an option, namely the selective reporting of results that support theoretical predictions. Statistical tests like TIVA, the R-Index, and P-Curve can reveal these practices and undermine the apparent value of articles that report many and only significant results. As a result, the incentive structure is changing (again*) and researchers need to think hard about the amount of resources they really need to produce empirical results in multiple studies.

Footnotes

* The multiple-study article is a unique phenomenon that emerged in experimental psychology in the 1990s. It was supposed to provide more solid evidence and to protect against type-I errors in single-study articles that presented exploratory results as if they confirmed theoretical predictions (HARKing). However, dishonest reporting practices made it possible to produce impressive results without increased rigor. At the same time, the allure of multiple-study articles crowded out research that took time or required extensive resources to conduct only a single study. As a result, multiple-study articles often report studies that are quick (take less than 1 hour to complete) and cost little (Mturk participants are paid less than $1) or nothing (undergraduate students receive course credit). Given the lack of real benefits and the detrimental effects on the quality of empirical studies, I expect a decline in the number of studies per article and an increase in the quality of individual studies.

Dr. R Expresses Concerns about Results in Latest Psychological Science Article by Yaacov Trope and colleagues

This morning a tweet by Jeff Rouder suggested taking a closer look at an online-first article published in Psychological Science.

When the Spatial and Ideological Collide

Metaphorical Conflict Shapes Social Perception

http://pss.sagepub.com/content/early/2016/02/01/0956797615624029.abstract

 


The senior author of the article is Yaacov Trope from New York University. The powergraph of Yaacov Trope suggests that the average significant result that is reported in an article is based on a study with 52% power in the years from 2000-2012 and 43% power in the recent years from 2013-2015. The difference is probably not reliable, but the results show no evidence that Yaacov Trope has changed research practices in response to criticism of psychological research practices over the past five years.


The average of 50% power for statistically significant results would suggest that every other test of a theoretical prediction produces a non-significant result. If, however, articles typically report that the results confirmed a statistical prediction, it is clear that dishonest reporting practices (excluding non-significant results or using undisclosed statistical methods like optional stopping) were used to present results that confirm theoretical predictions.

Moreover, the 50% estimate is an average. Power varies as a function of the strength of evidence, and power for just-significant results is lower than 50%. The range of z-scores from 2 to 2.6 approximately covers p-values in the range from .05 to .01 (just-significant results). Average power for p-values in this range can be estimated by examining the contributions of the red (less than 20% power), black (50% power), and green (85% power) densities. In both graphs the density in this area is fully covered by the red and black lines, which implies that power is a mixture of 20% and 50%, and thus less than 50%. Using the more reliable powergraph on the left, the red line (less than 20% power) covers a large portion of the area under the curve, suggesting that power for p-values between .05 and .01 is less than 33%.

The powergraph suggests that statistically significant results are only obtained with the help of random sampling error, that reported effect sizes are inflated, and that the probability of a false-positive result is high because in underpowered studies the ratio of true positives to false positives is low.

In the article, Trope and colleagues report four studies. Casual inspection would suggest that the authors conducted a rigorous program of research. They had relatively large samples (Ns = 239 to 410) and reported a priori power analyses that suggested they had 80% power to detect the predicted effects.

However, closer inspection with modern statistical methods for examining the robustness of results in a multiple-study article shows that the reported results cannot be interpreted at face value. To maintain statistical independence, I picked the first focal hypothesis test from each of the four studies.

Study N statistic p z obs.power
1 239 t(237)=2.06 0.04 2.05 0.54
2 391 t(389)=2.33 0.02 2.33 0.64
3 410 t(407)=2.13 0.03 2.17 0.58
4 327 t(325)=2.59 0.01 2.58 0.73

 

TIVA

TIVA examines whether a set of statistical results is consistent with the expected amount of sampling error. When test statistics are converted into z-scores, sampling error should produce a variance of 1. However, the observed variance in the four z-scores is Var(z) = .05. Even with just four observations, a left-tailed chi-square test shows that this reduction in variance would rarely occur by chance, p = .02. This finding is consistent with the powergraph, which shows reduced variance in z-scores because non-significant results that are predicted by the power analysis are not reported, or because significant results were obtained by violating sampling assumptions (e.g., undisclosed optional stopping).

R-INDEX

The table also shows that median observed power is only 61%, indicating that the a priori power analyses systematically overestimated power because they used effect sizes that were larger than the reported effect sizes. Moreover, the success rate in the four studies is 100%. When the success rate is higher than median observed power, actual power is even lower than observed power. To correct for this inflation in observed power, the R-Index subtracts the amount of inflation (100 – 61 = 39) from observed power. The R-Index is 61 – 39 = 22. Simulation studies show that an R-Index of 22 is obtained when the null-hypothesis is true (the predicted effect does not exist) and only significant results are being reported.

As it takes 20 studies to get 1 significant result by chance when the null-hypothesis is true, this model would imply that Trope and colleagues conducted another 4 * 20 – 4 = 76 studies with an average of 340 participants (a total of 25,973 participants) to obtain the significant results in their article. This is very unlikely. It is much more likely that Trope et al. used optional stopping to produce significant results.
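As a back-of-the-envelope check of this file-drawer arithmetic (a sketch assuming alpha = .05 and the four reported sample sizes):

```python
# If the null-hypothesis were true, about 1 in 20 tests (alpha = .05) would be
# significant, so 4 significant results imply roughly 4 / .05 = 80 attempts.
alpha = 0.05
k_significant = 4
total_attempts = k_significant / alpha        # 80
file_drawer = total_attempts - k_significant  # 76 unreported studies

n = [239, 391, 410, 327]                        # reported sample sizes
avg_n = sum(n) / len(n)                         # ~342 participants per study
print(file_drawer, round(file_drawer * avg_n))  # 76.0, 25973 participants in the file drawer
```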

Although the R-Index cannot reveal how the reported results were obtained, it does strongly suggest that these reported results will not be replicable. That is, other researchers who conduct the same studies with the same sample sizes are unlikely to obtain significant results, although Trope and colleagues reported getting significant results 4 out of 4 times.

P-Curve

TIVA and the R-Index show that the reported results cannot be trusted at face value and that the reported effect sizes are inflated. These tests do not examine whether the data provide useful empirical evidence. P-Curve examines whether the data provide evidence against the null-hypothesis after taking into account that the results are biased. P-Curve shows that the results in this article do not contain evidential value (p = .69); that is, after correcting for bias the results do not reject the null-hypothesis at the conventional p < .05 level.

Conclusion

Statisticians have warned psychologists for decades that only reporting significant results that support theoretical predictions is not science (Sterling, 1959). However, generations of psychologists have been trained to conduct research by looking for and reporting significant results that they can explain. In the past five years, a growing number of psychologists have realized the damage of this pseudo-scientific method for advancing understanding of human behavior.

It is unfortunate that many well-established researchers have been unable to change the way they conduct research and that the very same established researchers, in their roles as reviewers and editors, continue to let this type of research be published. It is even more unfortunate that these well-established researchers do not recognize the harm they are causing for younger researchers who end up with publications that tarnish their reputation.

After five years of discussion about questionable research practices, ignorance is no longer an excuse for engaging in these practices. If optional stopping was used, it has to be declared in the description of the sampling strategy. An article in a top journal is no longer a sure ticket to an academic job, if a statistical analysis reveals that the results are biased and do not contain evidential value.

Nobody benefits from empirical publications without evidential value. Why is it so hard to stop this nonsense?

Dr. R’s Blog about Replicability

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

For generalization, psychologists must finally rely, as has been done in all the older sciences, on replication (Cohen, 1994).

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DEFINITION OF REPLICABILITY:  In empirical studies with random error variance, replicability refers to the probability that a study with a significant result would produce a significant result again in an exact replication of the first study using the same sample size and significance criterion.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

2017 Blog Posts:

(October 24, 2017)
Preliminary 2017 Replicability Rankings of 104 Psychology Journals

(September 19, 2017)
Reexamining the experiment to replace p-values with the probability of replicating an effect

(September 4, 2017)
The Power of the Pen Paradigm: A Replicability Analysis

(August 2, 2017)
What would Cohen say: A comment on p < .005 as the new criterion for significance

(April 7, 2017)
Hidden Figures: Replication failures in the stereotype threat literature

(February 2, 2017)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
REPLICABILITY REPORTS:  Examining the replicability of research topics

RR No1. (April 19, 2016)  Is ego-depletion a replicable effect? 
RR No2. (May 21, 2016) Do mating primes have replicable effects on behavior?
RR No3. (September 4, 2017) The power of the pen paradigm: A replicability analysis

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

TOP TEN LIST


1.  Preliminary 2017  Replicability Rankings of 104 Psychology Journals
Rankings of 104 Psychology Journals according to the average replicability of a published significant result. Also includes detailed analysis of time trends in replicability from 2010 to 2017, and a comparison of psychological disciplines (cognitive, clinical, social, developmental, biological).


2.  Z-Curve: Estimating replicability for sets of studies with heterogeneous power (e.g., Journals, Departments, Labs)
This post presented the first replicability ranking and explains the methodology that is used to estimate the typical power of a significant result published in a journal.  The post provides an explanation of the new method to estimate observed power based on the distribution of test statistics converted into absolute z-scores.  The method has been developed further to estimate power for a wider range of z-scores by developing a model that allows for heterogeneity in power across tests.  A description of the new method will be published when extensive simulation studies are completed.


3. An Introduction to the R-Index
The R-Index can be used to predict whether a set of published results will replicate in a set of exact replication studies. It combines information about the observed power of the original studies with information about the amount of inflation in observed power due to publication bias (R-Index = Observed Median Power – Inflation). The R-Index has predicted the outcome of actual replication studies.


4.  The Test of Insufficient Variance (TIVA)
The Test of Insufficient Variance is the most powerful test of publication bias and/or dishonest reporting practices. It can be used even if only two independent statistical results are available, although power to detect bias increases with the number of studies. After converting test results into z-scores, z-scores are expected to have a variance of one.   Unless power is very high, some of these z-scores will not be statistically significant (z < 1.96, p > .05 two-tailed).  If these non-significant results are missing, the variance shrinks, and TIVA detects that the variance is insufficient.  The observed variance is compared against the expected variance of 1 with a left-tailed chi-square test. The usefulness of TIVA is illustrated with Bem’s (2011) “Feeling the Future” data.

5.  MOST VIEWED POST (with comment by Nobel Laureate Daniel Kahneman)
Reconstruction of a Train Wreck: How Priming Research Went off the Rails
This blog post examines the replicability of priming studies cited in Daniel Kahneman’s popular book “Thinking fast and slow.”   The results suggest that many of the cited findings are difficult to replicate.

6. How robust are Stereotype-Threat Effects on Women’s Math Performance?
Stereotype-threat has been used by social psychologists to explain gender differences in math performance. Accordingly, the stereotype that men are better at math than women is threatening to women and threat leads to lower performance.  This theory has produced a large number of studies, but a recent meta-analysis showed that the literature suffers from publication bias and dishonest reporting.  After correcting for these effects, the stereotype-threat effect was negligible.  This blog post shows a low R-Index for the first article that appeared to provide strong support for stereotype-threat.  These results show that the R-Index can warn readers and researchers that reported results are too good to be true.

7.  An attempt at explaining null-hypothesis testing and statistical power with 1 figure and 1500 words.   Null-hypothesis significance testing is old, widely used, and confusing. Many false claims have been used to suggest that NHST is a flawed statistical method. Others argue that the method is fine, but often misunderstood. Here I try to explain NHST and why it is important to consider power (type-II errors) using a picture from the free software GPower.


8.  The Problem with Bayesian Null-Hypothesis Testing
Some Bayesian statisticians have proposed Bayes-Factors to provide evidence for a Null-Hypothesis (i.e., there is no effect).  They used Bem’s (2011) “Feeling the Future” data to argue that Bayes-Factors would have demonstrated that extra-sensory perception does not exist.  This blog post shows that Bayes-Factors depend on the specification of the alternative hypothesis and that support for the null-hypothesis is often obtained by choosing an unrealistic alternative hypothesis (e.g., there is a 25% probability that effect size is greater than one standard deviation, d > 1).  As a result, Bayes-Factors can favor the null-hypothesis when there is an effect, but the effect size is small (d = .2).  A Bayes-Factor in favor of the null is more appropriately interpreted as evidence that the alternative hypothesis needs to decrease the probabilities assigned to large effect sizes. The post also shows that Bayes-Factors based on a meta-analysis of Bem’s data provide misleading evidence that an effect is present because Bayesian statistics do not take publication bias and dishonest reporting practices into account.

9. Hidden figures: Replication failures in the stereotype threat literature.  A widespread problem is that failed replication studies are often not published. This blog post shows that another problem is that failed replication studies are ignored even when they are published.  Selective publishing of confirmatory results undermines the credibility of science and claims about the importance of stereotype threat to explain gender differences in mathematics.

10. My journey towards estimation of replicability.  In this blog post I explain how I got interested in statistical power and replicability and how I developed statistical methods to reveal selection bias and to estimate replicability.

A Scathing Review of “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science”

J Pers Soc Psychol. 2015 Feb;108(2):275-97. doi: 10.1037/pspi0000007.
Best research practices in psychology: Illustrating epistemological and pragmatic considerations with the case of relationship science.
Finkel EJ, Eastwick PW, Reis HT.  

[link to free pdf]

The article “Best Research Practices in Psychology: Illustrating Epistemological and Pragmatic Considerations With the Case of Relationship Science” examines how social psychologists should respond to the crisis of confidence in the wake of the scandals that rocked social psychology in 2011 (i.e., the Stapel debacle and the Bem bust).
The article is written by prolific relationship researchers, Finkel, Eastwick, and Reis (FER), and is directed primarily at relationship researchers, but their article also has implications for social psychology in general. In this blog post, I critically examine FER’s recommendations for “best research practices.”

THE PROBLEM

FER and I are in general agreement about the problem. The goal of empirical science is to obtain objective evidence that can be used to test theoretical predictions. If the evidence supports a theoretical prediction, the theory that made this prediction gets to live another day. If the evidence does not support the prediction, the theory is challenged and may need to be revised. The problem is that scientists are not disinterested observers of empirical phenomena. Rather, they often have a vested interest in providing empirical support for a theory. Moreover, scientists have no obligation to report all of their data or statistical analyses. As a result, the incentive structure encourages self-serving selection of supportive evidence. While data fabrication is a punishable academic offense, dishonest reporting practices have been and are still being tolerated.

The 2011 scandals led to numerous calls to curb dishonest reporting practices and to encourage or enforce honest reporting of all relevant materials and results. FER use the term “evidential value movement” to refer to researchers who have proposed changes to research practices in social psychology.

FER credit the evidential value movement with changes in research practices such as (a) reporting how sample sizes were determined to ensure adequate power to demonstrate predicted effects, (b) avoiding the use of dishonest research practices that inflate the strength of evidence and effect sizes, and (c) encouraging publication of replication studies independent of their outcome (i.e., a study may actually fail to provide support for a hypothesis).

FER propose that these changes are not necessarily to the benefit of social psychology. To make their point, they introduce Neyman-Pearson’s distinction between type-I errors (a.k.a. false positives) and type-II errors (a.k.a. false negatives). A type-I error occurs when a researcher draws the conclusion that an effect exists, but no effect exists (e.g., a cold remedy shows a statistically significant result in a clinical trial, but it has no real effect). A type-II error occurs when an effect exists, but a study fails to show a statistically significant result (e.g., a cold remedy does reduce cold symptoms, but a clinical trial fails to show a statistically significant result).
By convention, the type-I error rate in social psychology is set at 5%. This means that, in the long run, no more than 5% of the results of independent statistical tests are false positives, and this maximum of 5% is reached only if all studies tested false hypotheses (i.e., they predicted an effect when no effect exists). As the number of true predictions increases, the actual rate of false-positive results decreases. If all hypotheses are true (the null-hypothesis that there is no effect is always false), the false-positive rate is 0 because it is impossible to make a type-I error. A maximum of 5% false-positive results has assured generations of social psychologists that most published results are likely to be true.
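To make this logic concrete, here is a minimal sketch (my own illustration, not part of FER’s argument) of how the expected false-positive rate among all honestly reported results depends on the proportion of tested hypotheses that are false:

```python
# Sketch: expected false-positive rate among ALL honestly reported results,
# as a function of the proportion of tested hypotheses that are false (true nulls).
ALPHA = 0.05

def false_positive_rate(prop_false_hypotheses, alpha=ALPHA):
    # Only tests of false hypotheses (true nulls) can produce type-I errors.
    return prop_false_hypotheses * alpha

for prop in (1.0, 0.5, 0.0):
    rate = false_positive_rate(prop)
    print(f"{prop:.0%} of hypotheses false -> {rate:.1%} of all results are false positives")
# 100% -> 5.0%, 50% -> 2.5%, 0% -> 0.0%
```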

Unlike the type-I error probability, which is set by convention, the type-II error probability is unknown because it depends on the unknown size of an effect. However, meta-analyses of actual studies can be used to estimate the typical type-II error probability in social psychology. In a seminal article, Cohen (1962) estimated that the type-II error rate is 50% for studies with a medium effect size. Power for studies of larger effects is higher and power for studies with smaller effects is lower. Actual power depends on the distribution of small, medium, and large effects, but 50% is a reasonable overall estimate. Cohen (1962) also proposed that a type-II error rate of 50% is unacceptably high and suggested that researchers should plan studies to reduce the type-II error rate to 20%. A common term for the complementary probability of avoiding a type-II error is power (Power = 1 – Prob. Type-II Error), and Cohen suggested that psychologists plan studies with 80% power to detect effects that actually exist.
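What Cohen’s recommendation implies in practice can be sketched with the standard normal approximation for a two-group comparison (the numbers below are my own illustration for a medium effect of d = .5 with alpha = .05, two-tailed):

```python
# Sketch: approximate per-group sample size for a two-sample comparison
# (normal approximation; exact t-test values are slightly larger).
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed critical value
    z_beta = norm.ppf(power)            # quantile for the desired power
    return 2 * (z_alpha + z_beta) ** 2 / d ** 2

print(round(n_per_group(d=0.5, power=0.80)))  # ~63 per group for Cohen's 80% power
print(round(n_per_group(d=0.5, power=0.50)))  # ~31 per group yields only 50% power
```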

WHAT ARE THE TYPE-I and TYPE-II ERROR RATES IN PSYCHOLOGY?

Assuming that researchers follow Cohen’s recommendation (a questionable assumption), FER write “the field has, in principle, been willing to accept false positives 5% of the time and false negatives 20% of the time.” They then state in parentheses that the “de facto false-positive and false-negative rates almost certainly have been higher than these nominal levels”.

In this parenthetical remark, FER hide the real problem that created the evidential value movement. The main point of the evidential value movement is that a type-I error probability of 5% tells us little about the false-positive rate (how many false-positive results are being published) when dishonest reporting practices are allowed (Sterling, 1959).

For example, if a researcher conducts 10 tests of a hypothesis, only one test obtains a significant result, and only the significant result is published, the probability of publishing at least one false-positive result increases from 5% to about 40% (1 – .95^10 = .40). Moreover, readers would be appropriately skeptical about a discovery that is matched by 9 failures to discover the same effect. In contrast, if readers only see the significant result, it seems as if the actual success rate is 100% rather than 10%. When only significant results are being reported, the 5% criterion no longer sets an upper limit and the real rate of false-positive results could be 100% (Sterling, 1959).
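This inflation is easy to verify with a few lines of code (a sketch of the scenario just described, assuming 10 independent tests of a true null hypothesis):

```python
# Sketch: chance of at least one reportable false positive when a false hypothesis
# is tested 10 times and only significant results are published.
ALPHA = 0.05
N_TESTS = 10

p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS
print(f"P(at least one false positive in {N_TESTS} tests) = {p_at_least_one:.0%}")  # ~40%
# Readers, however, see a 'success rate' of 100% because the 9 failures stay hidden.
```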

The main goal of the evidential value movement is to curb dishonest reporting practices. A major theme in the evidential value movement is that editors and reviewers should be more tolerant of non-significant results, especially in multiple-study articles that contain several tests of a theory (Schimmack, 2012). For example, in a multiple-study paper with five studies and 80% power, one of the five studies is expected to produce a type-II error even if the effect exists in all five studies. If power is only 50%, 2 or 3 studies should fail to provide statistically significant support for the hypothesis on their own.
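To put numbers on this expectation (my own sketch, following the logic of the incredibility argument in Schimmack, 2012):

```python
# Sketch: how likely is it that ALL studies in a multiple-study article are significant,
# assuming the effect is real and every study has the stated power?
def p_all_significant(n_studies, power):
    return power ** n_studies

print(f"5 studies at 80% power: {p_all_significant(5, 0.80):.0%} chance all are significant")  # ~33%
print(f"5 studies at 50% power: {p_all_significant(5, 0.50):.0%} chance all are significant")  # ~3%
print(f"Expected non-significant studies: {5 * 0.20:.1f} at 80% power, {5 * 0.50:.1f} at 50% power")
```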

Traditionally, authors excluded these studies from their multi-study articles and all studies provided support for their hypothesis. To reduce this dishonest reporting practice, editors should focus on the total evidence and allow for non-significant results in one or two studies. If four out of five studies produce a significant result, there is strong evidence for a theory and the evidence is stronger if all five studies are reported honestly.

Surprisingly, FER write that this change in editorial policy will “not necessarily alter the ratio of false positive to false negative errors” (p. ). This statement makes no sense because reporting non-significant results that were previously hidden in file-drawers would reduce the percentage of type-I errors (relative to all published results) and increase the percentage of type-II errors that are being reported (because many non-significant results in underpowered studies are type-II errors). Thus, more honest reporting of results would increase the percentage of reported type-II errors, and FER confuse readers if they suggest that this is not the case.
Even more problematic is FER’s second scenario. In this scenario, researchers continue to conduct studies with low power (50%) and submit manuscripts with multiple studies in which half the studies show statistically significant results and the other half do not, and editors reject these articles because they do not provide strong support for the hypothesis in all studies. FER anticipate that we would “see a marked decline in journal acceptance rates”. However, FER fail to mention a simple solution to this problem. Researchers could (and should) combine the resources needed to produce five studies with 50% power to conduct one study that has a high probability of being successful (Schimmack, 2012). As a result, both the type-I error rate and the type-II error rate would decrease. The type-I error rate would decrease because fewer tests are being conducted (e.g., conducting 10 studies to get 5 significant results doubles the probability of obtaining a significant result even if no effect exists). The type-II error rate would decrease because researchers have more power to show the predicted effect without the use of dishonest research practices.
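A rough sketch (my own numbers, again using the normal approximation and an assumed true effect of d = .5) illustrates the gain from pooling the resources of five 50%-power studies into a single study:

```python
# Sketch: power of one pooled study vs. a single small study with ~50% power,
# for a two-group design and an assumed true effect of d = .5 (illustrative only).
from scipy.stats import norm

def power_two_sample(d, n_per_group, alpha=0.05):
    z_alpha = norm.ppf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5   # approximate non-centrality parameter
    return 1 - norm.cdf(z_alpha - ncp)

n_small = 31  # roughly 50% power per study at d = .5, alpha = .05
print(f"One small study (n = {n_small} per group): {power_two_sample(0.5, n_small):.0%}")        # ~50%
print(f"One pooled study (n = {5 * n_small} per group): {power_two_sample(0.5, 5 * n_small):.0%}")  # ~99%
```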

Alternatively, researchers can continue to conduct and report multiple underpowered studies, but abandon the elusive goal of finding significant results in each study. Instead, they could ignore significance tests of individual studies and conduct inferential statistical tests in a meta-analysis of all studies (Schimmack, 2012). The consequences for type-I and type-II error rates are the same as if researchers had conducted a single, more powerful study. Both approaches reduce type-I and type-II error rates because they reduce the number of statistical tests.

Based on their flawed reasoning, FER come to the wrong conclusion when they state “our point here is not that heightened stringency regarding false-positive rates is bad, but rather that it will almost certainly increase false-negative rates, which renders it less than an unmitigated scientific good.”

As demonstrated above, this statement is false because a reduction in the number of statistical tests and an increase in the power of each individual test reduce the risk of type-I errors and decrease the probability of making a type-II error (i.e., a false-negative result).

WHAT IS AN ERROR-BALANCED APPROACH?

Because FER start from a false premise, the recommendations for best practices that they derive from it are questionable.  In fact, it is not even clear what their recommendations are when they introduce their error-balanced approach, which is supposed to rest on three principles.

PRINCIPLE 1

The first principle is that both false positives and false negatives undermine the superordinate goals of science.

This principle is hardly controversial. It is problematic if a study shows that a drug is effective when the drug is actually not effective, and it is problematic if an underpowered study fails to show that a drug is actually effective. FER fail to mention a long list of psychologists, including Jacob Cohen, who have tried to change psychologists’ indifferent attitude towards non-significant results and the persistent practice of conducting underpowered studies that provide ample opportunity for multiple statistical tests, so that at least one statistically significant result will emerge that can be used for a publication.

As noted earlier, the type-I error probability for a single statistical test is set at a maximum of 5%, but estimates of the type-II error probability are around 50%, a ten-fold difference. Cohen and others have advocated increasing power to 80%, which would reduce the type-II error risk to 20%. This would still imply that type-I errors are considered more harmful than type-II errors by a ratio of 1:4 (5% vs. 20%).
Yet FER do not recommend increasing statistical power, which would imply that the type-II error rate remains at 50%. The only other way to balance the two error rates would be to increase the type-I error rate, for example to 20%. Because power increases when the significance criterion becomes more liberal, this approach would also decrease the risk of type-II errors: results that were not significant at the stricter criterion become significant, at the risk that more of these significant results are false positives.  In a between-subjects design with alpha = 5% (type-I error probability) and 50% power, power increases to 76% if alpha is raised to 20%, and the two error probabilities are roughly matched (20% vs. 24%).
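These numbers can be checked with a short sketch (a normal-approximation calculation under the stated assumption that the design has exactly 50% power at alpha = .05, two-tailed; the figures come out within a percentage point of those above):

```python
# Sketch: effect of raising alpha from .05 to .20 in a design with 50% power at alpha = .05.
from scipy.stats import norm

# 50% power at alpha = .05 implies the non-centrality equals the critical value (1.96).
ncp = norm.ppf(1 - 0.05 / 2)

def power(alpha, ncp=ncp):
    return 1 - norm.cdf(norm.ppf(1 - alpha / 2) - ncp)

for alpha in (0.05, 0.20):
    print(f"alpha = {alpha:.2f}: power = {power(alpha):.0%}, type-II error = {1 - power(alpha):.0%}")
# alpha = .05 -> 50% power; alpha = .20 -> ~75% power, ~25% type-II error
```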

In sum, although I agree with FER that type-I and type-II errors are important, FER fail to mention how researchers should balance error rates and ignore the fact that the most urgent course of action is to increase the power of individual studies.

PRINCIPLE 2

FER’s second principle is that neither type of error is “uniformly a greater threat to validity than the other type.”

Again, this is not controversial. In the early days of AIDS research, researchers and patients were willing to take greater risks in the hope that some medicine might work even if the probability of a false positive result in a clinical trial was high. When it comes to saving money in the supply of drinking water, a false negative result that the cheaper water is as healthy as the more expensive water is costly (of course, it is worse if it is well known that the cheaper water is toxic and politicians poison a population with toxic water).

A simple solution to this problem is to set the criterion for statistical significance based on the implications of a type-I or a type-II error. However, in basic research no immediate actions have to be taken. The most common conclusion of a scientific article is that further research is needed. Moreover, researchers themselves can often conduct further research by conducting a follow-up study with more power. Therefore, it is understandable that the research community has been reluctant to increase the criterion for statistical significance from 5% to 20%.

An interesting exception might be a multiple-study article, where a 5% criterion for each study makes it very difficult to obtain significant results in every study (Schimmack, 2012). One could adopt a more lenient 20% criterion for individual studies. A two-study paper would then have only a 4% probability of producing a type-I error in both studies (.20 * .20 = .04).

In sum, FER’s second principle about type-I and type-II errors is not controversial, but FER do not explain how the importance of type-I and type-II errors should influence the way researchers conduct their research and report their results. Most importantly, they do not explain why it would be problematic to report all results honestly.

PRINCIPLE 3

FER’s third principle is that “any serious consideration of optimal scientific practice must contend with both types of error simultaneously.”

I have a hard time distinguishing between principle 1 and principle 3. Type-I and type-II errors are both a problem, and the problem of type-II errors in underpowered studies has been emphasized in a large literature on power, with Jacob Cohen as the leading figure; yet FER seem to be unaware of this literature or have another reason not to cite it, which reflects poorly on their scholarship. The simple solution to this problem has been outlined by Cohen: conduct fewer statistical tests with higher statistical power. FER have nothing to add to this simple statistical truth. A researcher who spends his whole life collecting data and at the end of his career conducts a single statistical test and finds a significant result with p < .0001 is likely to have made a real discovery and has a low probability of reporting a false-positive result. In contrast, a researcher who publishes 100 statistical tests a year based on studies with low power will produce many false-negative results and many false-positive results.

This simple statistical truth implies that researchers have to make a choice. Do they want to invest their time and resources in many underpowered studies with many false-positive and false-negative results, or do they want to invest their time and resources in a few high-powered studies with few false-positive and few false-negative results?
Cohen advocated a slow and reliable approach when he said “less is more except for sample size.” FER fail to state where they stand because they started from the false premise that researchers can only balance the two types of errors, without noticing that researchers can reduce both types of errors by conducting carefully planned studies with adequate power.

WHAT ABOUT HONESTY?

The most glaring omission in FER’s article is the lack of a discussion of dishonest reporting practices. Dishonest research practices are also called questionable research practices or p-hacking. Dishonest research practices make it difficult to distinguish researchers who conduct carefully planned studies with high power from those who conduct many underpowered studies. If these researchers reported all of their results honestly, it would be easy to tell the two types of researchers apart. However, dishonest research practices allow researchers with underpowered studies to hide their false-negative results. As a result, the published record shows mostly significant results for both types of researchers, but this published record does not provide relevant information about the actual type-I and type-II errors being committed by the two researchers. The researcher with few, high-powered studies has fewer unpublished non-significant results and a lower rate of published false-positive results. The researcher with many underpowered studies has a large file-drawer filled with non-significant results that contains many false-negative results (discoveries that could have been made but were not made because the resources were spread too thin) and a higher rate of false-positive results in the published record.

The problem is that a system that tolerates dishonest reporting of results benefits researchers with many underpowered studies because they can publish more (true or false) discoveries and the number of (true or false) discoveries is used to reward researchers with positions, raises, awards, and grant money.

The main purpose of open science is to curb dishonest reporting practices. Preregistration makes it difficult to present an unexpected significant result as if it had been predicted by a theory that was invented post hoc, after the results were known. Sharing of data sets makes it possible to check whether alternative analyses would have produced non-significant results. And rules about disclosing all measures make it difficult to report only measures that produced a desired outcome. The common theme of all of these initiatives is to increase honesty. Honest reporting of all evidence (good or bad) is assumed to be a guiding principle in science, but it is not being enforced: reporting only 3 studies with significant results when 15 studies were conducted is not considered a violation of scientific integrity.

What has been changing in the past years is a growing awareness that dishonest reporting practices are harmful. Of course, it would have been difficult for FER to make a case for dishonest reporting practices, and they do not make a positive case for them. However, they do present questionable arguments against recommendations that would curb questionable research practices and encourage honest reporting of results, based on the false argument that more honesty would increase the risk of type-II errors.

This argument is flawed because honest reporting of all results would provide an incentive for researchers to conduct more powerful studies that provide real support for a theory that can be reported honestly. Requirements to report all results honestly would also benefit researchers who conduct carefully planned studies with high power, which would reduce type-I and type-II error rates in the published literature. One might think everybody wins, but that is not the case. The losers in this new game would be researchers who have benefited from dishonest reporting practices.

CONCLUSION

FER’s article misrepresents the aims and consequences of the evidential value movement and fails to address the fundamental problem of allowing researchers to pick and choose the results that they want to report. The consequences of tolerating dishonest reporting practices became visible in the scandals that rocked social psychology in 2011: the Stapel debacle and the Bem bust. Social psychology has been called a sloppy science. If social psychology wants to (re)gain respect from other psychologists, scientists, and the general public, it is essential that social psychologists enforce a code of conduct that requires honest reporting of results.

It is telling that FER’s article appeared in the Interpersonal Relations and Group Processes section of the Journal of Personality and Social Psychology.  In the 2015 rankings of 106 psychology journals, JPSP:IRGP can be found near the bottom, at rank 99.  If relationship researchers take FER’s article as an excuse to resist changes in reporting practices, researchers may look towards other sciences (e.g., sociology) or other journals to learn about social relationships.

FER also fail to mention that new statistical developments have made it possible to distinguish researchers who conduct high-powered studies from those who use low-powered studies and report only significant results. These tools predict failures of replication in actual replication studies. As a result, the incentive structure is gradually changing, and it is becoming more rewarding to conduct carefully planned studies that can actually produce predicted results, or, in other words, to be a scientist.

FINAL WORDS

It is 2016, five years after the 2011 scandals that started the evidential value movement.  I did not expect to see so much change in such a short time. The movement is gaining momentum and researchers in 2016 have to make a choice. They can be part of the solution or they can remain part of the problem.

VERY FINAL WORDS

Some psychologists do not like the idea that the new world of social media allows me to write a blog that has not been peer-reviewed.  I think that social media have liberated science and encourage real debate.  I can only imagine what would have happened if I had submitted this blog as a manuscript to JPSP:IRGP for peer-review.  I am happy to respond to comments by FER or other researchers and I am happy to correct any mistakes that I have made in the characterization of FER’s article or in my arguments about power and error rates.  Comments can be posted anonymously.

Keep your Distance from Questionable Results

Expression of Concern

http://pss.sagepub.com/content/19/3/302.abstract
doi: 10.1111/j.1467-9280.2008.02084.x

Lawrence E. Williams and
John A. Bargh

Williams and Bargh (2008) published the article “Keeping One’s Distance: The Influence of Spatial Distance Cues on Affect and Evaluation” in Psychological Science (doi: 10.1111/j.1467-9280.2008.02084.x)

As of August, 2015, the article has been cited 98 times in Web of Science.

The article reports four studies that appear to support the claim that priming individuals with the concept of spatial distance produced “greater enjoyment of media depicting embarrassment (Study 1), less emotional distress from violent media (Study 2), lower estimates of the number of calories in unhealthy food (Study 3), and weaker reports of emotional attachments to family members and hometowns (Study 4)”

However, a closer examination of the evidence suggests that the results of these studies were obtained with the help of questionable research methods that inflate effect sizes and the strength of evidence against the null-hypothesis (priming has no effect).

The critical test in the four studies was an Analysis of Variance that compared three experimental conditions.

The critical tests were:
F(2,67) = 3.14, p = .049, z = 1.96
F(2,39) = 4.37, p = .019, z = 2.34
F(2,56) = 3.36, p = .042, z = 2.03
F(2,81) = 4.97, p = .009, z = 2.60

The p-values can be converted into z-scores with the inverse of the standard normal distribution, z = Φ⁻¹(1 – p/2). The z-scores of independent statistical tests should follow a normal distribution and have a variance of 1. Insufficient variation in the z-scores suggests that the results of the four studies are influenced by questionable research practices.

The variance of the four z-scores is Var(z) = 0.08, far below the expected variance of 1. A chi-square test against the expected variance of 1 is significant, Chi-Square(df = 3) = 0.26, left-tailed p = .033.
The article reports 100% significant results, but the median observed power is only 59%. With an inflation of 41% (100% – 59%), the Replicability-Index is 59 – 41 = 18.

An R-Index of 18 is lower than the R-Index of 22 that would be obtained if the null-hypothesis were true and only significant results were reported. Thus, after correcting for inflation, the data provide no support for the alleged effect.
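The TIVA and R-Index calculations above can be reproduced with a short script; the sketch below uses the four reported p-values (the results differ slightly in the second decimal from the values above, which were computed from rounded z-scores):

```python
# Sketch: reproduce the TIVA and R-Index calculations for Williams and Bargh (2008).
from statistics import median, variance
from scipy.stats import norm, chi2

p_values = [0.049, 0.019, 0.042, 0.009]            # reported two-tailed p-values
z = [norm.ppf(1 - p / 2) for p in p_values]        # convert p-values to z-scores

# Test of Insufficient Variance: observed variance of z vs. expected variance of 1.
var_z = variance(z)                                # sample variance, ~.08-.09
chi_sq = (len(z) - 1) * var_z / 1.0                # ~0.26-0.27 with df = 3
p_tiva = chi2.cdf(chi_sq, df=len(z) - 1)           # left-tailed p, ~.03

# R-Index: median observed power minus inflation (success rate - median observed power).
crit = norm.ppf(1 - 0.05 / 2)                      # critical z, 1.96
observed_power = [1 - norm.cdf(crit - zi) for zi in z]
med_power = median(observed_power)                 # ~.59
success_rate = 1.0                                 # all four reported tests are significant
r_index = med_power - (success_rate - med_power)   # ~.18

print(f"Var(z) = {var_z:.2f}, Chi-Square(df = {len(z) - 1}) = {chi_sq:.2f}, p = {p_tiva:.3f}")
print(f"Median observed power = {med_power:.2f}, R-Index = {r_index:.2f}")
```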

It is therefore not surprising that multiple replication attempts have failed to replicate the reported results. http://www.psychfiledrawer.org/chart.php?target_article=2

In conclusion, there is no credible empirical support for the theoretical claims in Williams and Bargh (2008) and the article should not be quoted as providing evidence for these claims.