
Dr. R’s comment on the Official Statement by the Board of the German Psychological Association (DGPs) about the Results of the OSF-Reproducibility Project published in Science.

Thanks to social media, geography is no longer a barrier for scientific discourse. However, language is still a barrier. Fortunately, I understand German and I can respond to the official statement of the board of the German Psychological Association (DGPs), which was posted on the DGPs website (in German).

BACKGROUND

On September 1, 2015, Prof. Dr. Andrea Abele-Brehm, Prof. Dr. Mario Gollwitzer, and Prof. Dr. Fritz Strack published an official response to the results of the OSF-Replication Project – Psychology (in German) that was distributed to public media in order to correct potentially negative impressions about psychology as a science.

Numerous members of DGPs felt that this official statement did not express their views and noticed that members were not consulted about the official response of their organization. In response to this criticism, DGPs opened a moderated discussion page, where members could post their personal views (mostly in German).

On October 6, 2015, the board closed the discussion page and posted some final words (Schlussbeitrag). In this blog, I provide a critical commentary on these final words.

BOARD’S RESPONSE TO COMMENTS

The board members provide a summary of the core insights and arguments of the discussion from their (personal/official) perspective.

"We would now like to summarize what we consider to be the central insights and arguments of the various forum contributions and to make clear which preliminary conclusions we on the board draw from them."

1. 68% success rate?

The first official statement suggested that the replication project showed that 68% of the results could be replicated. This number is based on significance in a meta-analysis that combines the original and the replication study. Critics pointed out that this approach is problematic because the replication project showed clearly that the original effect sizes were inflated (on average by 100%). Thus, the meta-analysis is biased and the 68% number is inflated.

In response to this criticism, the DGPs board states that “68% is the maximum [größtmöglich] optimistic estimate.” I think the term “biased and statistically flawed estimate” is a more accurate description of this estimate.   It is common practice to consider fail-safe-N or to correct meta-analysis for publication bias. When there is clear evidence of bias, it is unscientific to report the biased estimate. This would be like saying that the maximum optimistic estimate of global warming is that global warming does not exist. This is probably a true statement about the most optimistic estimate, but not a scientific estimate of the actual global warming that has been taking place. There is no place for optimism in science. Optimism is a bias and the aim of science is to remove bias. If DGPs wants to represent scientific psychology, the board should post what they consider the most accurate estimate of replicability in the OSF-project.

2. The widely cited 36% estimate is negative.

The board members then justify the publication of the maximally optimistic estimate as a strategy to counteract negative perceptions of psychology as a science in response to the finding that only 36% of results were replicated. The board members felt that these negative responses misrepresent the OSF-project and psychology as a scientific discipline.

"This does justice neither to the Open Science Collaboration project nor to our discipline as a whole. We should, however, be pioneers among the affected sciences in dealing with the crisis constructively."

However, reporting the dismal 36% replication rate of the OSF-replication project is not a criticism of the OSF-project. Rather, it assumes that the OSF-replication project was a rigorous and successful attempt to provide an estimate of the typical replicability of results published in top psychology journals. The outcome could have been 70% or 35%. The quality of the project does not depend on the result. The result is also not a negatively biased perception of psychology as a science. It is an objective scientific estimate of the probability that a reported significant result in a journal would produce a significant result again in a replication study.   Whether 36% is acceptable or not can be debated, but it seems problematic to post a maximally optimistic estimate to counteract negative implications of an objective estimate.

3. Is 36% replicability good or bad?

Next, the board ponders the implications of the 36% success rate. “How should we evaluate this number?” The board members do not know.  According to their official conclusion, this question is complex as divergent contributions on the discussion page suggest.

"In the Science article, the relative frequency of statistically significant effects in the replication studies was reported as 36%. How should this number be evaluated? The forum contributions by Roland Deutsch, Klaus Fiedler, Moritz Heene (see also Heene & Schimmack), and Frank Renkewitz make clear how complex the answer to this question is."

To help the board members to understand the number, I can give a brief explanation of replicability. Although there are several ways to define replicability, one plausible definition of replicability is to equate it with statistical power. Statistical power is the probability that a study will produce a significant result. A study with 80% power has an 80% probability of producing a significant result. For a set of 100 studies, one would expect roughly 80 significant results and 20 non-significant results. For 100 studies with 36% power, one would expect roughly 36 significant results and 64 non-significant results. If researchers published all of their studies, the percentage of published significant results would provide an unbiased estimate of the typical power of studies. However, it is well known that significant results are more likely to be written up, submitted for publication, and accepted for publication. These reporting biases explain why psychology journals report over 90% significant results, although the actual power of studies is less than 90%.
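
To make this definition concrete, here is a minimal simulation sketch. It assumes the standard model in which observed z-scores scatter around a mean determined by true power and uses a two-tailed significance criterion of p < .05; the numbers and function names are illustrative and are not taken from the OSF data.

```python
# Minimal sketch: with 36% power, roughly 36 of 100 studies reach p < .05,
# and publishing only significant results inflates the apparent evidence.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
z_crit = stats.norm.ppf(1 - 0.05 / 2)        # 1.96 for a two-tailed test

power = 0.36
ncp = z_crit + stats.norm.ppf(power)         # mean observed z that yields 36% power

observed_z = rng.normal(loc=ncp, scale=1, size=100_000)
significant = np.abs(observed_z) > z_crit
print(f"significant results per 100 studies: {100 * significant.mean():.0f}")  # ~36

# If only significant results are published, the published success rate is ~100%
# and the average published |z| overstates the true mean z (publication bias).
print(f"true mean z: {ncp:.2f}, mean |z| among published results: {np.abs(observed_z[significant]).mean():.2f}")
```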

In 1962, Jacob Cohen provided the first attempt to estimate the replicability of psychological results. His analysis suggested that psychological studies have approximately 50% power. He suggested that psychologists should increase power to 80% to provide robust evidence for effects and to avoid wasting resources on studies that cannot detect small, but practically important effects. For the next 50 years, psychologists ignored Cohen’s warning that most studies are underpowered, despite repeated reminders that there were no signs of improvement, including reminders by prominent German psychologists like Gerd Gigerenzer, director of a Max Planck Institute (Sedlmeier & Gigerenzer, 1989; Maxwell, 2004; Schimmack, 2012).

The 36% success rate for an unbiased set of 100 replication studies suggests that the actual power of published studies in psychology journals is 36%. The power of all studies conducted is even lower because the p < .05 selection criterion favors studies with higher power. Does the board think 36% power is an acceptable amount of power?

4. Psychologists should improve replicability in the future

On a positive note, the board members suggest that, after careful deliberation, psychologists need to improve replicability so that it can be demonstrated in a few years that replicability has increased.

"After careful discussion among our members, we must take measures (at journals, in departments, at funding organizations, etc.) that can increase the replication rate over time."

The board members do not mention a simple solution to the replicability problem that was advocated over 50 years ago by Jacob Cohen. To increase replicability, psychologists have to think about the strength of the effects that they are investigating and they have to conduct studies that have a realistic chance to distinguish these effects from variation due to random error. This often means investing more resources (larger samples, repeated trials, etc.) in a single study. Unfortunately, the leaders of German psychology appear to be unaware of this important and simple solution to the replication crisis. They neither mention power as a cause of the problem, nor do they recommend increasing power to increase replicability in the future.

5. Do the Results Reveal Fraud?

The DGPs board members then discuss the possibility that the OSF-reproducibilty results reveal fraud, like the fraud committed by Stapel. The board points out that the OSF-results do not imply that psychologists commit fraud because failed replications can occur for various reasons.

"Many media outlets (and also some colleagues from our own field) mention the findings of the Science study in the same breath as the fraud scandals that have shaken our field in recent years. In our view, this association is problematic: it suggests that the low replication rate is due to methodologically questionable behavior by the authors of the original studies."

It is true that the OSF-results do not reveal fraud. However, the board members confuse fraud with questionable research practices. Fraud is defined as fabricating data that were never collected. Only one of the 100 studies in the OSF-replication project (by Jens Förster, a former student of Fritz Strack, one of the board members) is currently being investigated for fraud by the University of Amsterdam.  Despite very strong results in the original study, it failed to replicate.

The more relevant question is how much questionable research practices contributed to the results. Questionable research practices are practices where data are being collected, but statistical results are only reported if they produce a significant result (studies, conditions, dependent variables, and data points that do not produce significant results are excluded from the results that are submitted for publication). It has been known for over 50 years that these practices produce a discrepancy between the actual power of studies and the rate of significant results that are published in psychology journals (Sterling, 1959).

Recent statistical developments have made it possible to estimate the true power of studies after correcting for publication bias.   Based on these calculations, the true power of the original studies in the OSF-project was only 50%.   Thus a large portion of the discrepancy between nearly 100% reported significant results and a replication success rate of 36% is explained by publication bias (see R-Index blogs for social psychology and cognitive psychology).

Other factors may contribute to the discrepancy between the statistical prediction that the replication success rate would be 50% and the actual success rate of 36%. Nevertheless, the lion’s share of the discrepancy can be explained by the questionable practice of reporting only evidence that supports a hypothesis that a researcher wants to support. This motivated bias undermines the very foundations of science. Unfortunately, the board ignores this implication of the OSF results.

6. What can we do?

The board members have no answer to this important question. In the past four years, numerous articles have been published that have made suggestions about how psychology can improve its credibility as a science. Yet, the DGPs board seems to be unaware of these suggestions or unable to comment on these proposals.

"This brings us to the question that concerns us most as a professional society and will continue to concern us. For one thing, we need careful self-reflection about the role of replications in our field, about the meaning of the latest Science study as well as of the further projects of the Center for Open Science that are currently in press or still being analyzed (such as the Many Labs studies), and about the limits of our methods and paradigms."

The time for more discussion has passed. After 50 years of ignoring Jacob Cohen’s recommendation to increase statistical power it is time for action. If psychologists are serious about replicability, they have to increase the power of their studies.

The board then discusses the possibility of measuring and publishing replication rates at the level of departments or individual scientists. They are not in favor of such initiatives, but they provide no argument for their position.

"Databases of successful and failed replications can, of course, also be analyzed at the level of departments or even individual researchers (who has the highest replication rate, who the lowest?). More sensible than such analyses are initiatives like those currently being implemented (among other places) at LMU Munich (see the contribution by Schönbrodt and colleagues)."

The question is why replicability should not be measured and used to evaluate researchers. If the board really valued replicability and wanted to increase replicability in a few years, wouldn’t it be helpful to have a measure of replicability and to reward departments or researchers who invest more resources in high powered studies that can produce significant results without the need to hide disconfirming evidence in file-drawers?   A measure of replicability is also needed because current quantitative measures of scientific success are one of the reasons for the replicability crisis. The most successful researchers are those who publish the most significant results, no matter how these results were obtained (with the exception of fraud). To change this unscientific practice of significance chasing, it is necessary to have an alternative indicator of scientific quality that reflects how significant results were obtained.

Conclusion

The board makes some vague concluding remarks that are not worthwhile repeating here. So let me conclude with my own remarks.

The response of the DGPs board is superficial and does not engage with the actual arguments that were exchanged on the discussion page. Moreover, it ignores some solid scientific insights into the causes of the replicability crisis and it makes no concrete suggestions how German psychologists should change their behaviors to improve the credibility of psychology as a science. Not once do they point out that the results of the OSF-project were predictable based on the well-known fact that psychological studies are underpowered and that failed studies are hidden in file-drawers.

I received my education in Germany all the way to the Ph.D. at the Free University in Berlin. I had several important professors and mentors who educated me about philosophy of science and research methods (Rainer Reisenzein, Hubert Feger, Hans Westmeyer, Wolfgang Schönpflug). I was a member of DGPs for many years. I do not believe that the opinion of the board members represents a general consensus among German psychologists. I hope that many German psychologists recognize the importance of replicability and are motivated to make changes to the way psychologists conduct research. As I am no longer a member of DGPs, I have no direct influence on it, but I hope that the next election will produce a board that promotes open science, transparency, and above all scientific integrity.


The Replicability of Social Psychology in the OSF-Reproducibility Project

Abstract: I predicted the replicability of 38 social psychology results in the OSF-Reproducibility Project. Based on post-hoc-power analysis, I predicted a success rate of 35%. The actual success rate was 8% (3 out of 38), and post-hoc-power was estimated to be 3% for 36 out of 38 studies (an estimate close to the type-I error rate, which is what one expects when the null-hypothesis is true).

The OSF-Reproducibility Project aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.

An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).

In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. Consequently, between-subject designs require large samples to study small effects or they can only be used to study large effects.
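
As a rough illustration of this trade-off (the effect sizes below are Cohen’s generic benchmarks, not values from the OSF data), a standard a priori power analysis for a two-group between-subject design shows how quickly the required sample grows as effects get smaller:

```python
# Approximate per-group sample sizes for 80% power (two-tailed alpha = .05)
# in an independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

for d in (0.8, 0.5, 0.2):   # large, medium, small effect sizes
    n = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: about {n:.0f} participants per group")
# d = 0.8: ~26 per group; d = 0.5: ~64; d = 0.2: ~393
```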

One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychologists were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject studies by social psychologists to separate these factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).

Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project to all areas of psychology. The Replicability-Rankings suggest that social psychology has a lower replicability than other areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. Other areas of psychology had too few studies to conduct a meaningful analysis. Thus, the OSF-reproducibility results should not be generalized to all areas of psychology.

The master data file of the OSF-reproducibility project contained 167 studies with replication results for 99 studies. 57 studies were classified as social psychology studies. However, this classification used a broad definition of social psychology that included personality psychology and developmental psychology. It included six articles published in the personality section of the Journal of Personality and Social Psychology. As each section functions essentially like an independent journal, I excluded all studies from this section. The file also contained two independent replications of two experiments (Experiments 5 and 7) in Albarracín et al. (2008; DOI: 10.1037/a0012833). As the main sampling strategy was to select the last study of each article, I only included Study 7 in the analysis (Study 5 did not replicate, p = .77). Thus, my selection did not lower the rate of successful replications. There were also two independent replications of the same result in Bressan and Stranieri (2008). Both replications produced non-significant results (p = .63, p = .75). I selected the replication study with the larger sample (N = 318 vs. 259). I also excluded two studies that were not independent replications. Rule and Ambady (2008) examined the correlation between facial features and success of CEOs. The replication study had new raters to rate the faces, but used the same faces. Heine, Buchtel, and Norenzayan (2008) examined correlates of conscientiousness across nations and the replication study examined the same relationship across the same set of nations. I also excluded replications of non-significant results because non-significant results provide ambiguous information and cannot be interpreted as evidence for the null-hypothesis. For this reason, it is not clear how the results of a replication study should be interpreted. Two underpowered studies could easily produce consistent results that are both type-II errors. For this reason, I excluded Ranganath and Nosek (2008) and Eastwick and Finkel (2008). The final sample consisted of 38 articles.

I first conducted a post-hoc-power analysis of the reported original results. Test statistics were first converted into two-tailed p-values, and two-tailed p-values were converted into absolute z-scores using the inverse of the standard normal distribution, z = norm.inverse(1 – p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).
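
For readers who want to reproduce the first step of this analysis, the p-to-z conversion is sketched below (the mixed-power model itself is not reproduced here):

```python
# Convert a two-tailed p-value into an absolute z-score using the inverse
# of the standard normal cumulative distribution.
from scipy import stats

def p_to_z(p_two_tailed):
    return stats.norm.ppf(1 - p_two_tailed / 2)

print(round(p_to_z(0.05), 2))    # 1.96
print(round(p_to_z(0.008), 2))   # ~2.65
```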

Estimated power was 35%. This finding reflects the typical finding that reported results are a biased sample of studies that produced significant results, whereas non-significant results are not submitted for publication. Based on this estimate, one would expect that only 35% of the 38 findings (k = 13) would produce a significant result in an exact replication study with the same design and sample size.

[Figure: PHP-Curve OSF-REP-Social-Original]

The figure visualizes the discrepancy between observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated and the mode of the curve (its highest point) is projected to be on the left side of the significance criterion (z = 1.96, p = .05, two-tailed). Given the absence of reliable data in the range from 0 to 1.96, it is impossible to estimate the exact distribution in this region, but the steep decline of z-scores on the right side of the significance criterion suggests that many of the significant results achieved significance only with the help of inflated observed effect sizes. As sampling error is random, such inflated results are unlikely to replicate in a replication study.

The replication studies had different sample sizes than the original studies. This makes it difficult to compare the prediction to the actual success rate because the actual success rate could be much higher if the replication studies had much larger samples and more power to replicate effects. For example, if all replication studies had sample sizes of N = 1,000, we would expect a much higher replication rate than 35%. The median sample size of the original studies was N = 86. This is representative of studies in social psychology. The median sample size of the replication studies was N = 120. Given this increase in power, the predicted success rate would increase to 50%. However, the increase in power was not uniform across studies. Therefore, I used the p-values and sample size of the replication study to compute the z-score that would have been obtained with the original sample size and I used these results to compare the predicted success rate to the actual success rate in the OSF-reproducibility project.
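
One simple way to carry out such a rescaling rests on the approximation that, for a constant effect size, z-scores grow with the square root of the sample size. The sketch below uses this approximation; it may differ in detail from the exact procedure used for the analysis reported here.

```python
# Rescale a replication z-score to the original sample size, assuming that
# z is proportional to the square root of N for a fixed effect size.
import numpy as np
from scipy import stats

def adjusted_z(p_replication, n_replication, n_original):
    z_rep = stats.norm.ppf(1 - p_replication / 2)      # two-tailed p -> absolute z
    return z_rep * np.sqrt(n_original / n_replication)

# Example (see #2 in the list below): a replication with p = .045 and N = 180,
# rescaled to the original N = 70, is no longer significant.
z_adj = adjusted_z(0.045, 180, 70)
print(round(z_adj, 2), round(2 * stats.norm.sf(z_adj), 2))   # ~1.25, p ~ .21
```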

The depressing finding was that the actual success rate was much lower than the predicted success rate. Only 3 out of 38 results (8%) produced a significant result (without the correction for sample size, 5 findings would have been significant). Even more depressing is the fact that a 5% significance criterion implies that 1 out of every 20 studies is expected to produce a significant result just by chance. Thus, the actual success rate is close to the success rate that would be expected if all of the original results were false positives. A success rate of 8% would imply that the actual power of the replication studies was only 8%, compared to the predicted power of 35%.

The next figure shows the post-hoc-power curve for the sample-size corrected z-scores.

[Figure: PHP-Curve OSF-REP-Social-AdjRep]

The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 3% for the homogeneous case. This finding means that the distribution of z-scores for 36 of the 38 results is consistent with the null-hypothesis that the true effect size for these effects is zero. Only two z-scores greater than 4 (one shown, the other greater than 6 not shown) appear to be replicable and robust effects.

One replicable finding was obtained in a study by Halevy, Bornstein, and Sagiv. The authors demonstrated that allocation of money to in-group and out-group members is influenced much more by favoring the in-group than by punishing the out-group. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.

The other successful replication was a study by Lemay and Clark (DOI: 10.1037/0022-3514.94.4.647). The replicated finding was that participants’ projected their own responsiveness in a romantic relationship onto their partners’ responsiveness while controlling for partners’ actual responsiveness. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.

Based on weak statistical evidence in the original studies, I had predicted failures of replication for 25 studies. Given the low success rate, it is not surprising that my success rate for these predictions was 100%.

I made the wrong prediction for 11 results. In all cases, I predicted a successful replication when the outcome was a failed replication. Thus, my overall success rate was 27/38 = 71%. Unfortunately, this success rate is easily beaten by a simple prediction rule that nothing in social psychology replicates, which is wrong in only 3 out of 38 predictions (35/38 = 92% success rate).

Below I briefly comment on the 11 failed predictions.

1   Based on strong statistics (z > 4), I had predicted a successful replication for Förster, Liberman, and Kuschel (DOI: 10.1037/0022-3514.94.4.579). However, even when I made this prediction based on the reported statistics, I had my doubts about this study because statisticians had discovered anomalies in Jens Förster’s studies that cast doubt on the validity of these reported results. Post-hoc power analysis can correct for publication bias, but it cannot correct for other sources of bias that lead to vastly inflated effect sizes.

2   I predicted a successful replication of Payne, MA Burkley, MB Stokes. The replication study actually produced a significant result, but it was no longer significant after correcting for the larger sample size in the replication study (180 vs. 70, p = .045 vs. .21). Although the p-value in the replication study is not very reassuring, it is possible that this is a real effect. However, the original result was probably still inflated by sampling error to produce a z-score of 2.97.

3   I predicted a successful replication of McCrae (DOI: 10.1037/0022-3514.95.2.274). This prediction was based on a transcription error. Whereas the z-score for the target effect was 1.80, I posted a z-score of 3.5. Ironically, the study did successfully replicate with a larger sample size, but the effect was no longer significant after adjusting the result for sample size (N = 61 vs. N = 28). This study demonstrates that marginally significant effects can reveal real effects, but it also shows that larger samples are needed in replication studies to demonstrate this.

4   I predicted a successful replication for EP Lemay, MS Clark (DOI: 10.1037/0022-3514.95.2.420). This prediction was based on a transcription error because EP Lemay and MS Clark had another study in the project. With the correct z-score of the original result (z = 2.27), I would have predicted correctly that the result would not replicate.

5  I predicted a successful replication of Monin, Sawyer, and Marquez (DOI: 10.1037/0022-3514.95.1.76) based on a strong result for the target effect (z = 3.8). The replication study produced a z-score of 1.45 with a sample size that was not much larger than the original study (N = 75 vs. 67).

6  I predicted a successful replication for Shnabel and Nadler (DOI: 10.1037/0022-3514.94.1.116). The replication study increased sample size by 50% (Ns = 141 vs. 94), but the effect in the replication study was modest (z = 1.19).

7  I predicted a successful replication for van Dijk, van Kleef, Steinel, van Beest (DOI: 10.1037/0022-3514.94.4.600). The sample size in the replication study was slightly smaller than in the original study (N = 83 vs. 103), but even with adjustment the effect was close to zero (z = 0.28).

8   I predicted a successful replication of V Purdie-Vaughns, CM Steele, PG Davies, R Ditlmann, JR Crosby (DOI: 10.1037/0022-3514.94.4.615). The original study had rather strong evidence (z = 3.35). In this case, the replication study had a much larger sample than the original study (N = 1,490 vs. 90) and still did not produce a significant result.

9  I predicted a successful replication of C Farris, TA Treat, RJ Viken, RM McFall (doi:10.1111/j.1467-9280.2008.02092.x). The replication study had a somewhat smaller sample (N = 144 vs. 280), but even with adjustment of sample size the effect in the replication study was close to zero (z = 0.03).

10   I predicted a successful replication of KD Vohs and JW Schooler (doi:10.1111/j.1467-9280.2008.02045.x). I made this prediction based on generally strong statistics, although the strength of the target effect was below 3 (z = 2.8) and the sample size was small (N = 30). The replication study doubled the sample size (N = 58), but produced weak evidence (z = 1.08). However, even the sample size of the replication study is modest and does not allow strong conclusions about the existence of the effect.

11   I predicted a successful replication of Blankenship and Wegener (DOI: 10.1037/0022-3514.94.2.196). The article reported strong statistics and the z-score for the target effect was greater than 3 (z = 3.36). The study also had a large sample size (N = 261). The replication study also had a similarly large sample size (N = 251), but the effect was much smaller than in the original study (z = 3.36 vs. 0.70).

In some of these failed predictions it is possible that the replication study failed to reproduce the same experimental conditions or that the population of the replication study differs from the population of the original study. However, there are twice as many studies where the failure of replication was predicted based on weak statistical evidence and the presence of publication bias in social psychology journals.

In conclusion, the original studies in this representative sample of social psychology articles reported a 100% success rate. It is well known that this success rate can only be achieved with selective reporting of significant results. Even the inflated estimate of median observed power is only 71%, which shows that the success rate of 100% is inflated. A power estimate that corrects for inflation suggested that only 35% of results would replicate, and the actual success rate is only 8%. While mistakes by the replication experimenters may contribute to the discrepancy between the prediction of 35% and the actual success rate of 8%, it was predictable based on the results in the original studies that the majority of results would not replicate in replication studies with the same sample size as the original studies.

This low success rate is not characteristic of other sciences and other disciplines in psychology. As mentioned earlier, the success rate for cognitive psychology is higher and comparisons of psychological journals show that social psychology journals have lower replicability than other journals. Moreover, an analysis of time trends shows that replicability of social psychology journals has been low for decades and some journals even show a negative trend in the past decade.

The low replicability of social psychology has been known for over 50 years, since Cohen examined the replicability of results published in the Journal of Abnormal and Social Psychology (now Journal of Personality and Social Psychology), the flagship journal of social psychology. Cohen estimated a replicability of 60%. Social psychologists would rejoice if the reproducibility project had shown a replication rate of 60%. The depressing result is that the actual replication rate was 8%.

The main implication of this finding is that it is virtually impossible to trust any results that are being published in social psychology journals. Yes, two articles that posted strong statistics (z > 4) replicated, but several results with equally strong statistics did not replicate. Thus, it is reasonable to distrust all results with z-scores below 4 (4 sigma rule), but not all results with z-scores greater than 4 will replicate.

Given the low credibility of original research findings, it will be important to raise the quality of social psychology by increasing statistical power. It will also be important to allow publication of non-significant results to reduce the distortion that is created by a file-drawer filled with failed studies. Finally, it will be important to use stronger methods of bias-correction in meta-analysis because traditional meta-analysis seemed to show strong evidence even for incredible effects like premonition for erotic stimuli (Bem, 2011).

In conclusion, the OSF-project demonstrated convincingly that many published results in social psychology cannot be replicated. If social psychology wants to be taken seriously as a science, it has to change the way data are collected, analyzed, and reported and demonstrate replicability in a new test of reproducibility.

The silver lining is that a replication rate of 8% is likely to be an underestimation and that regression to the mean alone might lead to some improvement in the next evaluation of social psychology.

Using the R-index to detect questionable research practices in SSRI studies

Amna Shakil and Ulrich Schimmack

In 2008, Turner and colleagues (2008) examined the presence of publication bias in clinical trials of antidepressants. They found that out of 74 FDA-registered studies, 51% showed positive results. However, positive results were much more likely to be published, as 94% of the published results were positive. There were two reasons for the inflated percentage of positive results. First, negative results were not published. Second, negative results were published as positive results. Turner and colleagues’ (2008) results received a lot of attention and cast doubt on the effectiveness of anti-depressants.

A year after Turner and colleagues (2008) published their study, Moreno, Sutton, Turner, Abrams, Cooper and Palmer (2009) examined the influence of publication bias on the effect-size estimate in clinical trials of antidepressants. They found no evidence of publication bias in the FDA-registered trials, leading the researchers to conclude that the FDA data provide an unbiased gold standard to examine biases in the published literature.

The effect size for treatment with anti-depressants in the FDA data was g = 0.31, 95% confidence interval 0.27 to 0.35. In contrast, the uncorrected average effect size in the published studies was g = 0.41, 95% confidence interval 0.37 to 0.45. This finding shows that publication bias inflates effect size estimates by 32% ((0.41 – 0.31)/0.31).

Moreno et al. (2009) also used regression analysis to obtain a corrected effect size estimate based on the biased effect sizes in the published literature. In this method, effect sizes are regressed on sampling error under the assumption that studies with smaller samples (and larger sampling error) have more bias. The intercept is used as an estimate of the population effect size when sampling error is zero. This correction method yielded an effect size estimate of g = 0.29, 95% confidence interval 0.23 to 0.35, which is similar to the gold standard estimate (.31).
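
A generic sketch of this kind of correction is shown below. It follows the general logic of regressing effect sizes on their standard errors and reading off the intercept (in the spirit of PET/Egger-type meta-regression), not necessarily Moreno et al.’s exact specification, and the data are hypothetical placeholders rather than the antidepressant trials.

```python
# Regress observed effect sizes on their standard errors; the intercept
# estimates the effect size as sampling error approaches zero.
import numpy as np
import statsmodels.api as sm

effect_sizes    = np.array([0.55, 0.48, 0.40, 0.35, 0.33, 0.30])   # hypothetical g's
standard_errors = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08])

X = sm.add_constant(standard_errors)                   # intercept + standard error
fit = sm.WLS(effect_sizes, X, weights=1 / standard_errors**2).fit()
print(f"bias-corrected estimate (intercept): {fit.params[0]:.2f}")
print(f"small-study effect (slope): {fit.params[1]:.2f}")
```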

The main limitation of the regression method is that other factors can produce a correlation between sample size and effect size (e.g., higher quality studies are more costly and use smaller samples). To avoid this problem, we used an alternative correction method that does not make this assumption.

The method uses the R-Index to examine bias in a published data set. The R-Index increases as statistical power increases and it decreases when publication bias is present. To obtain an unbiased effect size estimate, studies are selected to maximize the R-Index.

Since the actual data files were not available, graphs A and B from Moreno et al.’s (2009) study were used to obtain information about effect size and sample error of all the FDA-registered and the published journal articles.

The FDA-registered studies had a success rate of 53% and an observed power of 56%, resulting in an inflation close to 0. The close match between the success rate and observed power confirms that the FDA studies are not biased. Given the lack of bias (inflation), the most accurate estimate of the effect size is obtained by using all studies.

The published journal articles had a success rate of 86% and an observed power of 73%, resulting in an inflation rate of 12%. The inflation rate of 12% confirms that the published data set is biased. The R-Index subtracts the inflation rate from observed power to correct for inflation. Thus, the R-Index for the published studies is 73 – 12 = 61. The weighted effect size estimate was d = .40.
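
For readers who want to check these numbers, the R-Index arithmetic is sketched below with the rounded percentages reported above; the published value of 61 was presumably computed from unrounded inputs, so the last digit can differ.

```python
# R-Index = median observed power - inflation,
# where inflation = success rate - median observed power.
def r_index(success_rate, median_observed_power):
    inflation = success_rate - median_observed_power
    return median_observed_power - inflation

print(r_index(success_rate=53, median_observed_power=56))   # FDA trials: 59 (no inflation)
print(r_index(success_rate=86, median_observed_power=73))   # published trials: 60 (~61 unrounded)
```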

The next step was to select sets of studies to maximize the R-Index. As most studies were significant, the success rate could not change much. As a result, most of the increase would be achieved by selecting studies with higher sample sizes in order to increase power. The maximum R-Index was obtained for a cut-off point of N = 225. This left 14 studies with a total sample size of 4,170 participants. The success rate was 100% with median observed power of 85%. The Inflation was still 15%, but the R-Index was higher than it was for the full set of studies (70 vs. 61). The weighted average effect size in the selected set of powerful studies was d = .34. This result is very similar to the gold standard in the FDA data. The small discrepancy can be attributed to the fact that even studies with 85% power still have a small bias in the estimation of the true effect size.

In conclusion, our alternative effect size estimation procedure confirms Moreno et al.’s (2009) results using an alternative bias-correction method and shows that the R-Index can be a valuable tool to detect and correct for publication bias in other meta-analyses.

These results have important practical implications. The R-Index confirms that published clinical trials are biased and can provide false information about the effectiveness of drugs. It is therefore important to ensure that clinical trials are preregistered and that all results of clinical trials are published. The R-Index can be used to detect violations of these practices that lead to biased evidence. Another important finding is that clinical trials of antidepressants do show effectiveness and that antidepressants can be used as effective treatments of depression. The presence of publication bias should not be used to claim that antidepressants lack effectiveness.

References

Moreno, S. G., Sutton, A. J., Turner, E. H., Abrams, K. R., Cooper, N. J., Palmer, T. M., & Ades, A. E. (2009). Novel methods to deal with publication biases: Secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications. BMJ, 339, b2981.

Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252-260.

When Exact Replications Are Too Exact: The Lucky-Bounce-Test for Pairs of Exact Replication Studies

Imagine an NBA player has an 80% chance to make one free throw. What is the chance that he makes both free throws? The correct answer is 64% (80% * 80%).

Now consider the possibility that it is possible to distinguish between two types of free throws. Some free throws are good; they don’t touch the rim and make a swishing sound when they go through the net (all net). The other free throws bounce off the rim and go in (rattling in).

What is the probability that an NBA player with an 80% free throw percentage makes a free throw that is all net or rattles in? It is more likely that an NBA player with an 80% free throw average makes a perfect free throw because a free throw that rattles in could easily have bounced the wrong way, which would lower the free throw percentage. To achieve an 80% free throw percentage, most free throws have to be close to perfect.

Let’s say the probability of hitting the rim and going in is 30%. With an 80% free throw average, this means that the majority of free throws are in the close-to-perfect category (20% misses, 30% rattle-in, 50% close-to-perfect).

What does this have to do with science? A lot!

The reason is that the outcome of a scientific study is a bit like throwing free throws. One factor that contributes to a successful study is skill (making correct predictions, avoiding experimenter errors, and conducting studies with high statistical power). However, another factor is random (a lucky or unlucky bounce).

The concept of statistical power is similar to an NBA players’ free throw percentage. A researcher who conducts studies with 80% statistical power is going to have an 80% success rate (that is, if all predictions are correct). In the remaining 20% of studies, a study will not produce a statistically significant result, which is equivalent to missing a free throw and not getting a point.

Many years ago, Jacob Cohen observed that researchers often conduct studies with relatively low power to produce a statistically significant result. Let’s just assume right now that a researcher conducts studies with 60% power. This means, researchers would be like NBA players with a 60% free-throw average.

Now imagine that researchers have to demonstrate an effect not only once, but also a second time in an exact replication study. That is, researchers have to make two free throws in a row. With 60% power, the probability of getting two significant results in a row is only 36% (60% * 60%). Moreover, many of the free throws that are made rattle in rather than being all net. The percentages are about 40% misses, 30% rattling in, and 30% all net.

One major difference between NBA players and scientists is that NBA players have to demonstrate their abilities in front of large crowds and TV cameras, whereas scientists conduct their studies in private.

Imagine an NBA player could just go into a private room, throw two free throws and then report back how many free throws he made and the outcome of these free throws determine who wins game 7 in the playoff finals. Would you trust the player to tell the truth?

If you would not trust the NBA player, why would you trust scientists to report failed studies? You should not.

It can be demonstrated statistically that scientists are reporting more successes than the power of their studies would justify (Sterling et al., 1995; Schimmack, 2012). Amongst scientists this fact is well known, but the general public may not fully appreciate the fact that a pair of exact replication studies with significant results is often just a selection of studies that included failed studies that were not reported.

Fortunately, it is possible to use statistics to examine whether the results of a pair of studies are likely to be honest or whether failed studies were excluded. The reason is that an amateur is not only more likely to miss a free throw. An amateur is also less likely to make a perfect free throw.

Based on the theory of statistical power developed by Neyman and Pearson and popularized by Jacob Cohen, it is possible to make predictions about the relative frequency of p-values in the non-significant (failure), just significant (rattling in), and highly significant (all net) ranges.

As for made free throws, the distinction between lucky and clear successes is somewhat arbitrary because power is continuous. A study with a p-value of .0499 is very lucky because p = .0501 would not have been significant (rattled in after three bounces on the rim). A study with p = .000001 is a clear success. Lower p-values are better, but where to draw the line?

As it turns out, Jacob Cohen’s recommendation to conduct studies with 80% power provides a useful criterion to distinguish lucky outcomes and clear successes.

Imagine a scientist conducts studies with 80% power. The distribution of observed test-statistics (e.g. z-scores) shows that this researcher has a 20% chance to get a non-significant result, a 30% chance to get a lucky significant result (p-value between .050 and .005), and a 50% chance to get a clear significant result (p < .005). If the 20% failed studies are hidden, the percentage of results that rattled in versus studies with all-net results are 37 vs. 63%. However, if true power is just 20% (an amateur), 80% of studies fail, 15% rattle in, and 5% are clear successes. If the 80% failed studies are hidden, only 25% of the successful studies are all-net and 75% rattle in.

One problem with using this test to draw conclusions about the outcome of a pair of exact replication studies is that true power is unknown. To avoid this problem, it is possible to compute the maximum probability of a rattling-in result. As it turns out, the optimal true power to maximize the percentage of lucky outcomes is 66% power. With true power of 66%, one would expect 34% misses (p > .05), 32% lucky successes (.050 < p < .005), and 34% clear successes (p < .005).
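
The percentages above can be reproduced with a few lines of code. The sketch below uses the one-sided normal approximation for observed z-scores (mean z determined by true power) and the two-tailed cutoffs p < .05 and p < .005; small deviations from the rounded percentages in the text reflect rounding.

```python
# Probability of a miss (p > .05), a lucky success (.005 < p < .05),
# and a clear success (p < .005), as a function of true power.
from scipy import stats

def outcome_probs(power, alpha=0.05, clear_alpha=0.005):
    z_sig = stats.norm.ppf(1 - alpha / 2)           # 1.96
    z_clear = stats.norm.ppf(1 - clear_alpha / 2)   # ~2.81
    ncp = z_sig + stats.norm.ppf(power)             # mean observed z for this power
    miss = stats.norm.cdf(z_sig - ncp)
    clear = stats.norm.sf(z_clear - ncp)
    return miss, 1 - miss - clear, clear

for pwr in (0.80, 0.20, 0.66):
    miss, lucky, clear = outcome_probs(pwr)
    print(f"power {pwr:.0%}: miss {miss:.0%}, lucky {lucky:.0%}, clear {clear:.0%}")

lucky_max = outcome_probs(0.66)[1]                  # power that maximizes lucky successes
print(f"two lucky successes in a row: {lucky_max**2:.0%}")   # ~10-11%
```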

[Figure: LuckyBounceTest]

For a pair of exact replication studies, this means that there is only a 10% chance (32% * 32%) to get two rattle-in successes in a row. In contrast, there is a 90% chance that misses were not reported or that an honest report of successful studies would have produced at least one all-net result (z > 2.8, p < .005).

Example: Unconscious Priming Influences Behavior

I used this test to examine a famous and controversial set of exact replication studies. In Bargh, Chen, and Burrows (1996), Dr. Bargh reported two exact replication studies (studies 2a and 2b) that showed an effect of a subtle priming manipulation on behavior. Undergraduate students were primed with words that are stereotypically associated with old age. The researchers then measured the walking speed of primed participants (n = 15) and participants in a control group (n = 15).

The two studies were not only exact replications of each other; they also produced very similar results. Most readers probably expected this outcome because similar studies should produce similar results, but this false belief ignores the influence of random factors that are not under the control of a researcher. We do not expect lotto winners to win the lottery again because it is an entirely random and unlikely event. Experiments are different because there could be a systematic effect that makes a replication more likely, but in studies with low power results should not replicate exactly because random sampling error influences results.

Study 1: t(28) = 2.86, p = .008 (two-tailed), z = 2.66, observed power = 76%
Study 2: t(28) = 2.16, p = .039 (two-tailed), z = 2.06, observed power = 54%

The median power of these two studies is 65%. However, even if median power were lower or higher, the maximum probability of obtaining two p-values in the range between .050 and .005 remains just 10%.
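
The numbers above follow from the reported t-values via the usual conversion chain (t, then the two-tailed p-value, then the absolute z-score, then observed power). A minimal sketch; differences in the second decimal reflect rounding:

```python
# Observed power for a two-tailed test with alpha = .05, computed from t and df.
from scipy import stats

def observed_power_from_t(t, df, alpha=0.05):
    p = 2 * stats.t.sf(abs(t), df)                  # two-tailed p-value
    z = stats.norm.ppf(1 - p / 2)                   # absolute z-score
    return p, z, stats.norm.cdf(z - stats.norm.ppf(1 - alpha / 2))

for label, t, df in [("Study 1", 2.86, 28), ("Study 2", 2.16, 28)]:
    p, z, power = observed_power_from_t(t, df)
    print(f"{label}: p = {p:.3f}, z = {z:.2f}, observed power = {power:.0%}")
```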

Although this study has been cited over 1,000 times, replication studies are rare.

One of the few published replication studies was reported by Cesario, Plaks, and Higgins (2006). Naïve readers might take the significant results in this replication study as evidence that the effect is real. However, this study produced yet another lucky success.

Study 3: t(62) = 2.41, p = .019, z = 2.35, observed power = 65%.

The chance of obtaining three lucky successes in a row is only 3% (32% * 32% * 32%). Moreover, with a median power of 65% and a reported success rate of 100%, the success rate is inflated by 35%. This suggests that the true power of the reported studies is considerably lower than the observed power of 65% and that observed power is inflated because failed studies were not reported.

The R-Index corrects for inflation by subtracting the inflation rate from observed power (65% – 35%). This means the R-Index for this set of published studies is 30%.

This R-Index can be compared to several benchmarks.

An R-Index of 22% is consistent with the null-hypothesis being true and failed attempts are not reported.

An R-Index of 40% is consistent with 30% true power and all failed attempts are not reported.

It is therefore not surprising that other researchers were not able to replicate Bargh’s original results, even though they increased statistical power by using larger samples (Pashler et al. 2011, Doyen et al., 2011).

In conclusion, it is unlikely that Dr. Bargh’s original results were the only studies that they conducted. In an interview, Dr. Bargh revealed that the studies were conducted in 1990 and 1991 and that they conducted additional studies until the publication of the two studies in 1996. Dr. Bargh did not reveal how many studies they conducted over the span of 5 years and how many of these studies failed to produce significant evidence of priming. If Dr. Bargh himself conducted studies that failed, it would not be surprising that others also failed to replicate the published results. However, in a personal email, Dr. Bargh assured me that “we did not as skeptics might presume run many studies and only reported the significant ones. We ran it once, and then ran it again (exact replication) in order to make sure it was a real effect.” With a 10% probability, it is possible that Dr. Bargh was indeed lucky to get two rattling-in findings in a row. However, his aim to demonstrate the robustness of an effect by trying to show it again in a second small study is misguided. The reason is that it is highly likely that the effect will not replicate or that the first study was already a lucky finding after some failed pilot studies. Underpowered studies cannot provide strong evidence for the presence of an effect and conducting multiple underpowered studies reduces the credibility of successes because the probability of this outcome to occur even when an effect is present decreases with each study (Schimmack, 2012). Moreover, even if Bargh was lucky to get two rattling-in results in a row, others will not be so lucky and it is likely that many other researchers tried to replicate this sensational finding, but failed to do so. Thus, publishing lucky results hurts science nearly as much as the failure to report failed studies by the original author.

Dr. Bargh also failed to realize how lucky he was to obtain his results, in his response to a published failed-replication study by Doyen. Rather than acknowledging that failures of replication are to be expected, Dr. Bargh criticized the replication study on methodological grounds. There would be a simple solution to test Dr. Bargh’s hypothesis that he is a better researcher and that his results are replicable when the study is properly conducted. He should demonstrate that he can replicate the result himself.

In an interview, Tom Bartlett asked Dr. Bargh why he didn’t conduct another replication study to demonstrate that the effect is real. Dr. Bargh’s response was that “he is aware that some critics believe he’s been pulling tricks, that he has a “special touch” when it comes to priming, a comment that sounds like a compliment but isn’t. “I don’t think anyone would believe me,” he says.” The problem for Dr. Bargh is that there is no reason to believe his original results, either. Two rattling-in results alone do not constitute evidence for an effect, especially when this result could not be replicated in an independent study. NBA players have to make free-throws in front of a large audience for a free-throw to count. If Dr. Bargh wants his findings to count, he should demonstrate his famous effect in an open replication study. To avoid embarrassment, it would be necessary to increase the power of the replication study because it is highly unlikely that even Dr. Bargh can continuously produce significant results with samples of N = 30 participants. Even if the effect is real, sampling error is simply too large to demonstrate the effect consistently. Knowledge about statistical power is power. Knowledge about post-hoc power can be used to detect incredible results. Knowledge about a priori power can be used to produce credible results.

Swish!

Why are Stereotype-Threat Effects on Women’s Math Performance Difficult to Replicate?

Updated on May 19, 2016
– corrected mistake in calculation of p-value for TIVA

A Replicability Analysis of Spencer, Steele, and Quinn’s seminal article on stereotype threat effects on gender differences in math performance.

Background

In a seminal article, Spencer, Steele, and Quinn (1999) proposed the concept of stereotype threat. They argued that women may experience stereotype-threat during math tests and that stereotype threat can interfere with their performance on math tests.

The original study reported three experiments.

STUDY 1

Study 1 had 56 participants (28 male and 28 female undergraduate students). The main aim was to demonstrate that stereotype-threat influences performance on difficult, but not on easy math problems.

A 2 x 2 mixed model ANOVA with sex and difficulty produced the following results.

Main effect for sex, F(1, 52) = 3.99, p = .051 (reported as p = .05), z = 1.96, observed power = 50%.

Interaction between sex and difficulty, F(1, 52) = 5.34 , p = .025, z = 2.24, observed power = 61%.

The low observed power suggests that sampling error contributed to the significant results. Assuming observed power is a reliable estimate of true power, the chance of obtaining significant results in both studies would only be 31%. Moreover, even if the true power were in the range between 50% and 80%, there would only be a 32% chance for an observed power estimate to fall into this range. The chance that both observed power values fall into this range is only 10%.

Median observed power is 56%. The success rate is 100%. Thus, the success rate is inflated by 44 percentage points (100% – 56%).

The R-Index for these two results is low, Ř = 12 (56 – 44).
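
These numbers can be reproduced from the reported F-values with the same conversion chain used elsewhere on this blog (F, then p, then z, then observed power), treating both reported results as successes as in the original article; small differences from the rounded values above reflect rounding.

```python
# Observed power from F(1, df2) and the R-Index for the two results of Study 1.
import numpy as np
from scipy import stats

def observed_power_from_F(F, df1, df2, alpha=0.05):
    p = stats.f.sf(F, df1, df2)                        # p-value of the F-test
    z = stats.norm.ppf(1 - p / 2)                      # equivalent absolute z-score
    return stats.norm.cdf(z - stats.norm.ppf(1 - alpha / 2))

powers = [observed_power_from_F(3.99, 1, 52), observed_power_from_F(5.34, 1, 52)]
mop = np.median(powers)            # median observed power, ~0.55-0.56
inflation = 1.0 - mop              # success rate of 100% minus median observed power
print(f"MOP = {mop:.0%}, inflation = {inflation:.0%}, R-Index = {mop - inflation:.0%}")
```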

Empirical evidence shows that studies with low R-Indices often fail to replicate in exact replication studies.

It is even more problematic that Study 1 was supposed to demonstrate just the basic phenomenon that women perform worse on math problems than men and that the following studies were designed to move this pre-existing gender difference around with an experimental manipulation. If the actual phenomenon is in doubt, it is unlikely that experimental manipulations of the phenomenon will be successful.

STUDY 2

The main purpose of Study 2 was to demonstrate that gender differences in math performance would disappear when the test is described as gender neutral.

Study 2 recruited 54 students (30 women, 24 men). This small sample size is problematic for several reasons. Power analysis of Study 1 suggested that the authors were lucky to obtain significant results. If power is 50%, there is a 50% chance that an exact replication study with the same sample size will produce a non-significant result. Another problem is that sample sizes need to increase to demonstrate that the gender difference in math performance can be influenced experimentally.

The data were not analyzed according to this research plan because the second test was so difficult that nobody was able to solve these math problems. However, rather than repeating the experiment with a better selection of math problems, the results for the first math test were reported.

As participants provided no repeated performance data on the two tests, this is a 2 x 2 between-subject design that crosses sex and the threat manipulation. With a total sample size of 54 students, the n per cell is approximately 13.

The main effect for sex was significant, F(1, 50) = 5.66, p = .021, z = 2.30, observed power = 63%.

The interaction was also significant, F(1, 50) = 4.18, p = .046, z = 1.99, observed power = 51%.

Once more, median observed power is just 57%, yet the success rate is 100%. Thus, the success rate is inflated by 43 percentage points and the R-Index is low, Ř = 14 (57 – 43), suggesting that an exact replication study would not produce significant results.

STUDY 3

Studies 1 and 2 used highly selective samples (women in the top 10% in math performance). Study 3 aimed to replicate the results of Study 2 in a less selective sample. One might expect that stereotype-threat has a weaker effect on math performance in this sample because stereotype threat can undermine performance when ability is high, but anxiety is not a factor in performance when ability is low. Thus, Study 3 is expected to yield a weaker effect and a larger sample size would be needed to demonstrate the effect. However, sample size was approximately the same as in Study 2 (36 women, 31 men).

The ANOVA showed a main effect of sex on math performance, F(1, 63) = 6.44, p = .014, z = 2.47, observed power = 69%.

The ANOVA also showed a significant interaction between sex and stereotype-threat-assurance, F(1, 63) = 4.78, p = .033, z = 2.14, observed power = 57%.

Once more, the R-Index is low, Ř = 26 (MOP = 63%, Success Rate = 100%, Inflation Rate = 37%).

Combined Analysis

The three studies reported six statistical tests. The R-Index for the combined analysis is low Ř = 18 (MOP = 59%, Success Rate = 100%, Inflation Rate = 41%).

The probability that this pattern of results occurred by chance can be assessed with the Test of Insufficient Variance (TIVA). TIVA tests the hypothesis that the variance of the p-values, converted into z-scores, is less than 1. A variance of one is expected in a set of exact replication studies with fixed true power. Less variance suggests that the z-scores are not a representative sample of independent test statistics. The variance of the six z-scores is low, Var(z) = .04, p < .001 (about 1 out of 1,309).
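A minimal Python sketch of this computation (my reconstruction of the TIVA logic described above, not the original code):

```python
# TIVA sketch: the sample variance of k independent z-scores with fixed true
# power has expected value 1; (k-1)*Var(z) follows a chi-square distribution
# with k-1 degrees of freedom, and a small left-tail probability indicates
# insufficient variance.
from statistics import variance
from scipy.stats import chi2

z_scores = [1.96, 2.24, 2.30, 1.99, 2.47, 2.14]   # the six tests above
var_z = variance(z_scores)                        # sample variance, about .04
chi_sq = (len(z_scores) - 1) * var_z
p_left = chi2.cdf(chi_sq, df=len(z_scores) - 1)   # left tail: too little variance
print(f"Var(z) = {var_z:.2f}, p = {p_left:.5f}")  # about .0008, i.e. 1 out of ~1,300
```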

Correction: I initially reported, “A chi-square test shows that the probability of this event is less than 1 out of 1,000,000,000,000,000, chi-square (df = 5) = 105.”

I made a mistake in the computation of the probability. When I developed TIVA, I confused the numerator and denominator in the test. I was thrilled that the test was so powerful and happy to report the result in bold, but it is incorrect. A small sample of six z-scores cannot produce such low p-values.

Conclusion

The replicability analysis of Spencer, Steele, and Quinn (1999) suggests that the original data provided inflated estimates of effect sizes and replicability. Thus, the R-Index predicts that exact replication studies would fail to replicate the effect.

Meta-Analysis

A forthcoming article in the Journal of School Psychology reports the results of a meta-analysis of stereotype-threat studies in applied school settings (Flore & Wicherts, 2014). The meta-analysis was based on 47 comparisons of girls with stereotype threat versus girls without stereotype threat. The abstract concludes that stereotype threat in this population is a statistically reliable, but small effect (d = .22). However, the authors also noted signs of publication bias. As publication bias inflates effect sizes, the true effect size is likely to be even smaller than the uncorrected estimate of .22.

The article also reports that, after a correction for bias using the trim-and-fill method, the estimated effect size is d = .07 and not significantly different from zero. Thus, the meta-analysis reveals that there is no replicable evidence for stereotype-threat effects on schoolgirls’ math performance. The meta-analysis also implies that any true effect of stereotype threat is likely to be small (d < .2). With a true effect size of d = .2, the original studies by Spencer et al. (1999) and most replication studies had insufficient power to demonstrate stereotype-threat effects, even if the effect exists. An a priori power analysis with d = .2 suggests that 788 participants are needed to have an 80% chance of obtaining a significant result if the true effect is d = .2. Thus, future research on this topic is futile unless statistical power is increased by increasing sample sizes or by using more powerful designs that can demonstrate small effects in smaller samples.
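A quick way to check this figure is a standard a priori power calculation; the sketch below uses statsmodels and is my own illustration, not the authors’ computation:

```python
# Sketch: total N needed for an 80% chance of a significant two-tailed result
# (alpha = .05) in a two-group design with a true effect of d = .2.
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.2, power=0.80, alpha=0.05)
print(2 * math.ceil(n_per_group))   # 788 participants in total (about 394 per group)
```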

One possibility is that the existing studies vary in quality and that good studies showed the effect reliably, whereas bad studies failed to show the effect. To test this hypothesis, it is possible to select studies from a meta-analysis with the goal of maximizing the R-Index. The best chance of obtaining a high R-Index is to focus on studies with large sample sizes because statistical power increases with sample size. However, the table below shows that there are only 8 studies with more than 100 participants, and the success rate in these studies is 13% (1 out of 8), which is consistent with the median observed power of 12% in these studies.

[Table: R-Index analysis of the studies in the stereotype-threat meta-analysis]

It is also possible to select studies that produced significant results (z > 1.96). Of course, this set of studies is biased, but the R-Index corrects for bias. If these studies were successful because they had sufficient power to demonstrate effects, the R-Index would be greater than 50%. However, the R-Index is only 49%.

CONCLUSION

In conclusion, a replicability analysis with the R-Index shows that stereotype-threat is an elusive phenomenon. Even large replication studies with hundreds of participants were unable to provide evidence for an effect that appeared robust in the original article. The R-Index of the meta-analysis by Flore and Wicherts corroborates concerns that the importance of stereotype-threat as an explanation for gender differences in math performance has been exaggerated. Similarly, Ganley, Mingle, Ryan, Ryan, and Vasilyeva (2013) found no evidence for stereotype-threat effects in studies with 931 students and suggested that “these results raise the possibility that stereotype threat may not be the cause of gender differences in mathematics performance prior to college” (p. 1995).

The main novel contribution of this post is to show that this disappointing outcome was predictable on the basis of the empirical results reported in the original article by Spencer et al. (1999). The article suggested that stereotype threat is a pervasive phenomenon that explains gender differences in math performance. However, the R-Index and the insufficient variance in the statistical results suggest that the reported results were biased and overestimated the effect size of stereotype threat. The R-Index corrects for this bias and correctly predicts that replication studies will often produce non-significant results. The meta-analysis confirms this prediction.

In sum, the main conclusions that one can draw from 15 years of stereotype-threat research are that (a) the real reasons for gender differences in math performance are still unknown, (b) resources have been wasted in the pursuit of a negligible factor that may contribute to gender differences in math performance only under very specific circumstances, and (c) the R-Index could have prevented the irrational exuberance about stereotype-threat as a simple solution to an important social issue.

In a personal communication, Dr. Spencer suggested that studies not included in the meta-analysis might produce different results. I suggested that Dr. Spencer provide a list of studies that support the hypothesis. A year later, Dr. Spencer has not provided any new evidence that credibly supports stereotype-threat effects. At present, the existing evidence suggests that published studies provide inflated estimates of the replicability and importance of the effect.

This blog also provides further evidence that male and female psychologists could benefit from a better education in statistics and research methods to avoid wasting resources in the pursuit of false-positive results.

The R-Index of Nicotine-Replacement-Therapy Studies: An Alternative Approach to Meta-Regression

Stanley and Doucouliagos (2013) demonstrated how meta-regression can be used to obtain unbiased estimates of effect sizes from a biased set of original studies. The regression approach relies on the fact that small samples often need luck or questionable practices to produce significant results, whereas large samples can show true effects without the help of luck and questionable practices. If questionable practices or publication bias are present, effect sizes in small samples are inflated and this bias is evident in a regression of effect sizes on sampling error. When bias is present, the intercept of the regression equation can provide a better estimate of the average effect size in a set of studies.
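The sketch below illustrates this regression idea in Python with hypothetical numbers (a generic PET-style weighted regression of effect sizes on standard errors; it is not Stanley and Doucouliagos’s exact specification):

```python
# Sketch: regress observed effect sizes on their standard errors. Under
# selective reporting the slope is positive and the intercept serves as a
# bias-corrected estimate of the average effect size.
import numpy as np
import statsmodels.api as sm

d  = np.array([0.80, 0.55, 0.40, 0.30, 0.25])   # hypothetical observed effect sizes
se = np.array([0.40, 0.30, 0.20, 0.10, 0.05])   # their standard errors

X = sm.add_constant(se)                          # intercept + slope on the standard error
fit = sm.WLS(d, X, weights=1 / se**2).fit()      # precision-weighted regression
print(fit.params)                                # [intercept, slope]; intercept = corrected estimate
```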

One limitation of this approach is that other factors can also produce a correlation between effect size and sampling error. Another problem is that the regression equation can only approximate the effect of bias on effect size estimates.

The R-Index can complement meta-regression in several ways. First, it can be used to examine whether a correlation between effect size and sampling error reflects bias. If small samples have higher effect sizes due to bias, they should also yield more significant results than the power of these studies justifies. If this is not the case, the correlation may simply show that smaller samples examined stronger effects. Second, the R-Index can be used as an alternative way to estimate unbiased effect sizes that does not rely on the relationship between sample size and effect size.

The usefulness of the R-Index is illustrated with Stanley and Doucouliagos’s (2013) meta-analysis of the effectiveness of nicotine replacement therapy (the patch). Table A1 lists sampling errors and t-values of 42 studies. Stanley and Doucouliagos (2013) found that the 42 studies suggested that the patch increases smoking cessation by 93%, but that the estimated effectiveness dropped to 22% in a regression that controlled for biased reporting of results. This implies that the published studies inflate the true effect by more than 300%.

I entered the t-values and standard errors into the R-Index spreadsheet. I used sampling error to estimate sample sizes and degrees of freedom (sampling error = 2 / sqrt[N], hence N = [2 / sampling error]^2). I used one-tailed t-tests to allow for negative t-values because the sign of effects is known in a meta-analysis of studies that try to show treatment effects. Significance was tested using p = .025 (one-tailed), which is equivalent to the conventional two-tailed criterion of p = .050 (z > 1.96).
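The sketch below reconstructs these conversions in Python for a single hypothetical study (my own illustration, not the original spreadsheet):

```python
# Sketch: recover N from the sampling error of d (se = 2/sqrt(N)), compute a
# one-tailed p-value from t, convert it to a z-score, and flag significance
# at p = .025 (equivalent to a two-tailed .05 criterion).
from scipy.stats import t as t_dist, norm

def study_stats(t_value, se):
    n  = round((2 / se) ** 2)                 # se = 2/sqrt(N) -> N = (2/se)^2
    df = n - 2
    p_one = t_dist.sf(t_value, df)            # one-tailed p (negative t gives p > .5)
    z = norm.isf(p_one)                       # z-score corresponding to this p-value
    significant = p_one < .025
    power = 1 - norm.cdf(norm.isf(.025) - z)  # observed power for the z > 1.96 criterion
    return n, round(z, 2), significant, round(power, 2)

print(study_stats(2.10, 0.20))                # hypothetical t-value and sampling error
```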

The R-Index for all 42 studies was 27%. The low R-Index was mostly explained by the low power of studies with small samples: median observed power was just 34%. The percentage of significant results was only slightly higher (40%), so the inflation rate was only 7%.

As studies with low power add mostly noise, Stanley (2010) showed that it can be preferable to exclude them from estimates of actual effect sizes. The problem is that it is difficult to find a principled way to determine which studies should be included or excluded. One solution is to retain only studies with large samples. The problem with this approach is that this often limits a meta-analysis to a small set of studies.

An alternative is to compute the R-Index for different subsets of studies and to base conclusions on the largest unbiased set of studies. For the 42 studies of nicotine replacement therapy, the following effect size estimates were obtained (effect sizes are d-values, d = t * se).

[Table: R-Index and weighted effect sizes for subsets of nicotine-patch studies defined by sample size]
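A minimal Python sketch of this subset approach, using hypothetical t-values and sampling errors rather than the actual Table A1 data:

```python
# Sketch: compute d = t * se for each study, then a sample-size weighted
# average within subsets defined by a minimum N.
import numpy as np

t_vals = np.array([1.5, 2.2, 2.6, 1.9])      # hypothetical t-values
se     = np.array([0.45, 0.30, 0.18, 0.15])  # hypothetical sampling errors of d
n      = (2 / se) ** 2                       # recovered sample sizes (se = 2/sqrt(N))
d      = t_vals * se                         # effect sizes, d = t * se as defined above

for min_n in (0, 80):                        # all studies vs. studies with N >= 80
    keep = n >= min_n
    d_w = np.average(d[keep], weights=n[keep])
    print(f"N >= {min_n:>2}: weighted d = {d_w:.2f} ({keep.sum()} studies)")
```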

The results show the highest R-Index for studies with more than 80 participants. For these studies, observed power is 83% and the percentage of significant results is also 83%, suggesting that this set of studies is an unbiased sample of studies. The weighted average effect size for this set of studies is d = .44. The results also show that the weighted average effect size does not change much as a function of the selection of studies. When all studies are included, there is evidence of bias (8% inflation) and the weighted average effect size is inflated, but the amount of inflation is small (d = .56 vs. d = .44, difference d = .12).

The small amount of bias appears to be inconsistent with Stanley and Doucouliagos’s (2013) estimate that an uncorrected meta-analysis overestimates the true effect size by over 300% (93% vs. 22% RR). I therefore also examined the log(RR) values in Table A1.

The simple average of the log(RR) values is .68 (the mean reported in the article is .66); the median is .53 and the weighted average is .49. The regression-corrected estimate reported by Stanley and Doucouliagos (2013) is .31. The weighted mean for studies with more than 80 participants is .43. It is now clear why Stanley and Doucouliagos (2013) reported a large effect of the bias correction. First, they used the simple mean as the comparison standard (.68 vs. .31). The effect would be smaller if they had used the weighted mean as the comparison standard (.49 vs. .31). Another factor is that the regression procedure produces a lower estimate than the R-Index approach (.31 vs. .43). More research is needed to compare these approaches, but the R-Index has a simple logic: when there is no evidence of bias, the weighted average provides a reasonable estimate of the true effect size.
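The percentages used here and in the conclusion follow from a simple conversion of mean log risk ratios, exp(log RR) - 1; a minimal sketch:

```python
# Sketch: convert a mean log risk ratio into the corresponding percentage
# increase in quitting rates.
import math

for log_rr in (.66, .49, .43):   # reported simple mean, weighted mean, large-N subset
    print(f"log(RR) = {log_rr:.2f} -> {math.exp(log_rr) - 1:.0%} higher quit rate")
# .66 -> ~93%, .49 -> ~63%, .43 -> ~54%, matching the percentages used in the text
```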

Conclusion

Stanley and Doucouliagos (2013) used regression of effect sizes on sampling error to reveal biases and to obtain an unbiased estimate of the typical effect size in a set of studies. This approach provides a useful tool in the fight against biased reporting of research results. One limitation of this approach is that other factors can produce a correlation between sampling error and effect size. The R-Index can be used to examine how much reporting biases contribute to this correlation. The R-Index can also be used to obtain an unbiased estimate of effect size by computing a weighted average for a select set of studies with a high R-Index.

A meta-analysis of 42 studies of nicotine replacement therapy illustrates this approach. The R-Index for the full set of studies was low (27%). This reveals that many studies had low power to demonstrate an effect. These studies provide little information about effectiveness because their non-significant results are just as likely to be type-II errors as demonstrations of low effectiveness.

The R-Index increased when studies with larger samples were selected. The maximum R-Index was obtained for studies with at least 80 participants. In this case, observed power was above 80% and there was no evidence of bias. The weighted average effect size for this set of studies was only slightly lower than the weighted average effect size for all studies (log(RR) = .43 vs. .49, RR = 54% vs. 63%, respectively). This finding suggests that smokers who use a nicotine patch are about 50% more likely to quit smoking than smokers without a nicotine patch.

The estimate of 50% risk reduction challenges Stanley and Doucouliagos’s (2013) preferred estimate that bias correction “reduces the efficacy of the patch to only 22%.” The R-Index suggests that this bias-corrected estimate is itself biased.

Another important conclusion is that studies with low power are wasteful and uninformative. They generate a lot of noise, they are likely to be systematically biased, and they contribute little to a meta-analysis that weights studies by sample size. The best estimate of effect size was based on only 6 out of 42 studies. Researchers should not conduct, and editors should not publish, studies with low power.