All posts by Dr. R

About Dr. R

Since Cohen (1962) published his famous article on statistical power in psychological journals, statistical power has not increased. The R-Index makes it possible to distinguish studies with high power (good science) from studies with low power (bad science). Protect yourself from bad science and check the R-Index before you believe statistical results.

Examining the Replicability of 66,212 Published Results in Social Psychology: A Post-Hoc-Power Analysis Informed by the Actual Success Rate in the OSF-Reproducibility Project

The OSF-Reproducibility-Project examined the replicability of 99 statistical results published in three psychology journals. The journals covered mostly research in cognitive psychology and social psychology. An article in Science reported that only 35% of the results were successfully replicated (i.e., produced a statistically significant result in the replication study).

I have conducted more detailed analyses of replication studies in social psychology and cognitive psychology. Cognitive psychology had a notably higher success rate (50%, 19 out of 38) than social psychology (8%, 3 out of 38). The main reason for this discrepancy is that social psychologists and cognitive psychologists use different designs. Whereas cognitive psychologists typically use within-subject designs with many repeated measurements of the same individual, social psychologists typically assign participants to different groups and compare behavior on a single measure. This so-called between-subject design makes it difficult to detect small experimental effects because it does not control the influence of other factors that influence participants’ behavior (e.g., personality dispositions, mood, etc.). To detect small effects in these noisy data, between-subject designs require large sample sizes.

It has been known for a long time that sample sizes in between-subject designs in psychology are too small to have a reasonable chance to detect an effect (less than a 50% chance to find an effect that is actually there) (Cohen, 1962; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989). As a result, many studies fail to find statistically significant results, but these studies are not submitted for publication. Thus, only studies that achieved statistical significance with the help of chance (the difference between two groups is inflated by uncontrolled factors such as personality) are reported in journals. The selective reporting of lucky results creates a bias in the published literature that gives a false impression of the replicability of published results. The OSF results for social psychology make it possible to estimate the consequences of publication bias on the replicability of results published in social psychology journals.

A naïve estimate of the replicability of studies would rely on the actual success rate in journals. If journals published both significant and non-significant results, this would be a reasonable approach. However, journals tend to publish exclusively significant results. As a result, the success rate in journals (over 90% significant results; Sterling, 1959; Sterling et al., 1995) gives a drastically inflated estimate of replicability.

A somewhat better estimate of replicability can be obtained by computing post-hoc power based on the observed effect sizes and sample sizes of published studies. Statistical power is the long-run probability that a series of exact replication studies with the same sample size would produce significant results. Cohen (1962) estimated that the typical power of psychological studies is about 60%. Thus, even for 100 studies that all reported significant results, only 60 are expected to produce a significant result again in the replication attempt.

The problem with Cohen’s (1962) estimate of replicability is that post-hoc-power analysis uses the reported effect sizes as an estimate of the effect size in the population. However, due to the selection bias in journals, the reported effect sizes and power estimates are inflated. In collaboration with Jerry Brunner, I have developed an improved method to estimate typical power of reported results that corrects for the inflation in reported effect sizes. I applied this method to results from 38 social psychology articles included in the OSF-reproducibility project and obtained a replicability estimate of 35%.

The OSF-reproducibility project provides another opportunity to estimate the replicability of results in social psychology. The OSF project selected a representative set of studies from two journals and tried to reproduce the same experimental conditions as closely as possible. This should produce unbiased results, and the success rate provides an estimate of replicability. The advantage of this method is that it does not rely on statistical assumptions. The disadvantage is that the success rate depends on the ability to exactly recreate the conditions of the original studies. Any differences between studies (e.g., recruiting participants from different populations) can change the success rate. The OSF replication studies also often changed the sample size of the replication study, which will also change the success rate. If sample sizes in a replication study are larger, power increases and the success rate can no longer be used as an estimate of the typical replicability of social psychology. To address this problem, it is possible to apply a statistical adjustment and use the success rate that would have occurred with the original sample sizes. I found that 5 out of 38 results (13%) produced significant results and that, after correcting for the increase in sample size, replicability was only 8% (3 out of 38).

One important question is how representative the 38 results from the OSF project are for social psychology in general. Unfortunately, it is practically impossible and too expensive to conduct a large number of exact replication studies. In comparison, it is relatively easy to apply post-hoc power analysis to a large number of statistical results reported in social psychology. Thus, I examined the representativeness of the OSF-reproducibility results by comparing the results of my post-hoc power analysis based on the 38 results in the OSF to a post-hoc power analysis of a much larger number of results reported in major social psychology journals.

I downloaded articles from 12 social psychology journals, which are the primary outlets for publishing experimental social psychology research: Basic and Applied Social Psychology, British Journal of Social Psychology, European Journal of Social Psychology, Journal of Experimental Social Psychology, Journal of Personality and Social Psychology: Attitudes and Social Cognition, Journal of Personality and Social Psychology: Interpersonal Relationships and Group Processes, Journal of Social and Personal Relationships, Personal Relationships, Personality and Social Psychology Bulletin, Social Cognition, Social Psychology and Personality Science, Social Psychology.

I converted pdf files into text files and searched for all reports of t-tests or F-tests, and converted the reported test statistics into exact two-tailed p-values. The two-tailed p-values were then converted into z-scores by finding the z-score corresponding to the probability of 1 – p/2, with p equal to the two-tailed p-value. The total number of z-scores included in the analysis is 134,929.
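The conversion described above can be sketched in Python with scipy (the function names are my own; the original analysis used a spreadsheet's norm.inverse):

```python
from scipy import stats

def z_from_t(t, df):
    """Convert a t-statistic to the absolute z-score with the same two-tailed p-value."""
    p = 2 * stats.t.sf(abs(t), df)        # exact two-tailed p-value
    return stats.norm.ppf(1 - p / 2)      # z-score corresponding to 1 - p/2

def z_from_f(f, df1, df2):
    """Convert an F-statistic to the absolute z-score with the same p-value."""
    p = stats.f.sf(f, df1, df2)           # the F-test is directionless (single tail)
    return stats.norm.ppf(1 - p / 2)
```

For example, a t-test with t(38) = 2.5 and the equivalent F-test with F(1, 38) = 6.25 map onto the same z-score, because F(1, df) is the square of t(df).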

I limited my estimate of power to z-scores in the range between 2 and 4. Z-scores below 2 are not statistically significant (z = 1.96, p = .05). Sometimes these results are reported as marginal evidence for an effect, sometimes they are reported as evidence that an effect is not present, and sometimes they are reported without an inference about the population effect. It is more important to determine the replicability of results that are reported as statistically significant support for a prediction. Z-scores greater than 4 were excluded because they imply that the test had high statistical power (> 99%). Many of these results replicated successfully in the OSF project. Thus, a simple rule is to assign a success rate of 100% to these findings. The Figure below shows the distribution of z-scores in the range from z = 0 to 6, but the power estimate is applied to z-scores in the range between 2 and 4 (n = 66,212).

PHP-Curve Social Journals

The power estimate based on the post-hoc-power curve for z-scores between 2 and 4 is 46%. It is important to realize that this estimate is based on the 70% of all significant results that fell in this range. As z-scores greater than 4 essentially have a power of 100%, the overall power estimate for all statistical tests that were reported is .46*.70 + .30 = .62. It is also important to keep in mind that this analysis uses all statistical tests that were reported, including manipulation checks (e.g., pleasant pictures were rated as more pleasant than unpleasant pictures). For this reason, the range of z-scores is limited to values between 2 and 4, which is much more likely to reflect a test of a focal hypothesis.

A power estimate of 46% for z-scores between 2 and 4 is higher than the estimate for the 38 studies in the OSF-reproducibility project (35%). This suggests that the estimated replicability based on the OSF results is an underestimation of the true replicability. The discrepancy between predicted and observed replicability in social psychology (8% vs. 35%) and cognitive psychology (50% vs. 75%) suggests that the rate of actual successful replications is about 20 to 30 percentage points lower than the success rate based on statistical prediction. Thus, the present analysis suggests that actual replication attempts of results in social psychology would produce significant results in about a quarter of all attempts (46% – 20% = 26%).

The large sample of test results makes it possible to make more detailed predictions for results with different strength of evidence. To provide estimates of replicability for different levels of evidence, I conducted post-hoc power analysis for intervals of half a standard deviation (z = .5). The power estimates are:

Strength of Evidence (z)     Power

2.0 to 2.5                   33%
2.5 to 3.0                   46%
3.0 to 3.5                   58%
3.5 to 4.0                   72%

IMPLICATIONS FOR PLANNING OF REPLICATION STUDIES

These estimates are important for researchers who are aiming to replicate a published study in social psychology. The reported effect sizes are inflated, and a replication study with the same sample size has a low chance to produce a significant result even if a smaller effect exists. To conduct a properly powered replication study, researchers would have to increase sample sizes. To illustrate, imagine that a study demonstrated a significant difference between two groups with 40 participants (20 in each cell) with a z-score of 2.3 (p = .02, two-tailed). The observed power for this result is 65%, which would suggest that a slightly larger sample of N = 60 is sufficient to achieve 80% power (an 80% chance to get a significant result). However, after correcting for bias, the true power is more likely to be just 33% (see table above), and power for a study with N = 60 would still be only 50%. To achieve 80% power, the replication study would need a sample size of 130 participants. Sample sizes would need to be even larger taking into account that the actual probability of a successful replication is even lower than the probability based on post-hoc power analysis. In the OSF project, only 1 out of 30 studies with an original z-score between 2 and 3 was successfully replicated.
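The sample-size arithmetic in this example can be reproduced under the simplifying assumption that the expected z-score grows with the square root of the total sample size. This is a sketch, not the exact method used above; `noncentrality` and `required_n` are my own names:

```python
from scipy.stats import norm

def noncentrality(power, alpha=0.05):
    """Expected z-score (noncentrality) that yields the given power in a two-sided test."""
    z_crit = norm.ppf(1 - alpha / 2)     # 1.96 for alpha = .05
    return z_crit + norm.ppf(power)      # from power = 1 - Phi(z_crit - ncp)

def required_n(n_orig, true_power, target_power, alpha=0.05):
    """Total N needed to lift true_power to target_power,
    assuming the expected z-score scales with sqrt(N)."""
    ratio = noncentrality(target_power, alpha) / noncentrality(true_power, alpha)
    return n_orig * ratio ** 2
```

With a true power of 33% at N = 40, this scaling yields roughly 135 participants for 80% power, in line with the figure of about 130 given above.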

IMPLICATIONS FOR THE EVALUATION OF PUBLISHED RESULTS

The results also have implications for the way social psychologists should conduct and evaluate new research. The main reason why z-scores between 2 and 3 provide untrustworthy evidence for an effect is that they are obtained with underpowered studies and publication bias. As a result, it is likely that the strength of evidence is inflated. If, however, the same z-scores were obtained in studies with high power, a z-score of 2.5 would provide more credible evidence for an effect. The strength of evidence in a single study would still be subject to random sampling error, but it would no longer be subject to systematic bias. Therefore, the evidence would be more likely to reveal a true effect and less likely to be a false positive. This implies that z-scores should be interpreted in the context of other information about the likelihood of selection bias. For example, a z-score of 2.5 in a pre-registered study provides stronger evidence for an effect than the same z-score in a study where researchers may have had a chance to conduct multiple studies and to select the most favorable results for publication.

The same logic can also be applied to journals and labs. A z-score of 2.5 in a journal with an average z-score of 2.3 is less trustworthy than a z-score of 2.5 in a journal with an average z-score of 3.5. In the former journal, a z-score of 2.5 is likely to be inflated, whereas in the latter journal a z-score of 2.5 is more likely to be negatively biased by sampling error. For example, currently a z-score of 2.5 is more likely to reveal a true effect if it is published in a cognitive journal than a social journal (see ranking of psychology journals).

The same logic applies even more strongly to labs because labs have a distinct research culture (modus operandi). Some labs conduct many underpowered studies and publish only the studies that worked. Other labs may conduct fewer studies with high power. A z-score of 2.5 is more trustworthy if it comes from a lab with high average power than from a lab with low average power. Thus, providing information about the post-hoc power of individual researchers can help readers to evaluate the strength of evidence of individual studies in the context of the typical strength of evidence that is obtained in a specific lab. This will create an incentive to publish results with strong evidence rather than fishing for significant results, because a low replicability index increases the criterion at which results from a lab provide evidence for an effect.


The Replicability of Cognitive Psychology in the OSF-Reproducibility-Project

The OSF-Reproducibility Project (Psychology) aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.

An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).

In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. Consequently, between-subject designs require large samples to study small effects or can only be used to study large effects.

One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychologists were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject designs by social psychologists to separate these factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).

Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project across all areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. This post examines the replicability of results in cognitive psychology. The results for social psychology are posted here.

The master data file of the OSF-reproducibility project contained 167 studies with replication results for 99 studies. 42 replications were classified as cognitive studies. I excluded Reynolds and Bresner because the original finding was not significant. I excluded C Janiszewski, D Uy (doi:10.1111/j.1467-9280.2008.02057.x) because it examined the anchor effect, which I consider to be social psychology. Finally, I excluded two studies with children as participants because this research falls into developmental psychology (E Nurmsoo, P Bloom; V Lobue, JS DeLoache).

I first conducted a post-hoc-power analysis of the reported original results. Test statistics were first converted into two-tailed p-values, and two-tailed p-values were converted into absolute z-scores using the formula z = norm.inverse(1 – p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).

Estimated power was 75%. This finding reveals the presence of publication bias because the actual success rate of 100% is too high given the power of the studies. Based on this estimate, one would expect that only 75% of the 38 findings (k = 29) would produce a significant result in a set of 38 exact replication studies with the same design and sample size.

PHP-Curve OSF-REP Cognitive Original Data

The Figure visualizes the discrepancy between observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated and suggests a file-drawer of missing studies with non-significant results. However, the mode of the curve (its highest point) is projected to be on the right side of the significance criterion (z = 1.96, p = .05, two-tailed), which suggests that more than 50% of results should replicate. Given the absence of reliable data in the range from 0 to 1.96, the data make it impossible to estimate the exact distribution in this region, but the gentle decline of z-scores on the right side of the significance criterion suggests that the file-drawer is relatively small.

Sample sizes of the replication studies were based on power analysis with the reported effect sizes. The problem with this approach is that the reported effect sizes are inflated and provide an inflated estimate of true power. With a true power estimate of 75%, the inflated power estimates were above 80% and often over 90%. As a result, many replication studies used the same sample size, and some even used a smaller sample size because the original study appeared to be overpowered (the sample size was much larger than needed). The median sample size was N = 32 for both the original and the replication studies. Changes in sample sizes make it difficult to compare the replication rate of the original studies with those of the replication studies. Therefore, I adjusted the z-scores of the replication studies to match the z-scores that would have been obtained with the original sample sizes. Based on the post-hoc-power analysis above, I predicted that 75% of the replication studies would produce a significant result (k = 29). I also had posted predictions for individual studies based on a more comprehensive assessment of each article. The success rate for my a priori predictions was 69% (k = 27).
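The sample-size adjustment can be sketched as follows, again under the assumption that the expected z-score is proportional to the square root of N (the function name is my own):

```python
from math import sqrt
from scipy.stats import norm

def adjusted_z(p_rep, n_rep, n_orig):
    """Rescale a replication result to the z-score expected with the original N."""
    z_rep = norm.ppf(1 - p_rep / 2)      # z-score of the replication study
    return z_rep * sqrt(n_orig / n_rep)  # shrink or grow with the sqrt of the N-ratio
```

A just-significant replication (p = .05) obtained with the original sample size is left unchanged; the same p-value obtained with twice the original N is adjusted to a z-score below the significance criterion.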

The actual replication rate based on adjusted z-scores was 58% (k = 22), although 3 studies produced p-values between .05 and .06 after the adjustment was applied. If these studies are not counted, the success rate would have been 50% (19/38). This finding suggests that post-hoc power analysis overestimates true power by 10% to 25%. However, it is also possible that some of the replication studies failed to reproduce the exact experimental conditions of the original studies, which would lower the probability of obtaining a significant result. Moreover, the number of studies is very small, and the discrepancy may simply be due to random sampling error. The important result is that post-hoc power curves correctly predict that the success rate in a replication study will be lower than the actual success rate because they correct for the effect of publication bias. They also correctly predicted that a substantial number of studies would be successfully replicated, which they were. In comparison, post-hoc power analysis of social psychology predicted 35% successful replications and only 8% successfully replicated. Thus, post-hoc power analysis correctly predicts that results in cognitive psychology are more replicable than results in social psychology.

The next figure shows the post-hoc-power curve for the sample-size corrected z-scores of the replication studies.

PHP-Curve OSF-REP Cognitive Adj. Rep. Data

The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 53% for the heterogeneous model, which fits the data better than a homogeneous model. The shape of the distribution suggests that several of the non-significant results are type-II errors; that is, the studies had insufficient statistical power to demonstrate a real effect.

I also conducted a power analysis that was limited to the non-significant results. The estimated average power was 22%. This power is a mixture of true power in different studies and may contain some cases of true false positives (power = .05), but the existing data are insufficient to determine whether results are true false positives or whether a small effect is present and sample sizes were too small to detect it. Again, it is noteworthy that the same analysis for social psychology produced an estimate of 5%, which suggests that most of the non-significant results in social psychology are true false positives (the null-effect is true).

Below I discuss my predictions of individual studies.

Eight studies reported an effect with a z-score greater than 4 (4 sigma), and I predicted that all of the 4-sigma effects would replicate. 7 out of 8 effects were successfully replicated (D Ganor-Stern, J Tzelgov; JI Campbell, ND Robert; M Bassok, SF Pedigo, AT Oskarsson; PA White; E Vul, H Pashler; E Vul, M Nieuwenstein, N Kanwisher; J Winawer, AC Huk, L Boroditsky). The only exception was CP Beaman, I Neath, AM Surprenant (DOI: 10.1037/0278-7393.34.1.219). It is noteworthy that the sample size of the original study was N = 99 and the sample size of the replication study was N = 14. Even with an adjusted z-score the study produced a non-significant result (p = .19). However, small samples produce less reliable results and it would be interesting to examine whether the result would become significant with an actual sample of 99 participants.

Based on a more detailed analysis of individual articles, I predicted that an additional 19 studies would replicate. However, 9 out of these 19 studies were not successfully replicated. Thus, my predictions of additional successful replications were just at chance level, given the overall success rate of 50%.

Based on a more detailed analysis of individual articles, I predicted that 11 studies would not replicate. However, 5 out of these 11 studies were successfully replicated. Thus, my predictions of failed replications were also just at chance level, given the overall success rate of 50%.

In short, my only rule that successfully predicted replicability of individual studies was the 4-sigma rule that predicts that all findings with a z-score greater than 4 will replicate.

In conclusion, a replicability of 50-60% is consistent with Cohen’s (1962) suggestion that typical studies in psychology have 60% power. Post-hoc power analysis slightly overestimated the replicability of published findings despite its ability to correct for publication bias. Future research needs to examine the sources that lead to a discrepancy between predicted and realized success rate. It is possible that some of this discrepancy is due to moderating factors. Although a replicability of 50-60% is not as catastrophic as the results for social psychology with estimates in the range from 8-35%, cognitive psychologists should aim to increase the replicability of published results. Given the widespread use of powerful within-subject designs, this is easily achieved by a modest increase in sample sizes from currently 30 participants to 50 participants, which would increase power from 60% to 80%.
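The closing claim about sample sizes can be checked with the same sqrt(N) scaling assumption used throughout: a rough sketch, not an exact within-subject power analysis (the function name is my own):

```python
from math import sqrt
from scipy.stats import norm

def power_at_new_n(power_now, n_now, n_new, alpha=0.05):
    """Approximate power after changing N, assuming the expected z scales with sqrt(N)."""
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = z_crit + norm.ppf(power_now)             # expected z at the current sample size
    return norm.cdf(ncp * sqrt(n_new / n_now) - z_crit)
```

Starting from 60% power at N = 30, raising N to 50 gives roughly 80% power, consistent with the recommendation above.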

The Replicability of Social Psychology in the OSF-Reproducibility Project

Abstract: I predicted the replicability of 38 social psychology results in the OSF-Reproducibility Project. Based on post-hoc-power analysis, I predicted a success rate of 35%. The actual success rate was 8% (3 out of 38), and post-hoc power was estimated to be 3% for 36 out of 38 studies (5% power corresponds to the type-I error rate, meaning the null-hypothesis is true).

The OSF-Reproducibility Project aimed to replicate 100 results published in original research articles in three psychology journals in 2008. The selected journals focus on publishing results from experimental psychology. The main paradigm of experimental psychology is to recruit samples of participants and to study their behaviors in controlled laboratory conditions. The results are then generalized to the typical behavior of the average person.

An important methodological distinction in experimental psychology is the research design. In a within-subject design, participants are exposed to several (a minimum of two) situations and the question of interest is whether responses to one situation differ from behavior in other situations. The advantage of this design is that individuals serve as their own controls and variation due to unobserved causes (mood, personality, etc.) does not influence the results. This design can produce high statistical power to study even small effects. The design is often used by cognitive psychologists because the actual behaviors are often simple behaviors (e.g., pressing a button) that can be repeated many times (e.g., to demonstrate interference in the Stroop paradigm).

In a between-subject design, participants are randomly assigned to different conditions. A mean difference between conditions reveals that the experimental manipulation influenced behavior. The advantage of this design is that behavior is not influenced by previous behaviors in the experiment (carry-over effects). The disadvantage is that many uncontrolled factors (e.g., mood, personality) also influence behavior. As a result, it can be difficult to detect small effects of an experimental manipulation among all of the other variance that is caused by uncontrolled factors. Consequently, between-subject designs require large samples to study small effects or can only be used to study large effects.

One of the main findings of the OSF-Reproducibility Project was that results from within-subject designs used by cognitive psychologists were more likely to replicate than results from between-subject designs used by social psychologists. There were too few between-subject studies by cognitive psychologists or within-subject designs by social psychologists to separate these factors. This result of the OSF-reproducibility project was predicted by PHP-curves of the actual articles as well as PHP-curves of cognitive and social journals (Replicability-Rankings).

Given the reliable difference between disciplines within psychology, it seems problematic to generalize the results of the OSF-reproducibility project to all areas of psychology. The Replicability-Rankings suggest that social psychology has a lower replicability than other areas of psychology. For this reason, I conducted separate analyses for social psychology and for cognitive psychology. Other areas of psychology had too few studies to conduct a meaningful analysis. Thus, the OSF-reproducibility results should not be generalized to all areas of psychology.

The master data file of the OSF-reproducibility project contained 167 studies with replication results for 99 studies. 57 studies were classified as social studies. However, this classification used a broad definition of social psychology that included personality psychology and developmental psychology. It included six articles published in the personality section of the Journal of Personality and Social Psychology. As each section functions essentially like an independent journal, I excluded all studies from this section. The file also contained two independent replications of two experiments (experiment 5 and 7) in Albarracín et al. (2008; DOI: 10.1037/a0012833). As the main sampling strategy was to select the last study of each article, I only included Study 7 in the analysis (Study 5 did not replicate, p = .77). Thus, my selection did not lower the rate of successful replications. There were also two independent replications of the same result in Bressan and Stranieri (2008). Both replications produced non-significant results (p = .63, p = .75). I selected the replication study with the larger sample (N = 318 vs. 259). I also excluded two studies that were not independent replications. Rule and Ambady (2008) examined the correlation between facial features and success of CEOs; the replication study had new raters rate the faces, but used the same faces. Heine, Buchtel, and Norenzayan (2008) examined correlates of conscientiousness across nations, and the replication study examined the same relationship across the same set of nations. I also excluded replications of non-significant results because non-significant results provide ambiguous information and cannot be interpreted as evidence for the null-hypothesis. It is therefore not clear how the results of such a replication study should be interpreted: two underpowered studies could easily produce consistent results that are both type-II errors. For this reason, I excluded Ranganath and Nosek (2008) and Eastwick and Finkel (2008). The final sample consisted of 38 articles.

I first conducted a post-hoc-power analysis of the reported original results. Test statistics were first converted into two-tailed p-values, and two-tailed p-values were converted into absolute z-scores using the formula z = norm.inverse(1 – p/2). Post-hoc power was estimated by fitting the observed z-scores to predicted z-scores with a mixed-power model with three parameters (Brunner & Schimmack, in preparation).

Estimated power was 35%. This finding reflects the typical finding that reported results are a biased sample of studies that produced significant results, whereas non-significant results are not submitted for publication. Based on this estimate, one would expect that only 35% of the 38 findings (k = 13) would produce a significant result in an exact replication study with the same design and sample size.

[Figure: PHP-Curve OSF-REP-Social-Original]

The figure visualizes the discrepancy between the observed z-scores and the success rate in the original studies. Evidently, the distribution is truncated, and the mode of the curve (its highest point) is projected to fall on the left side of the significance criterion (z = 1.96, p = .05, two-tailed). Given the absence of reliable data in the range from 0 to 1.96, it is impossible to estimate the exact distribution in this region, but the steep decline of z-scores on the right side of the significance criterion suggests that many of the significant results achieved significance only with the help of inflated observed effect sizes. As sampling error is random, this inflation cannot be expected to occur again in a replication study.

The replication studies had different sample sizes than the original studies. This makes it difficult to compare the prediction to the actual success rate because the actual success rate could be much higher if the replication studies had much larger samples and more power to replicate effects. For example, if all replication studies had sample sizes of N = 1,000, we would expect a much higher replication rate than 35%. The median sample size of the original studies was N = 86. This is representative of studies in social psychology. The median sample size of the replication studies was N = 120. Given this increase in power, the predicted success rate would increase to 50%. However, the increase in power was not uniform across studies. Therefore, I used the p-values and sample size of the replication study to compute the z-score that would have been obtained with the original sample size and I used these results to compare the predicted success rate to the actual success rate in the OSF-reproducibility project.
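My reading of this adjustment is that z-scores grow roughly in proportion to the square root of the sample size, so a replication z-score can be rescaled by sqrt(N_original / N_replication). A minimal sketch under that assumption (the function name is mine):

```python
from math import sqrt
from statistics import NormalDist

def adjusted_z(p_replication: float, n_replication: int, n_original: int) -> float:
    """Rescale a replication z-score to the original sample size.

    z-scores grow roughly with the square root of N, so the z-score
    expected at the original N is z * sqrt(N_original / N_replication).
    """
    z = NormalDist().inv_cdf(1 - p_replication / 2)
    return z * sqrt(n_original / n_replication)

# The Payne et al. replication discussed below: p = .045 with N = 180
# falls below the significance criterion when rescaled to the original N = 70.
print(round(adjusted_z(0.045, 180, 70), 2))  # 1.25, i.e., p ~ .21
```

This reproduces the pattern reported below, where a significant replication result (p = .045) is no longer significant (p = .21) after the sample-size correction.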

The depressing finding was that the actual success rate was much lower than the predicted success rate. Only 3 out of 38 results (8%) produced a significant result (without the correction for sample size, 5 findings would have been significant). Even more depressing is the fact that the 5% significance criterion implies that 1 out of every 20 studies is expected to produce a significant result just by chance. Thus, the actual success rate is close to the success rate that would be expected if all of the original results were false positives. A success rate of 8% implies that the actual power of the replication studies was only 8%, compared to the predicted power of 35%.

The next figure shows the post-hoc-power curve for the sample-size corrected z-scores.

[Figure: PHP-Curve OSF-REP-Social-AdjRep]

The PHP-Curve estimate of power for z-scores in the range from 0 to 4 is 3% for the homogeneous case. This finding means that the distribution of z-scores for 36 of the 38 results is consistent with the null-hypothesis that the true effect size for these effects is zero. Only two z-scores greater than 4 (one shown, the other greater than 6 not shown) appear to be replicable and robust effects.

One replicable finding was obtained in a study by Halevy, Bornstein, and Sagiv. The authors demonstrated that allocation of money to in-group and out-group members is influenced much more by favoring the in-group than by punishing the out-group. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.

The other successful replication was a study by Lemay and Clark (DOI: 10.1037/0022-3514.94.4.647). The replicated finding was that participants projected their own responsiveness in a romantic relationship onto their partners while controlling for the partners' actual responsiveness. Given the strong effect in the original study (z > 4), I had predicted that this finding would replicate.

Based on weak statistical evidence in the original studies, I had predicted failures of replication for 25 studies. Given the low success rate, it is not surprising that my success rate for these predictions was 100%.

I made the wrong prediction for 11 results. In all cases, I predicted a successful replication when the outcome was a failed replication. Thus, my overall success rate was 27/38 = 71%. Unfortunately, this success rate is easily beaten by the simple prediction rule that nothing in social psychology replicates, which is wrong in only 3 out of 38 predictions (35/38 = 92% success rate).

Below I briefly comment on the 11 failed predictions.

1. Based on strong statistics (z > 4), I had predicted a successful replication for Förster, Liberman, and Kuschel (DOI: 10.1037/0022-3514.94.4.579). However, even when I made this prediction based on the reported statistics, I had my doubts about this study because statisticians had discovered anomalies in Jens Förster's studies that cast doubt on the validity of the reported results. Post-hoc power analysis can correct for publication bias, but it cannot correct for other sources of bias that lead to vastly inflated effect sizes.

2. I predicted a successful replication of BK Payne, MA Burkley, and MB Stokes. The replication study actually produced a significant result, but it was no longer significant after correcting for the larger sample size in the replication study (N = 180 vs. 70, p = .045 vs. .21). Although the p-value in the replication study is not very reassuring, it is possible that this is a real effect. However, the original result was probably still inflated by sampling error to produce a z-score of 2.97.

3. I predicted a successful replication of McCrea (DOI: 10.1037/0022-3514.95.2.274). This prediction was based on a transcription error: whereas the z-score for the target effect was 1.80, I posted a z-score of 3.5. Ironically, the study did successfully replicate with a larger sample size, but the effect was no longer significant after adjusting the result for sample size (N = 61 vs. N = 28). This study demonstrates that marginally significant effects can reveal real effects, but it also shows that larger samples are needed in replication studies to demonstrate this.

4. I predicted a successful replication for EP Lemay and MS Clark (DOI: 10.1037/0022-3514.95.2.420). This prediction was based on a transcription error because EP Lemay and MS Clark had another study in the project. With the correct z-score of the original result (z = 2.27), I would have predicted correctly that the result would not replicate.

5. I predicted a successful replication of Monin, Sawyer, and Marquez (DOI: 10.1037/0022-3514.95.1.76) based on a strong result for the target effect (z = 3.8). The replication study produced a z-score of 1.45 with a sample size that was not much larger than the original study (N = 75 vs. 67).

6. I predicted a successful replication for Shnabel and Nadler (DOI: 10.1037/0022-3514.94.1.116). The replication study increased the sample size by 50% (N = 141 vs. 94), but the effect in the replication study was modest (z = 1.19).

7. I predicted a successful replication for van Dijk, van Kleef, Steinel, and van Beest (DOI: 10.1037/0022-3514.94.4.600). The sample size in the replication study was slightly smaller than in the original study (N = 83 vs. 103), but even with adjustment the effect was close to zero (z = 0.28).

8. I predicted a successful replication of V Purdie-Vaughns, CM Steele, PG Davies, R Ditlmann, and JR Crosby (DOI: 10.1037/0022-3514.94.4.615). The original study had rather strong evidence (z = 3.35). In this case, the replication study had a much larger sample than the original study (N = 1,490 vs. 90) and still did not produce a significant result.

9. I predicted a successful replication of C Farris, TA Treat, RJ Viken, and RM McFall (doi:10.1111/j.1467-9280.2008.02092.x). The replication study had a somewhat smaller sample (N = 144 vs. 280), but even with adjustment for sample size the effect in the replication study was close to zero (z = 0.03).

10. I predicted a successful replication of KD Vohs and JW Schooler (doi:10.1111/j.1467-9280.2008.02045.x). I made this prediction based on generally strong statistics, although the strength of the target effect was below 3 (z = 2.8) and the sample size was small (N = 30). The replication study doubled the sample size (N = 58), but produced weak evidence (z = 1.08). However, even the sample size of the replication study is modest and does not allow strong conclusions about the existence of the effect.

11. I predicted a successful replication of Blankenship and Wegener (DOI: 10.1037/0022-3514.94.2.196). The article reported strong statistics, and the z-score for the target effect was greater than 3 (z = 3.36). The study also had a large sample size (N = 261). The replication study had a similarly large sample size (N = 251), but the effect was much smaller than in the original study (z = 3.36 vs. 0.70).

In some of these failed predictions it is possible that the replication study failed to reproduce the same experimental conditions or that the population of the replication study differs from the population of the original study. However, there are twice as many studies where the failure of replication was predicted based on weak statistical evidence and the presence of publication bias in social psychology journals.

In conclusion, the original articles in this representative sample of social psychology reported a 100% success rate. It is well known that such a success rate can only be achieved with selective reporting of significant results. Even the inflated estimate of median observed power is only 71%, which shows that the 100% success rate is inflated. A power estimate that corrects for inflation suggested that only 35% of results would replicate; the actual success rate was only 8%. While mistakes by the replication experimenters may contribute to the discrepancy between the predicted 35% and the actual 8%, it was predictable from the results of the original studies that the majority of results would not replicate in replication studies with the same sample size as the originals.

This low success rate is not characteristic of other sciences and other disciplines in psychology. As mentioned earlier, the success rate for cognitive psychology is higher and comparisons of psychological journals show that social psychology journals have lower replicability than other journals. Moreover, an analysis of time trends shows that replicability of social psychology journals has been low for decades and some journals even show a negative trend in the past decade.

The low replicability of social psychology has been known for over 50 years, ever since Cohen examined the replicability of results published in the Journal of Abnormal and Social Psychology (now the Journal of Personality and Social Psychology), the flagship journal of social psychology. Cohen estimated a replicability of 60%. Social psychologists would rejoice if the reproducibility project had shown a replication rate of 60%. The depressing result is that the actual replication rate was 8%.

The main implication of this finding is that it is virtually impossible to trust any results that are being published in social psychology journals. Yes, two articles that posted strong statistics (z > 4) replicated, but several results with equally strong statistics did not replicate. Thus, it is reasonable to distrust all results with z-scores below 4 (4 sigma rule), but not all results with z-scores greater than 4 will replicate.

Given the low credibility of original research findings, it will be important to raise the quality of social psychology by increasing statistical power. It will also be important to allow publication of non-significant results to reduce the distortion that is created by a file-drawer filled with failed studies. Finally, it will be important to use stronger methods of bias-correction in meta-analysis because traditional meta-analysis seemed to show strong evidence even for incredible effects like premonition for erotic stimuli (Bem, 2011).

In conclusion, the OSF-project demonstrated convincingly that many published results in social psychology cannot be replicated. If social psychology wants to be taken seriously as a science, it has to change the way data are collected, analyzed, and reported and demonstrate replicability in a new test of reproducibility.

The silver lining is that a replication rate of 8% is likely to be an underestimation and that regression to the mean alone might lead to some improvement in the next evaluation of social psychology.

Which Social Psychology Results Were Successfully Replicated in the OSF-Reproducibility Project? Recommending a 4-Sigma Rule

After several years and many hours of hard work by hundreds of psychologists, the results of the OSF-Reproducibility project are in. The project aimed to replicate a representative set of 100 studies from top journals in social and cognitive psychology. The replication studies aimed to reproduce the original studies as closely as possible, while increasing sample sizes somewhat to reduce the risk of type-II errors (failure to replicate a true effect).

The results have been widely publicized in the media. On average, only 36% of studies were successfully replicated; that is, the replication study reproduced a significant result. More detailed analysis shows that results from cognitive psychology had a higher success rate (50%) than results from social psychology (25%).

This post describes the 9 results from social psychology that were successfully replicated. 6 out of the 9 successfully replicated studies reported highly significant results with a z-score greater than 4 sigma (standard deviations) from 0 (p < .00003). Particle physics uses a 5-sigma rule to avoid false positives and industry has adopted a 6-sigma rule in quality control.

Based on my analysis of the OSF-results, I recommend a 4-sigma rule for textbook writers, journalists, and other consumers of scientific findings in social psychology to avoid dissemination of false information.

List of Studies in Decreasing Order of Strength of Evidence

1. Single Study, Self-Report, Between-Subject Analysis, Extremely large sample (N = 230,047), Highly Significant Result (z > 4 sigma)

CJ Soto, OP John, SD Gosling, J Potter (2008). The developmental psychometrics of big five self-reports: Acquiescence, factor structure, coherence, and differentiation from ages 10 to 20, JPSP-PPID.

This article reported results of a psychometric analysis of self-reports of personality traits in a very large sample (N = 230,047). The replication study used the exact same method with participants from the same population (N = 455,326). Not surprisingly, the results were replicated. Unfortunately, it is not an option to conduct all studies with huge samples like this one.

2.  4 Studies, Self-Report, Large Sample (N = 211), One-Sample Test, Highly Significant Result (z > 4 sigma)

JL Tracy, RW Robins. (2008). The nonverbal expression of pride: Evidence for cross-cultural recognition. JPSP;PPID.

The replication project focused on the main effect in Study 4. The main effect in question was whether raters (N = 211) would accurately recognize non-verbal displays of pride in six pictures that displayed pride. The recognition rates were high (range 70%–87%) and highly significant. A sample of N = 211 is large for a one-sample test that compares a sample mean against a fixed value.

3. Five Studies, Self-Report, Moderate Sample Size (N = 153), Correlation, Highly Significant Result (z > 4 sigma)

EP Lemay, MS Clark (2008). How the head liberates the heart: Projection of communal responsiveness guides relationship promotion. JPSP:IRGP.

Study 5 examined accuracy and biases in perceptions of responsiveness (caring and support for a partner). Participants (N = 153) rated their own responsiveness and how responsive their partner was. Ratings of perceived responsiveness were regressed on self-ratings of responsiveness and targets’ self-ratings of responsiveness. The results revealed a highly significant projection effect; that is, perceptions of responsiveness were predicted by self-ratings of responsiveness. This study produced a highly significant result despite a moderate sample size because the effect size was large.

4. Single Study, Behavior, Moderate Sample (N = 240), Highly Significant Result (z > 4 sigma)

N Halevy, G Bornstein, L Sagiv (2008). In-Group-Love and Out-Group-Hate as Motives for Individual Participation in Intergroup Conflict: A New Game Paradigm, Psychological Science.

This study had a sample size of N = 240. Participants were recruited in groups of six. The experiment had four conditions. The main dependent variable was how a monetary reward was allocated. One manipulation was that some groups had the option to allocate money to the in-group whereas others did not have this option. Naturally, the percentages of allocation to the in-group differed across these conditions. Another manipulation allowed some group-members to communicate whereas in the other condition players had to make decisions on their own. This study produced a highly significant interaction between the two experimental manipulations that was successfully replicated.

5. Single Study, Self-Report, Large Sample (N = 82), Within-Subject Analysis, Highly Significant Result (z > 4 sigma)

M Tamir, C Mitchell, JJ Gross (2008). Hedonic and instrumental motives in anger regulation. Psychological Science.

In this study, 82 participants were asked to imagine being in two types of situations: scenarios with a hypothetical confrontation or scenarios without a confrontation. They also listened to music that was designed to elicit an excited, angry, or neutral mood. Afterwards, participants rated how much they would like to listen to the music they had heard if they were in the hypothetical situation. Each participant listened to all pairings of situation and music, and the data were analyzed within-subject. A highly significant interaction, which was successfully replicated, revealed a preference for angry music in the confrontation scenarios and a dislike of angry music in the scenarios without a confrontation. A sample of 82 participants is large for a within-subject comparison of means across conditions.

6. Single Study, Self-Report, Large Sample (N = 124), One-Sample Test, Highly-Significant Result (z > 4 sigma)

DA Armor, C Massey, AM Sackett (2008). Prescribed optimism: Is it right to be wrong about the future? Psychological Science.

In this study, participants (N = 124) were asked to read 8 vignettes that involved making decisions. Participants were asked to judge whether they would recommend making pessimistic, realistic, or optimistic predictions. The main finding was that the average recommendation was to be optimistic. The effect was highly significant. A sample of N = 124 is very large for a design that compares a sample mean to a fixed value.

7. Four Studies, Self-Report, Small Sample (N = 71), Experiment, Moderate Support (z = 2.97)

BK Payne, MA Burkley, MB Stokes (2008). Why do implicit and explicit attitude tests diverge? The role of structural fit. JPSP:ASC.

In this study, participants worked on the standard Affect Misattribution Paradigm (AMP). In the AMP, two stimuli are presented in brief succession. In this study, the first stimulus was a picture of a European American or African American face. The second stimulus was a picture of a Chinese pictogram. In the standard paradigm, participants are asked to report how much they like the second stimulus (the Chinese pictogram) and to ignore the first stimulus (the Black or White face). The AMP is typically used to measure racial attitudes because racial attitudes can influence responses to the Chinese characters.

In this study, the standard AMP was modified by giving two different sets of instructions. One set was the standard instruction to respond to the Chinese pictograms. The other was to respond directly to the faces. All participants (N = 71) completed both tasks. Participants were randomly assigned to two conditions. One condition made it easier to honestly report prejudice (low social pressure). The other condition emphasized that prejudice is socially undesirable (high social pressure). The results showed a significantly stronger correlation between the two tasks (ratings of Chinese pictograms and faces) in the low social pressure condition than in the high social pressure condition, and this difference was replicated in the replication study.

8. Two Studies, Self-Report, moderate sample (N = 119), Correlation, Weak Support (z = 2.27)

JT Larsen, AR McKibban (2008). Is happiness having what you want, wanting what you have, or both? Psychological Science.

In this study, participants (N = 124) received a list of 62 material items and were asked to check whether they had the item or not (e.g., a cell phone). They then rated how much they wanted each item. Based on these responses, the authors computed measures of (a) how much participants’ wanted what they have and (b) have what they wanted. The main finding was that life-satisfaction was significantly predicted by wanting what one has while controlling for having what one wants.   This finding was also found in Study 1 (N = 124) and successfully replicated in the OSF-project with a larger sample (N = 238).

9. Five Studies, Behavior, Small Sample (N = 28), Main Effect, Very Weak Support (z = 1.80)

SM McCrea (2008). Self-handicapping, excuse making, and counterfactual thinking: Consequences for self-esteem and future motivation. JPSP:ASC.

In this study, all participants (N = 28) first worked on a math task that was very difficult and participants received failure feedback.   Participants were then randomly assigned to two groups. One group was given feedback that they had insufficient practice (self-handicap). The control group was not given an explanation for their failure. All participants then worked again on a second math task. The main effect showed that performance on the second task was better (higher percentage of correct answers) in the control group than in the self-handicap condition. Although this difference was only marginally significant (p < .05, one-tailed) in the original study, it was significant in the replication study with a larger sample (N = 61).

Although the percentage of correct answers showed only a marginally significant effect, the number of attempted answers and the absolute number of correct answers showed significant effects. Thus, this study does not count as a publication of a null-result. Moreover, these results suggest that participants in the control group were more motivated to do well because they worked on more problems and got more correct answers.

REPLICABILITY RANKING OF 26 PSYCHOLOGY JOURNALS

THEORETICAL BACKGROUND

Neyman & Pearson (1933) developed the theory of type-I and type-II errors in statistical hypothesis testing.

A type-I error is defined as the probability of rejecting the null-hypothesis (i.e., the effect size is zero) when the null-hypothesis is true.

A type-II error is defined as the probability of failing to reject the null-hypothesis when the null-hypothesis is false (i.e., there is an effect).

A common application of statistics is to provide empirical evidence for a theoretically predicted relationship between two variables (cause-effect or covariation). The results of an empirical study can produce two outcomes. Either the result is statistically significant or it is not statistically significant. Statistically significant results are interpreted as support for a theoretically predicted effect.

Statistically non-significant results are difficult to interpret because the prediction may be false (the null-hypothesis is true) or a type-II error occurred (the theoretical prediction is correct, but the results fail to provide sufficient evidence for it).

To avoid type-II errors, researchers can design studies that reduce the type-II error probability. The probability of avoiding a type-II error when a predicted effect exists is called power. It could also be called the probability of success because a significant result can be used to provide empirical support for a hypothesis.

Ideally, researchers would want to maximize power to avoid type-II errors. However, powerful studies require more resources. Thus, researchers face a trade-off between the allocation of resources and the probability of obtaining a statistically significant result.

Jacob Cohen dedicated a large portion of his career to help researchers with the task of planning studies that can produce a successful result, if the theoretical prediction is true. He suggested that researchers should plan studies to have 80% power. With 80% power, the type-II error rate is still 20%, which means that 1 out of 5 studies in which a theoretical prediction is true would fail to produce a statistically significant result.

Cohen (1962) examined the typical effect sizes in psychology and found that the typical effect size for the mean difference between two groups (e.g., men and women, or an experimental vs. a control group) is about half of a standard deviation. The standardized effect size measure is called Cohen's d in his honor. Based on his review of the literature, Cohen suggested that an effect size of d = .2 is small, d = .5 moderate, and d = .8 large. Importantly, a statistically small effect size can have huge practical importance, so these labels should not be used to make claims about the practical importance of effects. The main purpose of these labels is to help researchers plan their studies. If researchers expect a large effect (d = .8), they need a relatively small sample to have high power. If researchers expect a small effect (d = .2), they need a large sample to have high power. Cohen (1992) provided information about effect sizes and sample sizes for different statistical tests (chi-square, correlation, ANOVA, etc.).
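As a rough illustration of this trade-off, the per-group sample size needed for a two-group comparison can be approximated with the normal approximation to the two-sample t-test (a sketch; exact t-based values, as in Cohen's tables, come out one or two participants higher):

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-group N for a two-sample comparison of means,
    using the normal approximation: n = 2 * (z_alpha + z_beta)^2 / d^2."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: {n_per_group(d)} per group")
```

For 80% power this gives about 393 participants per group for a small effect (d = .2), 63 for a moderate effect (d = .5), and 25 for a large effect (d = .8), which makes the resource implications of expecting small effects concrete.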

Cohen (1962) conducted a meta-analysis of studies published in a prominent psychology journal. Based on the typical effect size and sample size in these studies, Cohen estimated that the average power in studies is about 60%. Importantly, this also means that the typical power to detect small effects is less than 60%. Thus, many studies in psychology have low power and a high type-II error probability. As a result, one would expect that journals often report that studies failed to support theoretical predictions. However, the success rate in psychological journals is over 90% (Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995). There are two explanations for discrepancies between the reported success rate and the success probability (power) in psychology. One explanation is that researchers conduct multiple studies and only report successful studies. The other studies remain unreported in a proverbial file-drawer (Rosenthal, 1979). The other explanation is that researchers use questionable research practices to produce significant results in a study (John, Loewenstein, & Prelec, 2012). Both practices have undesirable consequences for the credibility and replicability of published results in psychological journals.

A simple solution to the problem would be to increase the statistical power of studies. If the power of psychological studies were over 90%, a success rate of 90% would be justified by the actual probability of obtaining significant results. However, meta-analyses and method articles have repeatedly pointed out that psychologists do not consider statistical power in the planning of their studies and that studies continue to be underpowered (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989).

One reason for the persistent neglect of power could be that researchers have no awareness of the typical power of their studies. This could happen because observed power in a single study is an imperfect indicator of true power (Yuan & Maxwell, 2005). If a study produced a significant result, the observed power is at least 50%, even if the true power is only 30%. Even if the null-hypothesis is true, and researchers publish only type-I errors, observed power is dramatically inflated to 62%, when the true power is only 5% (the type-I error rate). Thus, Cohen’s estimate of 60% power is not very reassuring.
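The inflation of observed power can be illustrated with a short simulation (my own sketch, not the authors' code): draw z-scores under the null-hypothesis, keep only the "significant" ones, and compute the power implied by each published z-score.

```python
import random
from statistics import NormalDist, mean

random.seed(1)
norm = NormalDist()
crit = norm.inv_cdf(0.975)  # z = 1.96 for p < .05, two-tailed

# Simulate 200,000 studies in which the null-hypothesis is true and
# keep only the "significant" results (|z| > 1.96).
published = [z for z in (abs(random.gauss(0.0, 1.0)) for _ in range(200_000))
             if z > crit]

# Observed power implied by each published z-score: 1 - Phi(1.96 - z).
observed_power = mean(1 - norm.cdf(crit - z) for z in published)
print(round(observed_power, 2))
```

Although the true power is only 5% (the type-I error rate), the mean observed power of the "published" studies comes out above 60%, in the ballpark of the 62% figure cited above.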

Over the past years, Schimmack and Brunner have developed a method to estimate power for sets of studies with heterogeneous designs, sample sizes, and effect sizes; a technical report is in preparation. The basic logic of this approach is to convert the results of all statistical tests into z-scores using the one-tailed p-value of each test. The z-scores provide a common metric for observed statistical results. For a fixed value of true power, the observed z-scores follow a standard normal distribution around the corresponding noncentrality value. For heterogeneous sets of studies, however, the distribution of z-scores is a mixture of such normal distributions with different weights attached to various power values. To illustrate this method, the histogram of z-scores below shows simulated data with 100,000 observations and varying levels of true power: 20% true null-hypotheses (5% power), 20% of studies with 33% power, 20% with 50% power, 20% with 66% power, and 20% with 80% power.
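A sketch of such a simulation (my own reconstruction, not the original code): each power level is translated into a noncentrality parameter, and observed z-scores are drawn around it with unit standard deviation.

```python
import random
from statistics import NormalDist

random.seed(42)
norm = NormalDist()
crit = norm.inv_cdf(0.975)  # significance criterion, z = 1.96

def delta_for_power(power: float) -> float:
    """Noncentrality (mean z) that yields the desired power at p < .05
    (two-tailed), ignoring the negligible opposite tail."""
    return crit + norm.inv_cdf(power)

# Five power levels with equal 20% weights, as described above;
# the true null-hypothesis corresponds to delta = 0 and 5% "power".
deltas = [0.0] + [delta_for_power(p) for p in (0.33, 0.50, 0.66, 0.80)]
zs = [abs(random.gauss(d, 1.0)) for d in deltas for _ in range(2_000)]

significant = sum(z > crit for z in zs) / len(zs)
print(round(significant, 2))
```

The overall significance rate lands near the weighted mean of the five power values (about 47%), and the histogram of `zs` reproduces the mixture shape shown in the figure below.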

[Figure: RepRankSimulation]

The plot shows the distribution of absolute z-scores (the sign of an effect is arbitrary). The plot is limited to z-scores below 6 (N = 99,985 out of 100,000). Z-scores more than 6 standard deviations from zero are extremely unlikely to occur by chance. Even with a conservative estimate of effect size (the lower bound of the 95% confidence interval), observed power is well above 99%. Moreover, particle physics uses z = 5 as a criterion to claim a discovery (e.g., the discovery of the Higgs boson). Thus, z-scores above 6 can be expected to reflect highly replicable effects.

Z-scores below 1.96 (the vertical dotted red line) are not significant for the standard criterion of (p < .05, two-tailed). These values are excluded from the calculation of power because these results are either not reported or not interpreted as evidence for an effect. It is still important to realize that true power of all experiments would be lower if these studies were included because many of the non-significant results are produced by studies with 33% power. These non-significant results create two problems. Researchers wasted resources on studies with inconclusive results and readers may be tempted to misinterpret these results as evidence that an effect does not exist (e.g., a drug does not have side effects) when an effect is actually present. In practice, it is difficult to estimate power for non-significant results because the size of the file-drawer is difficult to estimate.

It is possible to estimate power for any range of z-scores, but I prefer the range of z-scores from 2 (just significant) to 4. A z-score of 4 has a 95% confidence interval that ranges from 2 to 6. Thus, even if the observed effect size is inflated, there is still a high chance that a replication study would produce a significant result (Z > 2). Thus, all z-scores greater than 4 can be treated as cases with 100% power. The plot also shows that conclusions are unlikely to change by using a wider range of z-scores because most of the significant results correspond to z-scores between 2 and 4 (89%).
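The logic of treating z-scores above 4 as effectively replicable can be checked directly: if the observed z-score is taken as the true noncentrality, the probability that an exact replication (same design, same N) clears the significance criterion is a simple normal-tail calculation (a sketch; the function name is mine).

```python
from statistics import NormalDist

norm = NormalDist()
crit = norm.inv_cdf(0.975)  # 1.96

def replication_probability(true_z: float) -> float:
    """Probability that an exact replication yields z > 1.96,
    taking the observed z-score as the true noncentrality."""
    return 1 - norm.cdf(crit - true_z)

print(round(replication_probability(4.0), 2))  # 0.98
print(round(replication_probability(2.0), 2))  # 0.52 -- barely better than a coin flip
```

This is why a just-significant result (z = 2) has only about a 50% chance of replicating even without any inflation, whereas z = 4 is close to certain to replicate.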

The typical power of studies is estimated based on the distribution of z-scores between 2 and 4. A steep decrease from left to right suggests low power. A steep increase suggests high power. If the peak (mode) of the distribution were centered over Z = 2.8, the data would conform to Cohen’s recommendation to have 80% power.

Using the known distribution of power to estimate power in the critical range gives a power estimate of 61%. A simpler model that assumes a fixed power value for all studies produces a slightly inflated estimate of 63%. Although the heterogeneous model is correct, the plot shows that the homogeneous model provides a reasonable approximation when estimates are limited to a narrow range of Z-scores. Thus, I used the homogeneous model to estimate the typical power of significant results reported in psychological journals.

DATA

The results presented below are based on an ongoing project that examines power in psychological journals (see the results section for the list of journals included so far). The set of journals does not include journals that primarily publish reviews and meta-analyses, or clinical and applied journals. The data analysis is limited to the years 2009 to 2015 to provide information about the typical power of contemporary research. Results regarding historic trends will be reported in a forthcoming article.

I downloaded pdf files of all articles published in the selected journals and converted them to text files. I then extracted all t-tests and F-tests reported in the text of the results sections by searching for t(df) or F(df1,df2). All t and F statistics were first converted into one-tailed p-values and then into z-scores.

[Figure RepRankAll: histogram of z-scores for all t and F tests in the selected journals, 2009-2015]

The plot above shows the results based on 218,698 t and F tests reported between 2009 and 2015 in the selected psychology journals. Unlike the simulated data, the plot shows a steep drop for z-scores just below the threshold of significance (z = 1.96). This drop is due to the tendency not to publish or report non-significant results. The heterogeneous model uses the distribution of non-significant results to estimate the size of the file-drawer (unpublished non-significant results). However, for the present purpose the size of the file-drawer is irrelevant because power is estimated only for significant results for Z-scores between 2 and 4.

The green line shows the best-fitting estimate for the homogeneous model. The red curve shows the fit of the heterogeneous model. The heterogeneous model does a much better job of fitting the long tail of highly significant results, but for the critical interval of z-scores between 2 and 4, the two models provide similar estimates of power (55% homogeneous and 53% heterogeneous). If the range is extended to z-scores between 2 and 6, the power estimates diverge (82% homogeneous, 61% heterogeneous). The plot indicates that the heterogeneous model fits the data better and that the 61% estimate is the better estimate of true power for significant results in this range. Thus, the results are in line with Cohen’s (1962) estimate that psychological studies average 60% power.

REPLICABILITY RANKING

The distribution of z-scores between 2 and 4 was used to estimate the average power separately for each journal. As power is the probability of obtaining a significant result, this measure estimates the replicability of results published in a particular journal if researchers were to reproduce the studies under identical conditions with the same sample size (exact replication). Thus, even though the selection criterion ensured that all tests produced a significant result (100% success rate), the replication rate is expected to be only about 50%, even if the replication studies successfully reproduce the conditions of the published studies. The table below shows the replicability ranking of the journals, the replicability score, and a grade. Journals are graded on a scheme similar to grading schemes for undergraduate students (below 50 = F, 50-59 = E, 60-69 = D, 70-79 = C, 80-89 = B, 90+ = A).

[Figure ReplicabilityRanking: table ranking the journals by replicability score and grade]

The average value for 2010-2014 is 57 (D+). The average value for 2015 is 58 (D+). The correlation between the values for 2010-2014 and those for 2015 is r = .66. These findings show that the replicability scores are reliable and that journals differ systematically in the power of published studies.

LIMITATIONS

The main limitation of the method is that it focuses on t and F tests. The results might change when other statistics are included in the analysis. The next goal is to incorporate correlations and regression coefficients.

The second limitation is that the analysis does not discriminate between primary hypothesis tests and secondary analyses. For example, an article may find a significant main effect of gender, but the critical test is whether gender interacts with an experimental manipulation. It is possible that some journals have lower scores because they report more secondary analyses with lower power. To address this issue, it will be necessary to code articles in terms of the importance of each statistical test.

The ranking for 2015 is based on the currently available data and may change when more data become available. Readers should also avoid interpreting small differences in replicability scores as these scores are likely to fluctuate. However, the strong correlation over time suggests that there are meaningful differences in the replicability and credibility of published results across journals.

CONCLUSION

This article provides objective information about the replicability of published findings in psychology journals. None of the journals reaches Cohen’s recommended level of 80% replicability. Average replicability is just about 50%. This finding is largely consistent with Cohen’s analysis of power over 50 years ago. The publication of the first replicability analysis by journal should give editors an incentive to increase the reputation of their journal by paying more attention to the quality of the published data. In this regard, it is noteworthy that replicability scores diverge from traditional indicators of journal prestige such as impact factors. Ideally, the impact of an empirical article would be aligned with the replicability of its results. Thus, the replicability index may also help researchers to base their own work on credible results published in journals with a high replicability score and to avoid incredible results published in journals with a low replicability score. Ultimately, I can only hope that journals will start competing with each other for a top spot in the replicability rankings and, as a by-product, increase the replicability of published findings and the credibility of psychological science.

Using the R-index to detect questionable research practices in SSRI studies

Amna Shakil and Ulrich Schimmack

Turner and colleagues (2008) examined the presence of publication bias in clinical trials of antidepressants. They found that 51% of the 74 FDA-registered studies showed positive results. However, positive results were much more likely to be published: 94% of the published results were positive. There were two reasons for the inflated percentage of positive results. First, negative results were often not published. Second, some negative results were published as positive results. Turner and colleagues’ (2008) findings received a lot of attention and cast doubt on the effectiveness of antidepressants.

A year after Turner and colleagues (2008) published their study, Moreno et al. (2009) examined the influence of publication bias on effect-size estimates in clinical trials of antidepressants. They found no evidence of publication bias in the FDA-registered trials, leading them to conclude that the FDA data provide an unbiased gold standard against which biases in the published literature can be examined.

The effect size for treatment with antidepressants in the FDA data was g = 0.31, 95% confidence interval 0.27 to 0.35. In contrast, the uncorrected average effect size in the published studies was g = 0.41, 95% confidence interval 0.37 to 0.45. This finding shows that publication bias inflated the effect size estimate by 32% ((0.41 - 0.31)/0.31).
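As a quick check of that arithmetic in code:

```python
fda_g = 0.31        # gold-standard effect size (FDA trials)
published_g = 0.41  # uncorrected published effect size

# Relative inflation of the published estimate over the gold standard.
inflation = (published_g - fda_g) / fda_g
round(inflation * 100)  # 32 (percent)
```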

Moreno et al. (2009) also used regression analysis to obtain a corrected effect size estimate based on the biased effect sizes in the published literature. In this method, effect sizes are regressed on sampling error under the assumption that studies with smaller samples (and larger sampling error) have more bias. The intercept is used as an estimate of the population effect size when sampling error is zero. This correction method yielded an effect size estimate of g = 0.29, 95% confidence interval 0.23 to 0.35, which is similar to the gold standard estimate (.31).

The main limitation of the regression method is that other factors can produce a correlation between sample size and effect size (e.g., higher quality studies are more costly and use smaller samples). To avoid this problem, we used an alternative correction method that does not make this assumption.

The method uses the R-Index to examine bias in a published data set. The R-Index increases as statistical power increases and it decreases when publication bias is present. To obtain an unbiased effect size estimate, studies are selected to maximize the R-Index.

Since the actual data files were not available, graphs A and B from Moreno et al.’s (2009) study were used to obtain information about the effect sizes and sampling errors of all the FDA-registered studies and the published journal articles.

The FDA-registered studies had a success rate of 53% and an observed power of 56%, resulting in an inflation of close to 0. The close match between the success rate and observed power confirms that the FDA studies are not biased. Given the lack of bias (inflation), the most accurate estimate of the effect size is obtained by using all studies.

The published journal articles had a success rate of 86% and an observed power of 73%, resulting in an inflation rate of 12%. The inflation rate of 12% confirms that the published data set is biased. The R-Index subtracts the inflation rate from observed power to correct for inflation. Thus, the R-Index for the published studies is 73 - 12 = 61. The weighted effect size estimate was d = .40.
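The R-Index arithmetic in the last two paragraphs can be written out as a small function. The max(0, …) floor on inflation is my assumption for handling cases, like the FDA set, where observed power exceeds the success rate; note also that with the rounded inputs below the published-studies result is 60 rather than the 61 reported above, which was presumably computed from unrounded values.

```python
def r_index(success_rate, observed_power):
    """R-Index: observed power minus the inflation rate."""
    inflation = max(0.0, success_rate - observed_power)  # floor at 0 (assumption)
    return observed_power - inflation

# FDA trials: 53% success, 56% observed power -> no inflation.
round(r_index(0.53, 0.56) * 100)  # 56

# Published trials: 86% success, 73% observed power.
round(r_index(0.86, 0.73) * 100)  # 60 with rounded inputs
```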

The next step was to select sets of studies that maximize the R-Index. As most studies were significant, the success rate could not change much. As a result, most of the increase had to come from selecting studies with larger sample sizes in order to increase power. The maximum R-Index was obtained for a cut-off point of N = 225. This left 14 studies with a total sample size of 4,170 participants. The success rate was 100% with a median observed power of 85%. The inflation was still 15%, but the R-Index was higher than for the full set of studies (70 vs. 61). The weighted average effect size in the selected set of powerful studies was d = .34. This result is very similar to the gold standard in the FDA data. The small discrepancy can be attributed to the fact that even studies with 85% power retain a small bias in the estimation of the true effect size.

In conclusion, our alternative effect size estimation procedure confirms Moreno et al.’s (2009) results using an alternative bias-correction method and shows that the R-Index can be a valuable tool to detect and correct for publication bias in other meta-analyses.

These results have important practical implications. The R-Index confirms that published clinical trials are biased and can provide false information about the effectiveness of drugs. It is therefore important to ensure that clinical trials are preregistered and that all results of clinical trials are published. The R-Index can be used to detect violations of these practices that lead to biased evidence. Another important finding is that clinical trials of antidepressants do show effectiveness and that antidepressants can be used as effective treatments of depression. The presence of publication bias should not be used to claim that antidepressants lack effectiveness.

References

Moreno, S. G., Sutton, A. J., Turner, E. H., Abrams, K. R., Cooper, N. J., Palmer, T. M., & Ades, A. E. (2009). Novel methods to deal with publication biases: secondary analysis of antidepressant trials in the FDA trial registry database and related journal publications. BMJ, 339, b2981.

Turner, E. H., Matthews, A. M., Linardatos, E., Tell, R. A., & Rosenthal, R. (2008). Selective publication of antidepressant trials and its influence on apparent efficacy. New England Journal of Medicine, 358(3), 252-260.