
P-REP (2005-2009): Reexamining the experiment to replace p-values with the probability of replicating an effect

In 2005, Psychological Science published an article titled “An Alternative to Null-Hypothesis Significance Tests” by Peter R. Killeen.  The article proposed to replace p-values and significance testing with a new statistic: the probability of replicating an effect (p-rep).  The article generated a lot of excitement, and from 2006 to 2009 Psychological Science encouraged authors to report p-rep.  After some statistical criticism and after a new editor took over Psychological Science, interest in p-rep declined (see Figure).

It is ironic that only a few years later, psychological science would encounter a replication crisis in which several famous experiments did not replicate.  Despite much discussion about the replicability of psychological science in recent years, Killeen’s attempt to predict replication outcomes has hardly been mentioned.  This blog post reexamines p-rep in the context of the current replication crisis.

The abstract clearly defines p-rep as an estimate of “the probability of replicating an effect” (p. 345), which is the core meaning of replicability. Factories have high replicability (6 sigma) and produce virtually identical products that work with high probability. However, in empirical research it is not so easy to define what it means to get the same result. So, the first step in estimating replicability is to define the result of a study that a replication study aims to replicate.

“Traditionally, replication has been viewed as a second successful attainment of a significant effect” (Killeen, 2005, p. 349). Viewed from this perspective, p-rep would estimate the probability of obtaining a significant result (p < alpha) after observing a significant result in an original study.

Killeen proposes to change the criterion to the sign of the observed effect size. This implies that p-rep can only be applied to directional hypotheses (e.g., it does not apply to tests of explained variance).  The criterion for a successful replication then becomes observing an effect size with the same sign as the original study.

Although this may appear like a radical change from null-hypothesis significance testing, this is not the case.  We can translate the sign criterion into an alpha level of 50% in a one-tailed t-test.  For a one-tailed t-test, negative effect sizes have p-values ranging from 1 to .50 and positive effect sizes have p-values ranging from .50 to 0.  So, a successful outcome is associated with a p-value below .50 (p < .50).

If we observe a positive effect size in the original study, we can compute the power of obtaining a positive result in a replication study with a post-hoc power analysis, where we enter information about the standardized effect size, sample size, and alpha = .50, one-tailed.

Using R syntax this can be achieved with the formula:

pt(obs.es/se, N-2)

with obs.es being the observed standardized effect size (Cohen’s d), N = total sample size, and se = sampling error = 2/sqrt(N).

The similarity to p-rep is apparent when we look at the formula for p-rep.

pnorm(obs.es/se/sqrt(2))

There are two differences. First, p-rep uses the standard normal distribution to estimate power. This is a simplification that ignores the degrees of freedom.  The more accurate formula for power is the non-central t-distribution that takes the degrees of freedom (N-2) into account.  However, even with modest sample sizes of N = 40, this simplification has negligible effects on power estimates.

The second difference is that p-rep reduces the non-centrality parameter (effect size/sampling error) by a factor of square-root 2.  Without going into the complex reasoning behind this adjustment, the end-result of the adjustment is that p-rep will be lower than the standard power estimate.

Using Killeen’s example on page 347 with d = .5 and N = 20, p-rep = .785.  In contrast, the power estimate with alpha = .50 is .861.
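For readers who want to verify these numbers, here is a minimal sketch in R, assuming (as in the text) a two-group design so that the sampling error of d is approximately 2/sqrt(N):

d  <- .5
N  <- 20
se <- 2 / sqrt(N)          # sampling error of the standardized effect size
pnorm(d / se / sqrt(2))    # p-rep, ~ .785
pt(d / se, N - 2)          # power with alpha = .50 (one-tailed), ~ .861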

The comparison of p-rep with standard power analysis brings up an interesting and unexplored question. “Does p-rep really predict the probability of replication?”  (p. 348).  Killeen (2005) uses meta-analyses to answer this question.  In one example, he found that 70% of studies showed a negative relation between heart rate and aggressive behaviors.  The median value of p-rep over those studies was 71%.  Two other examples are provided.

A better way to evaluate estimates of replicability is to conduct simulation studies where the true answer is known.  For example, a simulation study can simulate 1,000,000 exact replications of Killeen’s example with d = .5 and N = 20 and we can observe how many studies show a positive observed effect size.  In a single run of this simulation, roughly 87% of the studies showed a positive sign. Median P-rep (.788) underestimates this actual success rate, whereas median observed power (.861) more closely predicts the observed success rate.
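A minimal version of this simulation can be sketched in R; it approximates the sampling distribution of the observed effect size as normal with standard error 2/sqrt(N), so the exact counts will differ slightly from run to run:

set.seed(123)                              # arbitrary seed for reproducibility
n.sim <- 1e6
d     <- .5
N     <- 20
se    <- 2 / sqrt(N)
obs.d <- rnorm(n.sim, mean = d, sd = se)   # observed effect sizes in exact replications
mean(obs.d > 0)                            # share of positive signs, ~ .87
median(pnorm(obs.d / se / sqrt(2)))        # median p-rep, ~ .79
median(pt(obs.d / se, N - 2))              # median observed power (alpha = .50), ~ .86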

This is not surprising.  Power analysis is designed to predict the long-term success rate given a population effect size, a criterion value, and sampling error.  The adjustment made by Killeen is unnecessary and leads to the wrong prediction.

P-rep applied to Single Studies

It is also peculiar to use meta-analyses to test the performance of p-rep because a meta-analysis implies that many studies have been conducted, whereas the goal of p-rep was to predict the outcome of a single replication study from the outcome of an original study.

This primary aim also explains the adjustment to the non-centrality parameter, which was based on the idea of adding the sampling variances of the original and replication study.  Finally, Killeen clearly states that the goal of p-rep is to ignore population effect sizes and to define replicability as “an effect of the same sign as that found in the original experiment” (p. 346).  This is very different from power analysis, which estimates the probability of obtaining an effect of the same sign as the population effect size.

We can evaluate p-rep as a predictor of obtaining effect sizes with the same direction in two studies with another simulation study.  Assume that the effect size is d = .20 and the total sample size is also small (N = 20).  The median p-rep estimate is 62%.

The 2 x 2 table shows how often the effect sizes of the original study and the replication study match.

                     Replication: Negative   Replication: Positive
Original: Negative   11%                     22%
Original: Positive   22%                     45%

The table shows that in only 45% of cases do the original and the replication study both show a sign that also matches the population effect size. Another 11% of matches occur when the original and the replication study both show the wrong sign; in these cases, future replication studies are more likely to show an effect in the opposite direction.  Although these cases meet the definition of replicability with the sign of the original study as criterion, it seems questionable to count a pair of studies that both show the wrong result as a successful replication.  Furthermore, the median p-rep estimate of 62% is inconsistent with the correctly matched cases (45%) as well as with the total number of matched cases (45% + 11% = 56%).  In conclusion, it is neither sensible to define replicability as consistency between pairs of exact replication studies, nor does p-rep estimate this probability very well.
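The entries of the table and the median p-rep estimate can be approximated with a simple simulation of study pairs (using the same normal approximation for observed effect sizes as above):

set.seed(123)
n.sim <- 1e6
d     <- .20
N     <- 20
se    <- 2 / sqrt(N)
d1    <- rnorm(n.sim, d, se)                  # original studies
d2    <- rnorm(n.sim, d, se)                  # independent exact replications
median(pnorm(d1 / se / sqrt(2)))              # median p-rep, ~ .62
table(original = d1 > 0, replication = d2 > 0) / n.sim
# ~ 11% both negative, ~ 22% in each mismatched cell, ~ 45% both positive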

Can we fix it?

The previous examination of p-rep showed that it is essentially an observed power estimate with alpha = 50% and an attenuated non-centrality parameter.  Does this mean we can fix p-rep and turn it into a meaningful statistic?  In other words, is it meaningful to compute the probability that future replication studies will reveal the direction of the population effect size by computing power with alpha = 50%?

For example, a researcher finds an effect size of d = .4 with a total sample size of N = 100.  Using a standard t-test, the researcher can report the traditional p-value, p = .048.  The 2 x 2 table below shows how often the signs of the original and the replication study match in a simulation of this scenario.

                     Replication: Negative   Replication: Positive
Original: Negative    0%                      2%
Original: Positive    2%                     96%

The simulation results show that most pairs of studies show consistent signs that are also consistent with the population effect size.  Median observed power, the new p-rep, is 98%. So, is a high p-rep value a good indicator that future studies will also produce a positive sign?

The main problem with observed power analysis is that it relies on the observed effect size as an estimate of the population effect size.  However, in small samples, the difference between observed effect sizes and population effect sizes can be large, which leads to very variable estimates of p-rep. One way to alert readers to the variability in replicability estimates is to provide a confidence interval around the estimate.  As p-rep is a function of the observed effect size, this is easily achieved by converting the lower and upper limit of the confidence interval around the effect size into a confidence interval for p-rep.  With d = .4 and N = 100 (sampling error = 2/sqrt(100) = .20), the confidence interval of effect sizes ranges from d = .008 to d = .792.  The corresponding p-rep values are 52% to 100%.
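A sketch of this conversion in R, using the power formulation with alpha = .50 described above:

d    <- .4
N    <- 100
se   <- 2 / sqrt(N)
ci.d <- d + c(-1, 1) * qnorm(.975) * se   # 95% CI for d, ~ .008 to .792
pt(ci.d / se, N - 2)                      # corresponding replicability estimates, ~ .52 to ~ 1.00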

Importantly, a value of 50% is the lower bound for p-rep and corresponds to determining the direction of the effect by a coin toss.  In other words, the point estimate of replicability can be highly misleading because the population effect size may be considerably lower than the observed effect size.  This means that reporting the point-estimate of p-rep can give false assurance about replicability, while the confidence interval shows that there is tremendous uncertainty around this estimate.

Understanding Replication Failures

Killeen (2005) pointed out that it can be difficult to understand replication failures using the traditional criterion of obtaining a significant result in the replication study.  For example, the original study may have reported a significant result with p = .04 and the replication study produced a non-significant p-value of p = .06.  According to the criterion of obtaining a significant result in the replication study, this outcome is a disappointing failure.  Of course, there is no meaningful difference between p = .04 and p = .06. It just so happens that they are on opposite sides of an arbitrary criterion value.

Killeen suggests that we can avoid this problem by reporting p-rep.  However, p-rep just changes the arbitrary criterion value from p = .05 to d = 0.  It is still possible that a replication study will fail because the effect sizes do not match.  Suppose the effect size in the original study was d = .05 and the effect size in the replication study was d = -.05.  In small samples, this is not a meaningful difference in effect sizes, but the outcome constitutes a replication failure.

There is simply no way around making mistakes in inferential statistics.  We can only try to minimize them, at the cost of the resources needed to reduce sampling error.  By setting alpha to 50%, we are reducing type-II errors (failing to support a correct hypothesis) at the expense of increasing the risk of a type-I error (claiming support for a hypothesis that is false), but errors will be made.

P-rep and Publication Bias

Killeen (2005) points out another limitation of p-rep.  “One might, of course, be misled by a value of prep that itself cannot be replicated. This can be caused by publication bias against small or negative effects.” (p. 350).  Here we see the real problem of raising alpha to 50%.  If there is no effect (d = 0), one out of two studies will produce a positive result that can be published.  If 100 researchers test an interesting hypothesis in their lab, but only positive results will be published, approximately 50 articles will support a false conclusion, while 50 other articles that showed the opposite result will be hidden in file drawers.  A stricter alpha criterion is needed to minimize the rate of false inferences, especially when publication bias is present.

A counter-argument could be that researchers who find a negative result can also publish their results, because positive and negative results are equally publishable. However, this would imply that journals are filled with inconsistent results, and research areas with small effects and small samples would publish nearly equal numbers of studies with positive and negative results. Each article would draw a conclusion based on the results of a single study and try to explain inconsistent results with potential moderator variables.  By imposing a stricter criterion for sufficient evidence, published results are more consistent and more likely to reflect a true finding.  This is especially true if studies have sufficient power to reduce the risk of type-II errors and if journals do not selectively report studies with positive results.

Does this mean estimating replicability is a bad idea?

Although Killeen’s (2005) main goal was to predict the outcome of a single replication study, he did explore how well median replicability estimates predicted the outcome of meta-analyses.  As aggregation across studies reduces sampling error, replicability estimates based on sets of studies can be useful to predict actual success rates in studies (Sterling et al., 1995).  The comparison of median observed power with actual success rates can be used to reveal publication bias (Schimmack, 2012), and median observed power is a valid predictor of future study outcomes in the absence of publication bias and for homogeneous sets of studies. More advanced methods even make it possible to estimate replicability when publication bias is present and when the set of studies is heterogeneous (Brunner & Schimmack, 2016).  So, while p-rep has a number of shortcomings, the idea of estimating replicability deserves further attention.

Conclusion

The rise and fall of p-rep in the first decade of the 2000s tells an interesting story about psychological science.  In hindsight, the popularity of p-rep is consistent with a field that focused more on discoveries than on error control.  Ideally, every study, no matter how small, would be sufficient to support inferences about human behavior.  The criterion to produce a p-value below .05 was deemed an “unfortunate historical commitment to significance testing” (p. 346), when psychologists were only interested in the direction of the observed effect size in their sample.  Apparently, there was no need to examine whether the observed effect size in a small sample was consistent with a population effect size or whether the sign would replicate in a series of studies.

Although p-rep never replaced p-values (most published p-rep values convert into p-values below .05), the general principles of significance testing were ignored. Instead of increasing alpha, researchers found ways to lower p-values to meet the alpha = .05 criterion. A decade later, the consequences of this attitude towards significance testing are apparent.  Many published findings do not hold up when they are subjected to an actual replication attempt by researchers who are willing to report successes and failures.

In this emerging new era, it is important to teach a new generation of psychologists how to navigate the inescapable problem of inferential statistics: you will make errors. Either you falsely claim a discovery of an effect or you fail to provide sufficient evidence for an effect that does exist.  Errors are part of science. How many and what type of errors will be made depends on how scientists conduct their studies.


What would Cohen say? A comment on p < .005

Most psychologists are trained in Fisherian statistics, which has become known as Null-Hypothesis Significance Testing (NHST).  NHST compares an observed effect size against a hypothetical effect size. The hypothetical effect size is typically zero; that is, the hypothesis is that there is no effect.  The deviation of the observed effect size from zero relative to the amount of sampling error provides a test statistic (test statistic = effect size / sampling error).  The test statistic can then be compared to a criterion value. The criterion value is typically chosen so that only 5% of test statistics would exceed the criterion value by chance alone.  If the test statistic exceeds this value, the null-hypothesis is rejected in favor of the inference that an effect greater than zero was present.

One major problem of NHST is that non-significant results are not considered.  To address this limitation, Neyman and Pearson extended Fisherian statistics and introduced the concepts of type-I (alpha) and type-II (beta) errors.  A type-I error occurs when researchers falsely reject a true null-hypothesis; that is, they infer from a significant result that an effect was present, when there is actually no effect.  The type-I error rate is fixed by the criterion for significance, which is typically p < .05.  This means that a set of studies cannot produce more than 5% false-positive results.  The maximum of 5% false positive results would only be observed if all studies have no effect. In this case, we would expect 5% significant results and 95% non-significant results.

The important contribution by Neyman and Pearson was to consider the complementary type-II error.  A type-II error occurs when an effect is present, but a study produces a non-significant result.  In this case, researchers fail to detect a true effect.  The type-II error rate depends on the size of the effect and the amount of sampling error.  If effect sizes are small and sampling error is large, test statistics will often be too small to exceed the criterion value.

Neyman-Pearson statistics was popularized in psychology by Jacob Cohen.  In 1962, Cohen examined effect sizes and sample sizes (as a proxy for sampling error) in the Journal of Abnormal and Social Psychology and concluded that there is a high risk of type-II errors because sample sizes are too small to detect even moderate effect sizes and inadequate to detect small effect sizes.  Over the next decades, methodologists have repeatedly pointed out that psychologists often conduct studies with a high risk of failing to provide empirical evidence for real effects (Sedlmeier & Gigerenzer, 1989).

The concern about type-II errors has been largely ignored by empirical psychologists.  One possible reason is that journals had no problem filling volumes with significant results, while rejecting 80% of submissions that also presented significant results.  Apparently, type-II errors were much less common than methodologists feared.

However, in 2011 it became apparent that the high success rate in journals was illusory. Published results were not representative of studies that were conducted. Instead, researchers used questionable research practices or simply did not report studies with non-significant results.  In other words, the type-II error rate was as high as methodologists suspected, but selection of significant results created the impression that nearly all studies were successful in producing significant results.  The influential “False Positive Psychology” article suggested that it is very easy to produce significant results without an actual effect.  This led to the fear that many published results in psychology may be false positive results.

Doubt about the replicability and credibility of published results has led to numerous recommendations for the improvement of psychological science.  One of the most obvious recommendations is to ensure that published results are representative of the studies that are actually being conducted.  Given the high type-II error rates, this would mean that journals would be filled with many non-significant and inconclusive results.  This is not a very attractive solution because it is not clear what the scientific community can learn from an inconclusive result.  A better solution would be to increase the statistical power of studies. Statistical power is simply the complement of the type-II error rate (power = 1 – beta).  As power increases, studies with a true effect have a higher chance of producing a true positive result (e.g., a drug is an effective treatment for a disease). Numerous articles have suggested that researchers should increase power to increase replicability and credibility of published results (e.g., Schimmack, 2012).

In a recent article, a team of 72 authors proposed another solution. They recommended that psychologists should reduce the probability of a type-I error from 5% (1 out of 20 studies) to 0.5% (1 out of 200 studies).  This recommendation is based on the belief that the replication crisis in psychology reflects a large number of type-I errors.  By reducing the alpha criterion, the rate of type-I errors will be reduced from a maximum of 10 out of 200 studies to 1 out of 200 studies.

I believe that this recommendation is misguided because it ignores the consequences of a more stringent significance criterion on type-II errors.  Keeping resources and sampling error constant, reducing the type-I error rate increases the type-II error rate. This is undesirable because the actual type-II error is already large.

For example, a between-subject comparison of two means with a standardized effect size of d = .4 and a sample size of N = 100 (n = 50 per cell) has a 50% risk of a type-II error.  The risk of a type-II error rises to 80% if alpha is reduced to .005.  It makes no sense to conduct a study with an 80% chance of failure (Tversky & Kahneman, 1971).  Thus, the call for a lower alpha implies that researchers will have to invest more resources to discover true positive results.  Many researchers may simply lack the resources to meet this stringent significance criterion.
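These numbers can be checked with the power.t.test function in base R (stats package); the values are approximate:

# risk of a type-II error for d = .4 with n = 50 per cell (N = 100 in total)
power.t.test(n = 50, delta = .4, sd = 1, sig.level = .05)$power    # ~ .50, so beta ~ .50
power.t.test(n = 50, delta = .4, sd = 1, sig.level = .005)$power   # ~ .20, so beta ~ .80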

My suggestion is exactly opposite to the recommendation of a more stringent criterion.  The main problem for selection bias in journals is that even the existing criterion of p < .05 is too stringent and leads to a high percentage of type-II errors that cannot be published.  This has produced the replication crisis with large file-drawers of studies with p-values greater than .05,  the use of questionable research practices, and publications of inflated effect sizes that cannot be replicated.

To avoid this problem, researchers should use a significance criterion that balances the risk of a type-I and type-II error.  For example, with an expected effect size of d = .4 and N = 100, researchers should use p < .20 for significance, which reduces the risk of a type-II error to 20%.  In this case, type-I and type-II error are balanced.  If the study produces a p-value of, say, .15, researchers can publish the result with the conclusion that the study provided evidence for the effect. At the same time, readers are warned that they should not interpret this result as strong evidence for the effect because there is a 20% probability of a type-I error.

Given this positive result, researchers can then follow up their initial study with a larger replication study that allows for stricter type-I error control, while holding power constant.   With d = .4, they now need N = 200 participants to have 80% power and alpha = .05.  Even if the second study does not produce a significant result (the probability that two studies with 80% power are both significant is only 64%, Schimmack, 2012), researchers can combine the results of both studies and, with N = 300, the combined studies have 80% power with alpha = .01.

The advantage of starting with smaller studies with a higher alpha criterion is that researchers are able to test risky hypotheses with a smaller amount of resources.  In the example, the first study used “only” 100 participants.  In contrast, the proposal to require p < .005 as evidence for an original, risky study implies that researchers need to invest a lot of resources in a risky study that may provide inconclusive results if it fails to produce a significant result.  A power analysis shows that a sample size of N = 338 participants is needed to have 80% power for an effect size of d = .4 and p < .005 as criterion for significance.
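The power and sample size claims in the last two paragraphs can also be checked with power.t.test (approximate values):

power.t.test(n = 100, delta = .4, sd = 1, sig.level = .05)$power    # ~ .80 with N = 200 in total
power.t.test(n = 150, delta = .4, sd = 1, sig.level = .01)$power    # ~ .80 with N = 300 combined
2 * ceiling(power.t.test(delta = .4, sd = 1, sig.level = .005, power = .80)$n)   # ~ 338 participants needed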

Rather than investing over 300 participants into a risky study that may produce a non-significant and uninteresting result (eating green jelly beans does not cure cancer), researchers may be better able and willing to start with 100 participants and to follow up an encouraging result with a larger follow-up study.  The evidential value that arises from one study with 300 participants or two studies with 100 and 200 participants is the same, but requiring p < .005 from the start discourages risky studies and puts even more pressure on researchers to produce significant results if all of their resources are used for a single study.  In contrast, a higher initial alpha reduces the need for questionable research practices and reduces the risk of type-II errors.

In conclusion, it is time to learn Neyman-Pearson statistics and to remember Cohen’s important contribution that many studies in psychology are underpowered.  Low power produces inconclusive results that are not worthwhile publishing.  A study with low power is like a high-jumper that puts the bar too high and fails every time. We learn nothing about the jumper’s ability. Scientists may learn from high-jump contests, where jumpers start with lower and realistic heights and then raise the bar when they succeed.  In the same manner, researchers should conduct pilot studies or risky exploratory studies with small samples and a high type-I error probability and lower the alpha criterion gradually if the results are encouraging, while maintaining a reasonably low type-II error.

Evidently, a significant result with alpha = .20 does not provide conclusive evidence for an effect.  However, the arbitrary p < .005 criterion also falls short of demonstrating conclusively that an effect exists.  Journals publish thousands of results a year and some of these results may be false positives, even if the error rate is set at 1 out of 200. Thus, p < .005 is neither defensible as a criterion for a first exploratory study, nor conclusive evidence for an effect.  A better criterion for conclusive evidence is that an effect can be replicated across different laboratories with a type-I error probability of less than 1 out of a billion (6 sigma).  This is by no means an unrealistic target.  To achieve this criterion with an effect size of d = .4, a sample size of N = 1,000 is needed.  The combined evidence of 5 labs with N = 200 per lab would be sufficient to produce conclusive evidence for an effect, but only if there is no selection bias.  Thus, the best way to increase the credibility of psychological science is to conduct studies with high power and to minimize selection bias.

This is what I believe Cohen would have said, but even if I am wrong about this, I think it follows from his futile efforts to teach psychologists about type-II errors and statistical power.

Personalized Adjustment of p-values for publication bias

The logic of null-hypothesis significance testing is straightforward (Schimmack, 2017). The observed signal in a study is compared against the noise in the data due to sampling variation.  This signal to noise ratio is used to compute a probability, the p-value.  If this p-value is below a threshold, typically p < .05, it is assumed that the observed signal is not just noise and the null-hypothesis is rejected in favor of the hypothesis that the observed signal reflects a true effect.

NHST aims to keep the probability of a false positive discovery at a desirable rate. With p < .05, no more than 5% of ALL statistical tests can be false positives.  In other words, the long-run rate of false positive discoveries cannot exceed 5%.

The problem with the application of NHST in practice is that not all statistical results are reported. As a result, the rate of false positive discoveries can be much higher than 5% (Sterling, 1959; Sterling et al., 1995) and statistical significance no longer provides meaningful information about the probability of false positive results.

In order to produce meaningful statistical results it would be necessary to know how many statistical tests were actually performed to produce published significant results. This set of studies includes studies with non-significant results that remained unpublished. This set of studies is often called researchers’ file-drawer (Rosenthal, 1979).  Schimmack and Brunner (2016) developed a statistical method that estimates the size of researchers’ file drawer.  This makes it possible to correct reported p-values for publication bias so that p-values resume their proper function of providing statistical evidence about the probability of observing a false-positive result.

The correction process is first illustrated with a powergraph for statistical results reported in 103 journals in the year 2016 (see 2016 Replicability Rankings for more details).  Each test statistic is converted into an absolute z-score.  Absolute z-scores quantify the signal to noise ratio in a study.  Z-scores can be compared against the standard normal distribution that is expected from studies without an effect (the null-hypothesis).  A z-score of 1.96 (see red dashed vertical line in the graph) corresponds to the typical p < .05 (two-tailed) criterion.  The graph below shows that 63% of reported test statistics were statistically significant using this criterion.

[Figure: Powergraph of all test statistics reported in 103 psychology journals in 2016]
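As an illustration, the conversion of a reported test statistic into an absolute z-score can be sketched in R; the t-value and degrees of freedom below are hypothetical:

t.val <- 2.50                                         # hypothetical test statistic, t(48) = 2.50
df    <- 48
p.two <- 2 * pt(abs(t.val), df, lower.tail = FALSE)   # two-tailed p-value
qnorm(1 - p.two / 2)                                  # absolute z-score, ~ 2.4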

Powergraphs use a statistical method, z-curve (Schimmack & Brunner, 2016) to model the distribution of statistically significant z-scores (z-scores > 1.96).  Based on the model results, it estimates how many non-significant results one would expect. This expected distribution is shown with the grey curve in the figure. The grey curve overlaps with the green and black curve. It is clearly visible that the estimated number of non-significant results is much larger than the actually reported number of non-significant results (the blue bars of z-scores between 0 and 1.96).  This shows the size of the file-drawer.

Powergraphs provide important information about the average power of studies in psychology.  Power is the average probability of obtaining a statistically significant result in the set of all statistical tests that were conducted, including the file drawer.  The estimated power is 39%.  This estimate is consistent with other estimates of power (Cohen, 1962; Sedlmeier & Gigerenzer, 1989), and below the acceptable minimum of 50% (Tversky and Kahneman, 1971).

Powergraphs also provide important information about the replicability of significant results. A published significant result is used to support the claim of a discovery. However, even a true discovery may not be replicable if the original study had low statistical power. In this case, it is likely that a replication study produces a false negative result; it fails to affirm the presence of an effect with p < .05, even though an effect actually exists. The powergraph estimate of replicability is 70%.  That is, any randomly drawn significant effect published in 2016 has only a 70% chance of reproducing a significant result again in an exact replication study.

Importantly, replicability is not uniform across all significant results. Replicability increases with the signal to noise ratio (Open Science Collaboration, 2015). In 2017, powergraphs were enhanced by providing information about the replicability for different levels of strength of evidence. In the graph below, z-scores between 0 and 6 are divided into 12 categories with a width of 0.5 standard deviations (0-0.5, 0.5-1, …, 5.5-6). For significant results, these values are the average replicability for z-scores in the specified range.

The graph shows a replicability estimate of 46% for z-scores between 2 and 2.5. Thus, a z-score greater than 2.5 is needed to meet the minimum standard of 50% replicability.  More important, these power values can be converted into p-values because power and p-values are monotonically related (Hoenig & Heisey, 2001).  If p < .05 is the significance criterion, 50% power corresponds to a p-value of .05.  This also means that all z-scores less than 2.5 correspond to p-values greater than .05 once we take the influence of publication bias into account.  A z-score of 2.6 roughly corresponds to a p-value of .01.  Thus, a simple heuristic for readers of psychology journals is to consider only p < .01 values as significant, if they want to maintain the nominal error rate of 5%.
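A rough sketch of these conversions, using the normal approximation:

crit <- qnorm(.975)                          # significance criterion, ~ 1.96
pnorm(crit, mean = crit, lower.tail = FALSE) # 50% power when the expected z equals the criterion (p = .05)
2 * pnorm(2.6, lower.tail = FALSE)           # ~ .009: a z-score of 2.6 corresponds roughly to p = .01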

One problem with a general adjustment is that file drawers can differ across journals or authors.  An adjustment based on the general publication bias across journals will penalize authors who invest resources into well-designed studies with high power, and it will fail to adjust fully for the effect of publication bias for authors who conduct many underpowered studies that capitalize on chance to produce significant results. It is widely recognized that scientific markets reward quantity of publications over quality.  A personalized adjustment can solve this problem because authors with large file drawers will get a bigger adjustment and many of their nominally significant results will no longer be significant after an adjustment for publication bias has been made.

I illustrate this with two real world examples. The first example shows the powergraph of Marcel Zeelenberg.  The left powergraph shows a model that assumes no file drawer. The model fits the actual distribution of z-scores rather well. However, the graph shows a small bump of just significant results (z = 2 to 2.2) that is not explained by the model. This bump could reflect the use of questionable research practices (QRPs), but it is relatively small (as we will see shortly).  The graph on the right side uses only statistically significant results. This is important because only these results were published to claim a discovery. We see how the small bump has a strong effect on the estimate of the file drawer. It would require a large set of non-significant results to produce this bump. It is more likely that QRPs were used to produce it. However, the bump is small and overall replicability is higher than the average for all journals.  We also see that z-scores between 2 and 2.5 have an average replicability estimate of 52%. This means no adjustment is needed and p-values reported by Marcel Zeelenberg can be interpreted without adjustment. Over the 15 year period, Marcel Zeelenberg reported 537 significant results and we can conclude from this analysis that no more than 5% (27) of these results are false positive results.

[Figure: Powergraphs for Marcel Zeelenberg]

 

A different picture emerges for the powergraph based on Ayelet Fishbach’s statistical results. The left graph shows a big bump of just significant results that is not explained by a model without publication bias.  The right graph shows that the replicability estimate is much lower than for Marcel Zeelenberg and for the analysis of all journals in 2016.

[Figure: Powergraphs for Ayelet Fishbach]

The average replicability estimate for z-values between 2 and 2.5 is only 33%.  This means that researchers are unlikely to obtain a significant result if they attempted an exact replication study of one of these findings.  More important, it means that p-values adjusted for publication bias are well above .05.  Even z-scores in the 2.5 to 3 band average a replicability estimate of only 46%. This means that only z-scores greater than 3 produce significant results after the correction for publication bias is applied.

Non-Significance Does Not Mean Null-Effect 

It is important to realize that a non-significant result does not mean that there is no effect. It simply means that the signal to noise ratio is too weak to infer that an effect was present.  It is entirely possible that Ayelet Fishbach made theoretically correct predictions. However, to provide evidence for her hypotheses, she conducted studies with a high failure rate and many of these studies failed to support her hypotheses. These failures were not reported, but they have to be taken into account in the assessment of the risk of a false discovery.  A p-value of .05 is only meaningful in the context of the number of attempts that have been made.  Nominally, a p-value of .03 may appear to be the same across statistical analyses. But the real evidential value of a p-value is not equivalent.  Using powergraphs to equate evidential value, a p-value of .05 published by Marcel Zeelenberg is equivalent to a p-value of .005 (z = 2.8) published by Ayelet Fishbach.

The Influence of Questionable Research Practices 

Powergraphs assume that an excessive number of significant results is caused by publication bias. However, questionable research practices also contribute to the reporting of mostly successful results.  Replicability estimates and the p-value adjustment for publication bias may themselves be biased by the use of QRPs.  Unfortunately, this effect is difficult to predict because different QRPs have different effects on replicability estimates. Some QRPs will lead to an overcorrection.  Although this creates uncertainty about the right amount of adjustment, a stronger adjustment may have the advantage that it could deter researchers from using QRPs because it would undermine the credibility of their published results.

Conclusion 

Over the past five years, psychologists have contemplated ways to improve the credibility and replicability of published results.  So far, these ideas have yet to show a notable effect on replicability (Schimmack, 2017).  One reason is that the incentive structure rewards the number of publications and replicability is not considered in the review process. Reviewers and editors treat all p-values as equal, when they are not.  The ability to adjust p-values based on the true evidential value that they provide may help to change this.  Journals may lose their impact once readers adjust p-values and realize that many nominally significant results are actually not statistically significant after taking publication bias into account.

 

Meta-Psychology: A new discipline and a new journal (draft)

Ulrich Schimmack and Rickard Carlsson

Psychology is a relatively young science that is just over 100 years old.  During its 100 years of existence, it has seen major changes in the way psychologists study the mind and behavior.  The first laboratories used a mix of methods and studied a broad range of topics. In the 1950s, behaviorism started to dominate psychology with studies of animal behavior. Then cognitive psychology took over and computerized studies with reaction time tasks started to dominate. In the 1990s, neuroscience took off and no top ranked psychology department can function without one or more MRI magnets. Theoretical perspectives have also seen major changes.  In the 1960s, personality traits were declared non-existent. In the 1980s, twin studies were used to argue that everything is highly heritable, and nowadays gene-environment interactions and epigenetics are dominating theoretical perspectives on the nature-nurture debate. These shifts in methods and perspectives are often called paradigm shifts.

It is hard to keep up with all of these paradigm shifts in a young science like psychology. Moreover, many psychology researchers are busy just keeping up with developments in their paradigm. However, the pursuit of advancing research within a paradigm can be costly for researchers and for the science as a whole because this research may become obsolete after a paradigm shift. One senior psychologist once expressed regret that he was a prisoner of a paradigm. To avoid a similar fate, it is helpful to have a broader perspective on developments in the field and to understand how progress in one area of psychology fits into the broader goal of understanding humans’ minds and behaviors.  This is the aim of meta-psychology.  Meta-psychology is the scientific investigation of psychology as a science.  It questions the basic assumptions that underpin research paradigms and monitors the progress of psychological science as a whole.

Why we Need a Meta-Psychology Journal 

Most scientific journals focus on publishing original research articles or review articles (meta-analyses) of studies on a particular topic.  This makes it difficult to publish meta-psychological articles.  As publishing in peer-reviewed journals is used to evaluate researchers, few researchers dedicated time and energy to meta-psychology, and those that did often had difficulties finding an outlet for their work.

In 2006, Ed Diener created Perspectives on Psychological Science (PPS), published by the Association for Psychological Science.  The journal aims to publish an “eclectic mix of provocative reports and articles, including broad integrative reviews, overviews of research programs, meta-analyses, theoretical statements, and articles on topics such as the philosophy of science, opinion pieces about major issues in the field, autobiographical reflections of senior members of the field, and even occasional humorous essays and sketches.”  Not all of the articles in PPS are meta-psychology. However, PPS created a home for meta-psychological articles.  We carefully examined articles in PPS to identify content areas of meta-psychology.

We believe that MP can fulfill an important role in the growing number of psychology journals.  Most important, PPS can only publish a small number of articles.  For-profit journals like PPS pride themselves on their high rejection rates.  We believe that high rejection rates create a problem and give editors and reviewers too much power to shape the scientific discourse and direction of psychology.  The power of editors is itself an important topic in meta-psychology.  In contrast to PPS, MP is an online journal with no strict page limits.  We will let the quality of published articles rather than rejection rates determine the prestige of our journal.

PPS is a for-profit journal and published content is hidden behind paywalls. We think this is a major problem that does not serve the interest of scientists.  All articles published in MP will be open access.  One problem with some open access journals is that they charge high fees for authors to get their work published.  This gives authors from rich countries with grants a competitive advantage. MP will not charge any fees.

In short, while we appreciate the contribution PPS has made to the development of meta-psychology, we see MP as a modern journal that meets the need of psychology as a science for a journal that is dedicated to publishing meta-psychological articles without high rejection rates and without high costs to authors and readers.

Content Areas of Meta-Psychology 

1. Critical reflections on the process of data collection.

1.1.  Sampling

Amazon’s Mechanical Turk: A New Source of Inexpensive, Yet High-Quality, Data?
By: Buhrmester, Michael; Kwang, Tracy; Gosling, Samuel D.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 6   Issue: 1   Pages: 3-5   Published: JAN 2011

1.2.  Experimental Paradigms

Using Smartphones to Collect Behavioral Data in Psychological Science: Opportunities, Practical Considerations, and Challenges
By: Harari, Gabriella M.; Lane, Nicholas D.; Wang, Rui; et al.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 11   Issue: 6   Pages: 838-854   Published: NOV 2016

1.3. Validity

What Do Implicit Measures Tell Us? Scrutinizing the Validity of Three Common Assumptions
By: Gawronski, Bertram; Lebel, Etienne P.; Peters, Kurt R.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 2   Issue: 2   Pages: 181-193   Published: JUN 2007

 

2.  Critical reflections on statistical methods / tutorials on best practices

2.1.  Philosophy of Statistics

Bayesian Versus Orthodox Statistics: Which Side Are You On?
By: Dienes, Zoltan
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 6   Issue: 3   Pages: 274-290   Published: MAY 2011

2.2. Tutorials

Sailing From the Seas of Chaos Into the Corridor of Stability Practical Recommendations to Increase the Informational Value of Studies
By: Lakens, Daniel; Evers, Ellen R. K.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 9   Issue: 3   Pages: 278-292   Published: MAY 2014

3. Critical reflections on published results / replicability

3.1.  Fraud

Scientific Misconduct and the Myth of Self-Correction in Science
By: Stroebe, Wolfgang; Postmes, Tom; Spears, Russell
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 7   Issue: 6   Pages: 670-688   Published: NOV 2012

3.2. Publication Bias

Puzzlingly High Correlations in fMRI Studies of Emotion, Personality, and Social Cognition
By: Vul, Edward; Harris, Christine; Winkielman, Piotr; et al.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 4   Issue: 3   Pages: 274-290   Published: MAY 2009

3.3. Quality of Peer-Review

The Air We Breathe: A Critical Look at Practices and Alternatives in the Peer-Review Process
By: Suls, Jerry; Martin, Rene
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 4   Issue: 1   Pages: 40-50   Published: JAN 2009

4. Critical reflections on Paradigms and Paradigm Shifts

4.1  History

Sexual Orientation Differences as Deficits: Science and Stigma in the History of American Psychology
By: Herek, Gregory M.
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 5   Issue: 6   Pages: 693-699   Published: NOV 2010

4.2. Topics

Domain Denigration and Process Preference in Academic Psychology
By: Rozin, Paul
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 1   Issue: 4   Pages: 365-376   Published: DEC 2006

4.3 Incentives

Giving Credit Where Credit’s Due: Why It’s So Hard to Do in Psychological Science
By: Simonton, Dean Keith
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 11   Issue: 6   Pages: 888-892   Published: NOV 2016

4.4 Politics

Political Diversity in Social and Personality Psychology
By: Inbar, Yoel; Lammers, Joris
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 7   Issue: 5   Pages: 496-503   Published: SEP 2012

4.5. Paradigms

Why the Cognitive Approach in Psychology Would Profit From a Functional Approach and Vice Versa
By: De Houwer, Jan
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 6   Issue: 2   Pages: 202-209   Published: MAR 2011

5. Critical reflections on teaching and dissemination of research

5.1  Teaching

Teaching Replication
By: Frank, Michael C.; Saxe, Rebecca
PERSPECTIVES ON PSYCHOLOGICAL SCIENCE   Volume: 7   Issue: 6   Pages: 600-604   Published: NOV 2012

5.2. Coverage of research in textbooks

N.A.

5.3  Coverage of psychology in popular books

N.A.

5.4  Popular Media Coverage of Psychology

N.A.

5.5. Social Media and Psychology

N.A.

 

Vision and Impact Statement

Currently, PPS ranks number 7 out of all psychology journals with an Impact Factor of 6.08. The broad appeal of meta-psychology accounts for this relatively high impact factor. We believe that many articles published in MP will also achieve high citation rates, but we do not compete for the highest ranking.  A journal that publishes only 1 article a year will get a higher ratio of citations per article than a journal that publishes 10 articles a year.  We recognize that it is difficult to predict which articles will become citation classics, and we would rather publish one gem and nine so-so articles than miss out on publishing the gem. We anticipate that MP will publish many gems that PPS rejected and we will be happy to give these articles a home.

This does not mean that MP will publish everything. We will harness the wisdom of crowds and encourage authors to share their manuscripts on pre-publication sites or on social media for critical commentary.  In addition, reviewers will help authors to improve their manuscripts, while authors can be assured that investing in major revisions will be rewarded with a better publication rather than an ultimate rejection that requires further changes to please editors at another journal.

 

 

 

 

An Attempt at Explaining Null-Hypothesis Testing and Statistical Power with 1 Figure and 1,500 Words

Is a Figure worth 1,500 words?

[Figure: sampling distributions under the null hypothesis (red curve) and under the alternative with an expected z-score of 2.8 (blue curve), with the significance criterion marked by a green vertical line]

Gpower. http://www.gpower.hhu.de/en.html

Significance Testing

1. The red curve shows the sampling distribution if there is no effect. Most results will give a signal/noise ratio close to 0 because there is no effect (0/1 = 0).

2. Sometimes sampling error can produce large signals, but these events are rare.

3. To be sure that we have a real signal, we can choose a high criterion to decide that there was an effect (reject H0). Normally, we use a 2:1 ratio (z > 2) to do so, but we could use a higher or lower criterion value.  This value is shown by the green vertical line in the Figure.

4. A z-score greater than 2 cuts off only 2.5% of the red distribution. This means we would expect only 2.5% of outcomes with z-scores greater than 2 if there is no effect. If we used the same criterion for negative effects, we would get another 2.5% in the lower tail of the red distribution. Combined, we would have 5% of cases where we have a false positive, that is, we decide that there is an effect when there is no effect. This is why we say p < .05 to call a result significant. The probability (p) of a false positive result is no greater than 5% if we keep on repeating studies and use z > 2 as the criterion to claim an effect. If there is never an effect in any of the studies we are doing, we end up with 5% false positive results. A false positive is also called a type-I error. We are making the mistake of inferring from our study that an effect is present when there is no effect.
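For readers who want to check these tail areas, a two-line verification in R:

pnorm(2, lower.tail = FALSE)        # ~ .023: area of the red curve beyond z = 2
2 * pnorm(2, lower.tail = FALSE)    # ~ .046: both tails combined, roughly the 5% criterion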

Statistical Power

5. Now that you understand significance testing (LOL), we can introduce the concept of statistical power. Effects can be large or small. For example, gender differences in height are large, gender differences in the number of sexual partners are small.  Also, studies can have a lot of sampling error or very little sampling error.  A study of 10 men and 10 women may accidentally include 2 women who are on the basketball team.  A study of 1,000 men and women is likely to be more representative of the population.  Based on the effect size in the population and the sample size, the true signal (effect size in the population) to noise (sampling error) ratio can differ.  The higher the signal to noise ratio is, the further to the right the sampling distribution of the real data (the blue curve) will be.  In the Figure, the population effect size and sampling error produce an expected z-score of 2.8, but actual samples will rarely produce exactly this value.  Sampling error will again produce different z-scores above or below the expected value of 2.8.  Most samples will produce values close to 2.8, but some samples will produce more extreme deviations.  Samples that overestimate the expected value of 2.8 are not a problem because these values are all greater than the criterion for statistical significance. So, in all of these samples we will make the right decision to infer that an effect is present when an effect is present, a so-called true positive result.  Even if sampling error leads to a small underestimation of the expected value of 2.8, the values can still be above the criterion for statistical significance and we get a true positive result.

6. When sampling error leads to more extreme underestimation of the expected value of 2.8, samples may produce results with a z-score less than 2.  Now the result is no longer statistically significant. These cases are called false negatives or type-II errors.  We fail to infer that an effect is present, when there actually is an effect (think about a faulty pregnancy test that fails to detect that a woman is pregnant).  It does not matter whether we actually infer that there is no effect or remain indecisive about the presence of an effect. We did a study where an effect exists and we failed to provide sufficient evidence for it.

7. The Figure shows the probability of making a type-II error as the area of the blue curve on the left side of the green line.  In this example, 20% of the blue curve is on the left side of the green line. This means 20% of all samples with an expected value of 2.8 will produce false negative results.

8. We can also focus on the area of the blue curve on the right side of the green line.  If 20% of the area is on the left side, 80% of the area must be on the right side.  This means, we have an 80% probability to obtain a true positive result; that is, a statistically significant result where the observed z-score is greater than the criterion z-score of 2.   This probability is called statistical power.  A study with high power has a high probability to discover real effects by producing z-scores greater than the criterion value. A study with low power has a high probability to produce a false negative result by producing z-scores below the criterion value.
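A quick check of these percentages in R, using the expected z-score of 2.8 and the criterion of z = 2 from the Figure (the exact values are closer to 21% and 79%):

pnorm(2, mean = 2.8)                      # ~ .21: area of the blue curve left of the criterion (type-II error)
pnorm(2, mean = 2.8, lower.tail = FALSE)  # ~ .79: area to the right of the criterion (power)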

9. Power depends on the criterion value and the expected value.  We could reduce the type-II error and increase power in the Figure by moving the green line to the left.  As we reduce the criterion to claim an effect, we reduce the area of the blue curve on the left side of the line. We are now less likely to encounter false negative results when an effect is present.  However, there is a catch.  By moving the green line to the left, we are increasing the area of the red curve on the right side of the green line. This means we are increasing the probability of a false positive result.  To avoid this problem we can keep the green line where it is and move the expected value of the blue curve to the right.  By shifting the blue curve to the right, a smaller area of the blue curve will be on the left side of the green line.

10. In order to move the blue curve to the right we need to increase the effect size or reduce sampling error.  In experiments it may be possible to use more powerful manipulations to increase effect sizes.  However, often increasing effect sizes is not an option.  How would you increase the effect size of sex on sexual partners?  Therefore, your best option is to reduce sampling error.  As sampling error decreases, the blue curve moves further to the right and statistical power increases.

Practical Relevance: The Hunger Games of Science: With high power the odds are always in your favor

11. Learning about statistical power is important because the outcome of your studies does not just depend on your expertise. It also depends on factors that are not under your control. Sampling error can sometimes help you to get significance by giving you z-scores higher than the expected value, but these z-scores will not replicate because sampling error can also be your enemy and lower your z-scores.  In this way, each study that you do is a bit like playing the lottery or opening a box of chocolates. You never know how much sampling error you will get.  The good news is that you are in charge of the number of winning tickets in the lottery.  A study with 20% power has only 20% winning tickets.  The other 80% say, “please play again.”  A study with 80% power has 80% winning tickets.  You have a high chance to get a significant result, and you or others will be able to redo the study and again have a high chance to replicate your original result.  It can be embarrassing when somebody conducts a replication study of your significant result and ends up with a failure to replicate your finding.  You can avoid this outcome by conducting studies with high statistical power.

12. Of course, there is a price to pay. Reducing sampling error often requires more time and participants. Unfortunately, the costs increase exponentially.  It is easier to increase statistical power from 20% to 50% than to increase it from 50% to 80%. It is even more costly to increase it from 80% to 90%.  This is what economists call diminishing marginal utility.  Initially you get a lot of bang for your buck, but eventually the costs for any real gains are too high.  For this reason, Cohen (1988) recommended that researchers should aim for 80% power in their studies.  This means that 80% of your initial attempts to demonstrate an effect will succeed when your hard work in planning and conducting a study produced a real effect.  For the other 20% of studies, you may either give up or try again to see whether your first study produced a true negative result (there is no effect) or a false negative result (you did everything correctly, but sampling error handed you a losing ticket).  Failure is part of life, but you have some control over the number of failures that you encounter.

13. The End. You are now ready to learn how to conduct power analysis for actual studies to take control of your fate.  Be a winner, not a loser.

 

Replicability Review of 2016

2016 was surely an exciting year for anybody interested in the replicability crisis in psychology. Some of the biggest news stories in 2016 came from attempts by the psychology establishment to downplay the replication crisis in psychological research (Wired Magazine). At the same time, 2016 delivered several new replication failures that provide further ammunition for the critics of established research practices in psychology.

I. The Empire Strikes Back

1. The Open Science Collaborative Reproducibility Project was flawed.

Daniel Gilbert and Tim Wilson published a critique of the Open Science Collaborative in Science. According to Gilbert and Wilson, the project that replicated 100 original research studies and reported a replication success rate of only 36% was riddled with errors. Consequently, the low success rate only reveals the incompetence of the replicators and has no implications for the replicability of original studies published in prestigious psychological journals like Psychological Science. Science Daily suggested that the critique overturned the landmark study.

[Image: Science Daily headline]

Nature published a more balanced commentary.  In an interview, Gilbert explains that “the number of studies that actually did fail to replicate is about the number you would expect to fail to replicate by chance alone — even if all the original studies had shown true effects.”   This quote is rather strange if we really consider the replication studies to be flawed and error-riddled.  If the replication studies were bad, we would expect fewer studies to replicate than we would expect based on chance alone.  If the success rate of 36% is consistent with the effect of chance alone, the replication studies are just as good as the original studies and the only reason for non-significant results would be chance. Thus, Gilbert’s comment implies that he believes the typical statistical power of a study in psychology is about 36%. Gilbert does not seem to realize that he is inadvertently admitting that published articles report vastly inflated success rates, because 97% of the original studies reported a significant result.  To report 97% significant results with an average power of 36%, researchers must either be hiding studies that failed to support their favored hypotheses in proverbial file drawers or be using questionable research practices to inflate the evidence in favor of their hypotheses. Thus, ironically, Gilbert’s comments confirm the critics’ charge that the low success rate in the reproducibility project can be explained by selective reporting of evidence that supports authors’ theoretical predictions.

2. Contextual Sensitivity Explains Replicability Problem in Social Psychology

Jay van Bavel and colleagues made a second attempt to downplay the low replicability of published results in psychology. He even got to write about it in the New York Times.

[Image: Van Bavel’s New York Times op-ed]

Van Bavel blames the Open Science Collaboration for overlooking the importance of context. “Our results suggest that many of the studies failed to replicate because it was difficult to recreate, in another time and place, the exact same conditions as those of the original study.”   This statement caused a lot of bewilderment.  The OSC carefully tried to replicate the original studies as closely as possible, and at the same time they were sensitive to the effect of context. For example, if a replication of an original US study was carried out in Germany, stimulus words were translated from English into German, because native German speakers might not respond to the original English words the same way native English speakers do.  However, switching languages means that the replication study is not identical to the original study. Maybe the effect can only be obtained with English speakers. And if the study was conducted at Harvard, maybe the effect can only be replicated with Harvard students. And if the study was conducted primarily with female students, it may not replicate with male students.

To provide evidence for his claim, Jay van Bavel obtained subjective ratings of contextual sensitivity. That is, raters guessed how sensitive the outcome of a study is to variations in context.  These ratings were then used to predict the success of the 100 replication studies in the OSC project.

Jay van Bavel proudly summarized the results in the NYT article. “As we predicted, there was a correlation between these context ratings and the studies’ replication success: The findings from topics that were rated higher on contextual sensitivity were less likely to be reproduced. This held true even after we statistically adjusted for methodological factors like sample size of the study and the similarity of the replication attempt. The effects of some studies could not be reproduced, it seems, because the replication studies were not actually studying the same thing.”

The article leaves out a few important details.  First, the correlation between contextual sensitivity ratings and replication success was small, r = .20, which corresponds to a mere 4% of explained variance.  Thus, even if contextual sensitivity contributed to replication failures, it would explain replication failures for only a small percentage of studies. Second, the authors used several measures of replicability, and some of these measures failed to show the predicted relationship. Third, the statement makes the elementary mistake of confusing correlation and causality.  The authors merely demonstrated that subjective ratings of contextual sensitivity predicted the outcomes of replication studies; they did not show that contextual sensitivity caused replication failures.  Most important, Jay van Bavel failed to mention that they also conducted an analysis that controlled for discipline.  The Open Science Collaborative had already demonstrated that studies in cognitive psychology are more replicable (50% success rate) than studies in social psychology (an awful 25%).  In an analysis that controlled for differences between disciplines, contextual sensitivity was no longer a statistically significant predictor of replication failures.  This hidden fact was revealed in a commentary (or should we say correction) by Joel Inbar.  In conclusion, this attempt at propping up the image of social psychology as a respectable science with replicable results turned out to be another embarrassing example of sloppy research methodology.

3. Anti-Terrorism Manifesto by Susan Fiske

Later that year, Susan Fiske, a former president of the Association for Psychological Science (APS), caused a stir by comparing critics of established psychology to terrorists (see Business Insider article).  She later withdrew the comparison in response to the criticism of her remarks on social media (APS website).

[Image: Fiske’s column]

Fiske attempted to defend established psychology by arguing that the field is self-correcting and does not require self-appointed social-media vigilantes. She claimed that these criticisms were destructive and damaging to psychology.

“Our field has always encouraged — required, really — peer critiques.”

“To be sure, constructive critics have a role, with their rebuttals and letters-to-the-editor subject to editorial oversight and peer review for tone, substance, and legitimacy.”

“One hopes that all critics aim to improve the field, not harm people. But the fact is that some inappropriate critiques are harming people. They are a far cry from temperate peer-reviewed critiques, which serve science without destroying lives.”

Many critics of established psychology did not share Fiske’s rosy and false description of the way psychology operates.  Peer review has been shown to be a woefully unreliable process. Moreover, the key criterion for accepting a paper is that it presents flawless results that seem to support some extraordinary claim (a 5-minute online manipulation reduces university drop-out rates by 30%), no matter how these results were obtained or whether they can be replicated.

In her commentary, Fiske is silent about the replication crisis and does not reconcile her image of a critical peer-review system with the fact that only 25% of social psychological studies are replicable and some of the most celebrated findings in social psychology (e.g., elderly priming) are now in doubt.

The rise of blogs and Facebook groups that break with the rules of the establishment poses a threat to the APS establishment, whose main goal is lobbying for psychological research funding in Washington. By painting critics of the establishment as terrorists, Fiske tried to dismiss criticism of established psychology without having to engage with the substantive arguments for why psychology is in crisis.

In my opinion, her attempt to do so backfired. The response to her column showed that the reform movement is gaining momentum and that few young researchers are willing to prop up a system that is more concerned with publishing articles and securing grant money than with making real progress in understanding human behavior.

II. Major Replication Failures in 2016

4. Epic Failure to Replicate Ego-Depletion Effect in a Registered Replication Report

Ego depletion is a big theory in social psychology, and the inventor of the ego-depletion paradigm, Roy Baumeister, is arguably one of the biggest names in contemporary social psychology.  In 2010, a meta-analysis seemed to confirm that ego depletion is a highly robust and replicable phenomenon.  However, this meta-analysis failed to take publication bias into account.  In 2014, a new meta-analysis revealed massive evidence of publication bias. It also found that there was no statistically reliable evidence for ego depletion after taking publication bias into account (Slate, Huffington Post).

[Image: ego-depletion theory crumbling]

A team of researchers, including the first author of the supportive meta-analysis from 2010, conducted replication studies using the same experiment in 24 different labs.  Each of these studies alone would have had a low probability of detecting a small ego-depletion effect, but the combined evidence from all 24 labs made it possible to detect an ego-depletion effect even if it were much smaller than published articles suggest.  Yet the project failed to find any evidence for an ego-depletion effect, suggesting that it is much harder to demonstrate ego-depletion effects than one would believe based on over 100 published articles with successful results.

Critics of Baumeister’s research practices (Schimmack) felt vindicated by this stunning failure. However, even proponents of ego-depletion theory (Inzlicht) acknowledged that ego-depletion theory lacks a strong empirical foundation and that it is not clear what 20 years of research on ego-depletion have taught us about human self-control.

Not so, Roy Baumeister.  Like a bank that is too big to fail, Baumeister defended ego-depletion as a robust empirical finding and blamed the replication team for the negative outcome.  Although he was consulted and approved the design of the study, he later argued that the experimental task was unsuitable to induce ego-depletion. It is not hard to see the circularity in Baumeister’s argument.  If a study produces a positive result, the manipulation of ego-depletion was successful. If a study produces a negative result, the experimental manipulation failed. The theory is never being tested because it is taken for granted that the theory is true. The only empirical question is whether an experimental manipulation was successful.

Baumeister also claimed that his own lab has been able to replicate the effect many times, without explaining the strong evidence for publication bias in the ego-depletion literature or the results of a meta-analysis showing that results from his own lab are no different from results from other labs.

A related article by Baumeister in a special issue on the replication crisis in psychology was another highlight in 2016.  In this article, Baumeister introduced the concept of FLAIR.

[Image: Scientist with FLAIR]

Baumeister writes, “When I was in graduate school in the 1970s, n=10 was the norm, and people who went to n=20 were suspected of relying on flimsy effects and wasting precious research participants. Over the years the norm crept up to about n = 20. Now it seems set to leap to n = 50 or more.” (JESP, 2016, p. 154).  He misses the good old days and suggests that the old system rewarded researchers with flair.  “Patience and diligence may be rewarded, but competence may matter less than in the past. Getting a significant result with n = 10 often required having an intuitive flair for how to set up the most conducive situation and produce a highly impactful procedure. Flair, intuition, and related skills matter much less with n = 50.” (JESP, 2016, p. 156).

This quote explains the low replication rate in social psychology and the failure to replicate ego-depletion effects.   It is simply not possible to conduct studies with n = 10 and be successful in most of them, because empirical studies in psychology are subject to sampling error.  Each study with n = 10 on a new sample of participants will produce dramatically different results, because samples of n = 10 are very different from each other.  This is a fundamental fact of empirical research that appears to elude one of the most successful empirical social psychologists.  So, a researcher with FLAIR may set up a clever experiment with a strong manipulation (e.g., having participants smell chocolate cookies and eat radishes instead) and get a significant result. But this is not a replicable finding. For every study with flair that worked, there are numerous studies that did not work. Researchers with flair ignore these failed studies, focus on the studies that worked, and use those studies for publication.  It can be shown statistically that they do, as I did with Baumeister’s glucose studies (Schimmack, 2012) and Baumeister’s ego-depletion studies in general (Schimmack, 2016).  So, a researcher who gets significant results with small samples (n = 10) surely has FLAIR (False, Ludicrous, And Incredible Results).
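A quick power calculation in R illustrates the problem; the effect sizes are hypothetical and chosen only for illustration.

# Illustrative sketch: power of a two-group study with n = 10 per cell.
power.t.test(n = 10, delta = .5, sd = 1, sig.level = .05)$power  # medium effect (d = .5): ~ .18
power.t.test(n = 10, delta = .8, sd = 1, sig.level = .05)$power  # large effect (d = .8): ~ .39

Even if every hypothesis were true and every effect were large, a researcher running studies with n = 10 per cell would obtain significant results in well under half of the attempts, so a published record of mostly significant results from such studies is not credible.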

Baumeister’s article contained additional insights into the research practices that fueled a highly productive and successful career.  For example, he distinguishes researchers who report boring true positive results from interesting researchers who publish interesting false positive results.  He argues that science needs both types of researchers. Unfortunately, most people assume that scientists prioritize truth, which is the main reason for subjecting theories to empirical tests. But scientists with FLAIR get positive results even when their interesting ideas are false (Bem, 2011).

Baumeister mentions psychoanalysis as an example of interesting psychology. What could be more interesting than the Freudian idea that every boy goes through a phase in which he wants to kill daddy and make love to mommy?  Interesting stuff, indeed, but this idea has no scientific basis.  In contrast, twin studies suggest that many personality traits, values, and abilities are partially inherited. To reveal this boring fact, it was necessary to recruit large samples of thousands of twins.  That is not something a psychologist with FLAIR can handle.  “When I ran my own experiments as a graduate student and young professor, I struggled to stay motivated to deliver the same instructions and manipulations through four cells of n=10 each. I do not know how I would have managed to reach n=50. Patient, diligent researchers will gain, relative to others” (Baumeister, JESP, 2016, p. 156). So, we may see the demise of researchers with FLAIR, and diligent, patient researchers who listen to their data may take their place. Now there is something to look forward to in 2017.

[Image: Scientist without FLAIR]

5. No Laughing Matter: Replication Failure of Facial Feedback Paradigm

A second Registered Replication Report (RRR) delivered another blow to the establishment.  This project replicated a classic study on the facial-feedback hypothesis.  Like other peripheral emotion theories, facial-feedback theories assume that experiences of emotions depend (fully or partially) on bodily feedback.  That is, we feel happy because we smile, rather than smiling because we are happy.  Numerous studies had examined the contribution of bodily feedback to emotional experience, and the evidence was mixed.  Moreover, studies that found effects had a major methodological problem: simply asking participants to smile might make them think happy thoughts, which could elicit positive feelings.  In the 1980s, social psychologist Fritz Strack invented a procedure that solved this problem (see Slate article).  Participants are deceived into believing that they are testing a procedure that allows handicapped people to complete a questionnaire by holding a pen in their mouth.  Participants who hold the pen with their lips activate muscles that are activated during sadness. Participants who hold the pen with their teeth activate muscles that are activated during happiness.  Thus, randomly assigning participants to one of these two conditions made it possible to manipulate facial muscles without making participants aware of the associated emotion.  Strack and colleagues reported two experiments that showed effects of the experimental manipulation.  Or did they?  It depends on the statistical test being used.

[Image: Slate article on the facial-feedback replication]

Experiment 1 had three conditions. The control group did the same study without a manipulation of the facial muscles. The dependent variable was funniness ratings of cartoons.  The mean funniness of the cartoons was highest in the smile condition, followed by the control condition, with the lowest mean in the frown condition.  However, a commonly used Analysis of Variance would not have produced a significant result, and neither would a two-tailed t-test.  Only a linear contrast with a one-tailed t-test produced a just significant result, t(89) = 1.85, p = .03.  So, Fritz Strack was rather lucky to get a significant result.  Sampling error could easily have changed the pattern of means slightly, and even the directional test of the linear contrast would not have been significant.  At the same time, sampling error might have worked against the facial-feedback hypothesis, and the real effect might be stronger than this study suggests. In that case, we would expect to see stronger evidence in Study 2.  However, Study 2 failed to show any effect on funniness ratings of cartoons.  “As seen in Table 2, subjects’ evaluations of the cartoons were hardly affected under the different experimental conditions. The ANOVA showed no significant main effects or interactions, all ps > .20” (Strack et al., 1988).  However, Study 2 also included amusement ratings, and the amusement ratings once more showed a just significant result with a one-tailed t-test, t(75) = 1.78, p = .04.  The article also provides an explanation for the just-significant result in Study 1, even though Study 1 used funniness ratings of cartoons: when participants are not asked to differentiate between their subjective feelings of amusement and the objective funniness of the cartoons, subjective feelings influence funniness ratings, but when given a chance to differentiate between the two, subjective feelings no longer influence funniness ratings.
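For readers who want to check these numbers, the reported one-tailed p-values can be reproduced in R from the published t-values and degrees of freedom (this is only a sanity check of the reported statistics, not a reanalysis of the raw data).

# Reproducing the reported one-tailed p-values from the published test statistics.
pt(1.85, df = 89, lower.tail = FALSE)  # Study 1, linear contrast: ~ .03
pt(1.78, df = 75, lower.tail = FALSE)  # Study 2, amusement ratings: ~ .04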

For 25 years, this article was uncritically cited as evidence for the facial-feedback hypothesis, but none of the 17 labs that participated in the RRR was able to produce a significant result. More importantly, even an analysis with the combined power of all studies failed to detect an effect.  Some critics pointed out that this result successfully replicates the finding of the original two studies, which also failed to report statistically significant results by the conventional standards of a two-tailed test (or z > 1.96).

Given the shaky evidence in the original article, it is not clear why Fritz Strack volunteered his study for a replication attempt.  However, it is easier to understand his response to the results of the RRR.  He does not take the results seriously.  He would rather believe his two original, marginally significant studies than the 17 replication studies.

“Fritz Strack has no regrets about the RRR, but then again, he doesn’t take its findings all that seriously. “I don’t see what we’ve learned,” he said.”  (Slate).

One of the most bizarre statements by Strack can only be interpreted as revealing a shocking lack of understanding of probability theory.

“So when Strack looks at the recent data he sees not a total failure but a set of mixed results. Nine labs found the pen-in-mouth effect going in the right direction. Eight labs found the opposite. Instead of averaging these together to get a zero effect, why not try to figure out how the two groups might have differed? Maybe there’s a reason why half the labs could not elicit the effect.” (Slate).

This is like a roulette player who, after a night of gambling, sees that about half of his bets won and half lost, and ponders why half of the attempts produced losses. Strack does not seem to realize that the results of individual studies move simply by chance, just as roulette balls produce different results by chance. Some people find cartoons funnier than others, and the group means will depend on the allocation of these individuals to the different conditions.  This is called sampling error, and it is why we need to do statistical tests in the first place.  Apparently it is possible to become a famous social psychologist without understanding the purpose of computing and reporting p-values.

And the full force of defense mechanisms is apparent in the next statement.  “Given these eight nonreplications, I’m not changing my mind. I have no reason to change my mind,” Strack told me. Studies from a handful of labs now disagreed with his result. But then, so many other studies, going back so many years, still argued in his favor. (Slate).

No, there were not eight non-replications. There were 17!  We would expect half of the studies to match the direction of the original effect simply due to chance alone.
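A trivial simulation in R makes the point: if the true effect were zero, the direction of each lab’s mean difference would be a coin flip, and a 9-to-8 split across 17 labs is exactly what chance predicts (the number of simulated replications below is arbitrary).

# Illustrative sketch: direction of results across 17 labs when the true effect is zero.
set.seed(123)                                 # for reproducibility
hits <- rbinom(10000, size = 17, prob = .5)   # labs whose effect goes in the "right" direction
mean(hits)                                    # on average ~ 8.5 of 17 labs
mean(hits %in% c(8, 9))                       # an 8-to-9 or 9-to-8 split occurs in roughly 37% of cases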

But this is not all.  Strack even accused the replication team of “reverse p-hacking.” (Strack, 2016).  The term p-hacking was coined by Simmons et al. (2011) to describe a set of research practices that can be used to produce statistically significant results in the absence of a real effect (fabricating false positives).  Strack turned it around and suggested that the replication team used statistical tricks to make the facial feedback effect disappear.  “Without insinuating the possibility of a reverse p hacking, the current anomaly needs to be further explored.” (p. 930).

However, the statistical anomaly that requires explanation could just be sampling error (Hilgard), and the observed pattern is actually the wrong one for a claim of reverse p-hacking.  Reverse p-hacking implies that some studies did produce a significant result but statistical tricks were used to report the result as non-significant. This would lead to a restriction in the variability of results across studies, which can be detected with the Test for Insufficient Variance (Schimmack, 2015), but there is no evidence for reverse p-hacking in the RRR.

Fritz Strack also tried to make his case on social media, but there was very little support for his view that 17 failed replication studies can be ignored (PsychMAP thread).

strack-psychmap

Strack’s desperate attempts to defend his famous original study in the light of a massive replication failure provide further evidence for the inability of the psychology establishment to face the reality that many celebrated discoveries in psychology rest on shaky evidence and a mountain of repressed failed studies.

Meanwhile, the Test for Insufficient Variance provides a simple explanation for the replication failure, namely that the original results were rather unlikely to occur in the first place.  Converting the observed t-values into z-scores shows very low variability, Var(z) = 0.003. The probability of observing a variance this small or smaller in a pair of studies is only p = .04.  It is just not very likely for such an improbable event to repeat itself.
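This calculation can be reconstructed in R from the two published test statistics, assuming the z-scores are obtained by converting the one-tailed p-values of the reported t-tests (a sketch of the Test for Insufficient Variance logic, not the original analysis script).

# Reconstructing the Test for Insufficient Variance for the two original studies.
p <- c(pt(1.85, df = 89, lower.tail = FALSE),  # Study 1, one-tailed p
       pt(1.78, df = 75, lower.tail = FALSE))  # Study 2, one-tailed p
z <- qnorm(p, lower.tail = FALSE)              # convert p-values to z-scores
var(z)                                         # ~ 0.003
# With unbiased reporting, (k - 1) * var(z) follows a chi-square distribution with k - 1 df.
k <- length(z)
pchisq((k - 1) * var(z), df = k - 1)           # ~ .04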

6. Insufficient Power in Power-Posing Research

When you google “power posing” the featured link shows Amy Cuddy giving a TED talk about her research. Not unlike facial feedback, power posing assumes that bodily feedback can have powerful effects.

[Image: Amy Cuddy’s power-posing TED talk]

When you scroll down the page, you might find a link to an article by Gelman and Fung (Slate).

Gelman has been an outspoken critic of social psychology for some time.  This article is no exception. “Some of the most glamorous, popular claims in the field are nothing but tabloid fodder. The weakest work with the boldest claims often attracts the most publicity, helped by promotion from newspapers, television, websites, and best-selling books.”

[Image: Wonder Woman pose]

They point out that a much larger study than the original study failed to replicate the original findings.

“An outside team led by Eva Ranehill attempted to replicate the original Carney, Cuddy, and Yap study using a sample population five times larger than the original group. In a paper published in 2015, the Ranehill team reported that they found no effect.”

They have little doubt that the replication study can be trusted and suggest that the original results were obtained with the help of questionable research practices.

“We know, though, that it is easy for researchers to find statistically significant comparisons even in a single, small, noisy study. Through the mechanism called p-hacking or the garden of forking paths, any specific reported claim typically represents only one of many analyses that could have been performed on a dataset.”

The replication study was published in 2015, so this replication failure does not really belong in a review of 2016.  Indeed, the big news in 2016 was that Cuddy’s co-author Carney distanced herself from her contribution to the power-posing article.   Her public rejection of her own work (New Yorker Magazine) spread like wildfire through social media (Psych Methods FB Group Posts 1, 2, but see 3). Most responses were very positive.  Although science is often considered a self-correcting system, individual scientists rarely correct mistakes or retract articles if they discover a mistake after publication.  Carney’s statement was seen as breaking with the implicit norm of the establishment to celebrate every published article as an important discovery and to cover up mistakes even in the face of replication failures.

[Image: Carney’s statement]

Not surprisingly, Amy Cuddy, the main proponent of power posing, defended her claims. Her response makes many points, but there is one glaring omission: she does not mention the evidence that published results are selected to confirm theoretical claims, and she does not comment on the fact that there is no evidence for power posing after correcting for publication bias.  The psychology establishment also appears to be more interested in propping up a theory that has created a lot of publicity for psychology than in critically examining the scientific evidence for or against power posing (APS Annual Meeting, 2017, Program, Presidential Symposium).

7. Commitment Priming: Another Failed Registered Replication Report

Many research questions in psychology are difficult to study experimentally.  For example, it would be difficult and unethical to study the effect of infidelity on romantic relationships by assigning one group of participants to an infidelity condition and making them cheat on their partners.  Social psychologists have developed a solution to this problem.  Rather than creating real situations, participants are primed to think about infidelity, and if these thoughts change their behavior, the results are interpreted as evidence for the effect of real infidelity.

Eli Finkel and colleagues used this approach to experimentally test the effect of commitment on forgiveness.  To manipulate commitment, participants in the experimental group were given statements that were supposed to elicit commitment-related thoughts.  To make sure that this manipulation worked, participants then completed a commitment measure.  In the original article, the experimental manipulation had a strong effect on this measure, d = .74, which was highly significant, t(87) = 3.43, p < .001.

Irene Cheung, Lorne Campbell, and Etienne P. LeBel spearheaded an initiative to replicate the experimental effect of commitment priming on forgiveness.  Eli Finkel worked closely with the replication team to ensure that the replication studies matched the original study as closely as possible.  Yet the replication studies failed to demonstrate the effectiveness of the commitment manipulation. Even with the much larger sample size, there was no significant effect, and the effect size was close to zero.

The authors of the replication report were surprised by the failure of the manipulation: “It is unclear why the RRR studies observed no effect of priming on subjective commitment when the original study observed a large effect. Given the straightforward nature of the priming manipulation and the consistency of the RRR results across settings, it seems unlikely that the difference resulted from extreme context sensitivity or from cohort effects (i.e., changes in the population between 2002 and 2015).” (PPS, 2016, p. 761).

The author of the original article, Eli Finkel, also has no explanation for the failure of the experimental manipulation: “Why did the manipulation that successfully influenced commitment in 2002 fail to do so in the RRR? I don’t know.” (PPS, 2016, p. 765).  However, Eli Finkel also reports that he made changes to the manipulation in subsequent studies: “The RRR used the first version of a manipulation that has been refined in subsequent work. Although I believe that the original manipulation is reasonable, I no longer use it in my own work. For example, I have become concerned that the “low commitment” prime includes some potentially commitment-enhancing elements (e.g., “What is one trait that your partner will develop as he/she grows older?”). As such, my collaborators and I have replaced the original 5-item primes with refined 3-item primes (Hui, Finkel, Fitzsimons, Kumashiro, & Hofmann, 2014). I have greater confidence in this updated manipulation than in the original 2002 manipulation. Indeed, when I first learned that the 2002 study would be the target of an RRR—and before I understood precisely how the RRR mechanism works—I had assumed that it would use this updated manipulation.” (PPS, 2016, p. 766).

Surprisingly, this potential problem with the original manipulation was never brought up during the planning of the replication study (FB discussion group).

[Image: Facebook discussion of the commitment-priming RRR]

Hui et al. (2014) also do not mention any concerns about the original manipulation.  They simply wrote, “Adapting procedures from previous research (Finkel et al., 2002), participants in the high commitment prime condition answered three questions designed to activate thoughts regarding dependence and commitment.” (JPSP, 2014, p. 561).  The results of the manipulation check closely replicated the results of the 2002 article: “The analysis of the manipulation check showed that participants in the high commitment prime condition (M = 4.62, SD = 0.34) reported a higher level of relationship commitment than participants in the low commitment prime condition (M = 4.26, SD = 0.62), t(74) = 3.11, p < .01.” (JPSP, 2014, p. 561).  The study also produced a just-significant result for a predicted effect of the manipulation on support for a partner’s goals that are incompatible with the relationship, beta = .23, t(73) = 2.01, p = .05.  Such just-significant results are improbable in unbiased data and often fail to replicate in replication studies (OSC, Science, 2016).

Altogether, the results of yet another registered replication report raise major concerns about the robustness of priming as a reliable method to alter participants’ beliefs and attitudes.  Selective reporting of studies that “worked” has created the illusion that priming is a very effective and reliable method to study social cognition. However, even social-cognition theories suggest that priming effects should be limited to specific situations and should not have strong effects on judgments that are highly relevant or for which chronically accessible information is readily available.

8. Concluding Remarks

Looking back, 2016 has been a good year for the reform movement in psychology.  High-profile replication failures have shattered the credibility of established psychology.  Attempts by the establishment to discredit critics have backfired. A major problem for the establishment is that they themselves do not know how big the crisis is and which findings are solid.  Consequently, there has been no major initiative by the establishment to mount replication projects that provide positive evidence for some important discoveries in psychology.  Looking forward to 2017, I anticipate no major changes.  Several registered replication studies are in the works, and prediction markets anticipate further failures.  For example, a registered replication report of “professor priming” studies is predicted to produce a null result.

[Image: prediction market for the professor-priming RRR]

If you are still looking for a New Year’s resolution, you may consider signing on to Brent W. Roberts, Rolf A. Zwaan, and Lorne Campbell’s initiative to improve research practices. You may also want to become a member of the Psychological Methods Discussion Group, where you can find out in real time about major events in the world of psychological science.

Have a wonderful new year.

 

 

Z-Curve: Estimating Replicability of Published Results in Psychology (Revision)

Jerry Brunner and I developed two methods to estimate the replicability of published results based on the test statistics reported in original studies.  One method, z-curve, is used to provide replicability estimates in my powergraphs.

In September, we submitted a manuscript that describes these methods to Psychological Methods, where it was rejected.

We have now revised the manuscript. The new manuscript contains a detailed discussion of various criteria for replicability, with arguments for why a significant result in an exact replication study is an important, if not the only, criterion for evaluating the outcome of replication studies.

It also makes a clear distinction between selection for significance in an original study and the file-drawer problem in a series of conceptual or exact replication studies. Our methods assume only selection for significance in original studies, and no additional file-drawering or questionable research practices.  This idealistic assumption may explain why our model predicts a much higher success rate in the OSC reproducibility project (66%) than was actually obtained (36%).  As there is ample evidence of file drawers filled with non-significant conceptual replication studies, we believe that file-drawering and questionable research practices contribute to the low success rate in the OSC project. However, we also mention concerns about the quality of some replication studies.

We hope that the revised version is clearer, but fundamentally nothing has changed. Reviewers at Psychological Methods did not like our paper, and the editor thought NHST is no longer relevant (see editorial letter and reviews), but nobody challenged our statistical method or the results of the simulation studies that validate it. The method works, and it provides an estimate of replicability under very idealistic conditions, which means we should expect a considerably lower success rate in actual replication studies as long as researchers file-drawer non-significant results.