A post a few weeks ago calculated the fragility index of the NINDS trial (which turned out to be only three). Very briefly, the fragility index is the number of events that would need to be changed for the p-value to rise above 0.05, rendering the study “statistically insignificant.” Ryan Radecki commented that he was concerned that the fragility index was married to the p-value, thereby inheriting the flaws of frequentist statistics. Perhaps we should ditch the p-value and the fragility index, switching instead to a purely Bayesian approach to statistics?
This post will attempt to answer this question by exploring the relationship between the fragility index, p-values, and Bayes Factors for a 2×2 table.
Frequentist statistics is based upon calculating the probability of obtaining data at least as extreme as those observed if the null hypothesis were true (the p-value). The problem with this approach is that it tells us little about the probability of the experimental hypothesis being true (if this doesn't make sense, start with this post).
Bayesian statistics takes a broader approach. In Bayesian statistics, the probability of the observed data occurring is determined both under the null hypothesis and also under the experimental hypothesis (figure below). The Bayes Factor is the ratio of these probabilities. This Bayes Factor also functions as a likelihood ratio to calculate the post-experiment odds that the experimental hypothesis is true.
Bayesian statistics is a more robust approach, but this comes at a price. Bayesian statistics hasn't yet been broadly applied due to several drawbacks:
One way to render Bayesian statistics a bit more objective and within the reach of clinicians is to calculate the maximal Bayes Factor that could be obtained for any experimental hypothesis (1). By definition, this is a generously high Bayes Factor which will favor the experimental hypothesis. As such, the maximal Bayes Factor can be useful for defining the limits of what the study cannot prove (2). For a 2×2 table, the maximal Bayes Factor is calculated with this equation (where x is equal to the value of the chi-squared test statistic) (3).
For a 2×2 table, the p-value can be calculated using a chi-square distribution with one degree of freedom (4). Since both the p-value and the maximal Bayes Factor are calculated from the chi-squared statistic, they can be directly plotted against each other:
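For readers who want to check the p-value side of this relationship themselves, here is a minimal sketch in Python. It relies on the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable, so the upper-tail probability can be written with the complementary error function. The function name `chi2_p_value_1df` is just an illustrative choice, not something taken from the references.

```python
import math


def chi2_p_value_1df(x: float) -> float:
    """Upper-tail p-value for a chi-squared statistic with one degree of freedom."""
    # With 1 df, chi2 = Z**2 for a standard normal Z, so
    # P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))


# Print the p-value for a few chi-squared values.
for x in (1.0, 3.84, 6.63):
    print(f"chi2 = {x:5.2f}  ->  p = {chi2_p_value_1df(x):.3f}")
```

A chi-squared statistic of about 3.84 corresponds to the conventional p = 0.05 cutoff.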
A few observations are in order:
The close relationship between the maximal Bayes Factor and the p-value implies that Fragility must be related to both. The key here is really how robust the chi-squared test statistic is to changes in the data (because both the p-value and the maximal Bayes Factor ultimately depend on the chi-squared statistic). The chi-squared statistic tends to be more labile when fewer events are observed, leading to fragility in both the p-value and the maximal Bayes Factor.
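To make the lability of the chi-squared statistic concrete, here is a rough sketch in Python. It assumes the ordinary Pearson chi-squared statistic without continuity correction, and the two trials are invented purely for illustration (they are not data from any study discussed here). The function name `chi_squared_2x2` is likewise hypothetical.

```python
def chi_squared_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-squared statistic (no continuity correction) for the table

               event   no event
        arm 1    a        b
        arm 2    c        d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical trials, invented for illustration only.
trials = {
    "small (50 per arm)": (10, 40, 20, 30),
    "large (1000 per arm)": (200, 800, 240, 760),
}

for name, (a, b, c, d) in trials.items():
    before = chi_squared_2x2(a, b, c, d)
    # Change one patient in arm 1 from "no event" to "event".
    after = chi_squared_2x2(a + 1, b - 1, c, d)
    print(f"{name}: chi2 {before:.2f} -> {after:.2f} (change {before - after:.2f})")
```

In this made-up example both trials start with a chi-squared statistic above 3.84 (the p = 0.05 threshold for one degree of freedom). Changing a single outcome pushes the small trial below that threshold, while the large trial moves much less and stays above it.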
Consider two studies which both yield a Bayes Factor of 6. In the first study, if the outcome of one patient is changed, the Bayes Factor decreases from 6 to 4. In the second study, if the outcome of one patient is changed, the Bayes Factor decreases from 6 to 5.8. Clearly the first study is more “fragile” than the second study.
Bayesian statistics doesn't yield a single binary outcome (i.e. p<0.05) the way that Frequentist statistics does. Thus, it may be less clear how exactly to quantify Fragility in a Bayesian statistical system. One approach could be the deviation in the Bayes Factor if a single patient outcome is changed.
However it is expressed, some description of Fragility should be useful whether the analysis is Bayesian or Frequentist. Furthermore, the fragility of a dataset should be similar under either approach. For example, if a study is fragile using a Frequentist analysis, it should also be fragile using a Bayesian analysis.
Current guidelines recommend lung cancer screening with CT scans, based almost exclusively on a single trial: the NLST trial from 2011. This was a prospective multi-center RCT which randomized 53,454 patients to receive lung cancer screening with chest X-rays vs. CT scans.
It's controversial whether the endpoint of screening trials should be disease-specific mortality or all-cause mortality (Penston 2011). Screening often manages to reduce disease-specific mortality, without meeting the more robust outcome of reducing all-cause mortality. The NLST trial chose disease-specific mortality as its primary endpoint. The results were interpreted as showing a reduction in both disease-specific and all-cause mortality, providing strong support for implementing CT screening:
Let's evaluate the data from this study regarding whether lung cancer screening reduces all-cause mortality. The following 2×2 table relates lung cancer screening with mortality:
The chi-square statistic for this study is 4.16, yielding a p-value of 0.041 (5). This p-value is below the standard cutoff of 0.05, so the traditional interpretation of this data is that it shows a statistically significant reduction in all-cause mortality.
The Fragility Index for all-cause mortality is five (calculated using, e.g., the Fragility Index Calculator).
The instability index can be calculated as follows:
Instability Index = (Effects from baseline imbalance) + (Post-randomization crossover) + (Loss to follow-up)
Overall the instability index might be estimated to be ~2700 patients. Given that the instability index is far greater than the fragility index, the finding that CT scans improve all-cause mortality appears quite fragile.
These results have a maximal Bayes Factor of 9.9. Here is where things get tricky. In order to determine the probability that CT scans reduce all-cause mortality, we must first determine the pre-test probability that this is true. This is often impossible, so typically a few pre-test probabilities are used. In this case, the following may be used:
Based on a Bayes Factor <9.9 and these pre-test probabilities, we can now calculate the post-test probabilities:
Even with the most generous prior probability, the post-test probability must be below 91%. Thus, this study cannot establish that an all-cause mortality benefit exists with >95% certainty. More realistically, the post-test probability is probably about 60-80%, so this data isn't close to being definitive. Thus the widespread claim that lung cancer screening “saves lives” may not be true (6).
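The arithmetic behind these post-test probabilities is just the odds form of Bayes' theorem: convert the pre-test probability to odds, multiply by the Bayes Factor, and convert back. Here is a minimal sketch in Python. The function name is hypothetical, and the pre-test probabilities in the loop are illustrative placeholders rather than a definitive set; the 50% value is included because it reproduces the ~91% ceiling quoted above.

```python
def post_test_probability(pre_test_probability: float, bayes_factor: float) -> float:
    """Convert a pre-test probability to a post-test probability via a Bayes Factor."""
    if not 0.0 < pre_test_probability < 1.0:
        raise ValueError("pre_test_probability must be strictly between 0 and 1")
    pre_test_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_test_odds = pre_test_odds * bayes_factor
    return post_test_odds / (1.0 + post_test_odds)


MAX_BAYES_FACTOR = 9.9  # maximal Bayes Factor reported above for all-cause mortality

# Illustrative pre-test probabilities (assumptions, chosen only to show the arithmetic).
for prior in (0.10, 0.25, 0.50):
    upper_bound = post_test_probability(prior, MAX_BAYES_FACTOR)
    print(f"pre-test {prior:.0%} -> post-test at most {upper_bound:.1%}")
```

Because 9.9 is the maximal Bayes Factor, each result is an upper bound on the post-test probability for that prior, not an estimate of it.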
The fragility index and Bayesian analysis suggest that medical evidence isn't nearly as robust as we thought. For example, clinical studies which are considered the bedrock of modern medicine (e.g. the NINDS trial) may actually be fragile and non-definitive. Smaller trials are often worse. It quickly becomes clear why medicine continues to suffer from reversals in practice.
This may be disconcerting. However, from the standpoint of any individual practitioner, it may not make a huge difference. We must address clinical scenarios with the best evidence available. Whether that evidence is perfect or imperfect, we don't have the luxury to wait for better evidence. In this context, it is often sensible to base therapy on evidence which is fragile or has borderline statistical significance.
The situation is reversed when considering evidence-based guidelines. Premature acceptance of an ineffective therapy into guidelines may have profoundly negative consequences. First, this discourages further research into the therapy, leading to inertia. Second, physicians and hospitals may feel obligated to adhere to this therapy, which is suddenly considered the “standard of care.” Finally, future therapies may be tested in addition to, rather than in comparison to, the ineffective therapy. Eventually this creates a complex treatment regimen resembling a house of cards.