You’ve probably heard of the IOTA trial by now.1 It’s a meta-analysis in Lancet comparing the effect of conservative versus liberal oxygen therapy on survival in critical illness (details here). Its conclusions are bold and exciting:
I agree with the study’s results (hyperoxia is aphysiologic and probably dangerous). However, we need to recognize that this is a very fragile analysis. This should open our eyes to problems endemic in meta-analyses.
Patient-level fragility: The fragility index
The fragility index is the number of outcomes which would need to be changed in order for a study to lose statistical significance (causing the p-value to rise above 0.05).3 A low fragility index indicates a lack of robustness. For example, the NINDS trial has a fragility index of three.4 If three outcomes had been different, the results would have lost statistical significance and the trial would have been regarded as a “negative” study.
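As a rough illustration of the definition above, here is a minimal sketch of a fragility index calculation for a two-arm trial with a binary endpoint. It assumes a two-sided Fisher exact test and flips survivors to events one at a time in the arm with fewer events, which is one common convention. The function names and the trial counts are hypothetical, chosen only to keep the arithmetic small.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x events in the first row
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Add up every table that is no more likely than the observed one
    return sum(
        p for p in map(prob, range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Count the non-event-to-event flips needed before p >= alpha.

    Flips are made in the arm with fewer events (an assumed convention).
    Returns 0 if the trial is not significant to begin with.
    """
    if events_1 > events_2:
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    flips = 0
    while events_1 < n_1 and fisher_exact_two_sided(
        events_1, n_1 - events_1, events_2, n_2 - events_2
    ) < alpha:
        events_1 += 1
        flips += 1
    return flips


# Hypothetical trial: 0/10 events in one arm versus 5/10 in the other
print(fragility_index(0, 10, 5, 10))  # 1
```

In this toy trial the starting p-value is about 0.03, and a single changed outcome pushes it to about 0.14, so the fragility index is one.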
The fragility index is easy to calculate for a study with a binary endpoint (e.g. using a fragility index calculator here). However, calculating the fragility index for a meta-analysis is a bit murky. To my knowledge it hasn’t been done. But yeah, we’re gonna go there.
The first step to test the fragility of a meta-analysis is to replicate the calculations involved. The IOTA trial used RevMan 5.3, which is freely available software produced by the Cochrane Community. Following methods described in the paper, I reproduced their calculations on my computer (screenshot below). Every number matches the published manuscript exactly, except for a typo in the Lancet.
So overall, it would probably take about 5-9 patients with a different outcome to make the meta-analysis negative. These flipped outcomes could occur within a single study, or across multiple studies. For example, adding one bad outcome to the conservative oxygen group in the first nine studies would nullify the meta-analysis:
Different outcomes among patients in the liberal oxygen group could also affect the outcome. For example, the study could be nullified by three fewer deaths in the liberal oxygen group plus four additional deaths in the conservative oxygen group:
It’s difficult to put an exact number on the fragility index, but it’s apparently <10. Patients whose outcomes could be changed to nullify the study include patients in the conservative group who survived (7857 in total) and patients in the liberal group who died (283 total) – for a grand total of 8,142 patients. It’s easy to imagine that technical problems (e.g. protocol violations) could cause a shift of <10 outcomes among 8,142 patients.
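The outcome-flipping exercise described above can be sketched in code. This is not a reproduction of the RevMan analysis: it assumes inverse-variance pooling of log risk ratios with a DerSimonian-Laird random-effects variance, a 0.5 continuity correction for zero cells, and a simple rule of adding one event at a time to arm A (the arm that appears to benefit), cycling through the trials in order. The function names and the two-trial dataset are hypothetical.

```python
from math import erfc, log, sqrt


def pooled_p_value(studies):
    """Two-sided p-value for the pooled risk ratio (arm A versus arm B).

    studies: list of (events_a, n_a, events_b, n_b), one entry per trial.
    """
    effects, variances = [], []
    for e_a, n_a, e_b, n_b in studies:
        if e_a == 0 and e_b == 0:
            continue  # assumption: trials with no events in either arm are skipped
        if e_a == 0 or e_b == 0:
            # assumption: 0.5 continuity correction when one arm has no events
            e_a, n_a, e_b, n_b = e_a + 0.5, n_a + 1, e_b + 0.5, n_b + 1
        effects.append(log((e_a / n_a) / (e_b / n_b)))
        variances.append(1 / e_a - 1 / n_a + 1 / e_b - 1 / n_b)
    if not effects:
        return 1.0  # nothing to pool

    weights = [1 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    tau2 = 0.0  # between-trial variance (DerSimonian-Laird estimate)
    if len(effects) > 1:
        q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
        c = sum(weights) - sum(w * w for w in weights) / sum(weights)
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    z = pooled * sqrt(sum(re_weights))
    return erfc(abs(z) / sqrt(2))


def flips_to_nullify(studies, alpha=0.05):
    """Add events to arm A, one trial at a time in rotation, until p >= alpha."""
    work = [list(s) for s in studies]
    flips, cursor = 0, 0
    while pooled_p_value(work) < alpha:
        if all(s[0] >= s[1] for s in work):
            raise ValueError("no survivors left to flip in arm A")
        while work[cursor][0] >= work[cursor][1]:
            cursor = (cursor + 1) % len(work)
        work[cursor][0] += 1
        cursor = (cursor + 1) % len(work)
        flips += 1
    return flips


# Hypothetical meta-analysis of two identical trials: 20/100 versus 30/100 deaths
toy = [(20, 100, 30, 100), (20, 100, 30, 100)]
print(flips_to_nullify(toy))  # 3
```

Different flipping rules (for example, concentrating all changes in one trial, or removing deaths from arm B) will give somewhat different counts, which is why the post reports a range rather than a single number.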
When we read a meta-analysis containing thousands of patients, we naturally assume that increasing the number of patients involved will increase the robustness of the analysis. Unfortunately, this often isn't true. If positive and negative studies mostly cancel each other out, then the fragility index of the meta-analysis remains low. Piling on more studies increases the “noise” without increasing the “signal” – with the ultimate result being a huge meta-analysis with a relatively low fragility index (low signal/noise ratio).
Study-level fragility: The metafragility index
Another way to gauge fragility is at the level of individual studies which compose the meta-analysis. A meta-analysis is essentially a method of weighing the combined evidence from numerous studies regarding a therapy:
A robustly positive meta-analysis should include lots of studies which are strongly positive. As such, you could remove any individual study and it wouldn’t affect the overall conclusion of the meta-analysis:
Alternatively, let’s imagine a very weakly positive meta-analysis shown below. Each of the five positive studies only weakly supports the intervention. Combined, these five weak studies are able to tip the balance in favor of the intervention. However, if we remove any of these studies, then the meta-analysis is no longer positive! The validity of the meta-analysis depends on the validity of every single study – making the meta-analysis extremely fragile.
- The overall meta-analysis results are re-calculated in the absence of each individual study, to determine if the meta-analysis depends on that study to achieve p<0.05.
- The metafragility index equals the number of studies which, when removed individually, cause the meta-analysis to have a negative result (p>0.05). For example, in the robust meta-analysis shown above, the metafragility index is zero (any study can be removed and the meta-analysis remains positive). Alternatively, in the fragile meta-analysis shown above, the metafragility index is five (if any one of the five studies is removed, the p-value rises above 0.05).
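The two steps above amount to a leave-one-out analysis, which can be sketched as follows. The pooling model here is an assumption (inverse-variance log risk ratios with a DerSimonian-Laird random-effects variance and a 0.5 continuity correction), not necessarily the exact RevMan configuration, and the trial names and counts are hypothetical.

```python
from math import erfc, log, sqrt


def pooled_p_value(studies):
    """Two-sided p-value for the pooled risk ratio (arm A versus arm B).

    studies: list of (events_a, n_a, events_b, n_b), one entry per trial.
    """
    effects, variances = [], []
    for e_a, n_a, e_b, n_b in studies:
        if e_a == 0 and e_b == 0:
            continue  # assumption: trials with no events in either arm are skipped
        if e_a == 0 or e_b == 0:
            # assumption: 0.5 continuity correction when one arm has no events
            e_a, n_a, e_b, n_b = e_a + 0.5, n_a + 1, e_b + 0.5, n_b + 1
        effects.append(log((e_a / n_a) / (e_b / n_b)))
        variances.append(1 / e_a - 1 / n_a + 1 / e_b - 1 / n_b)
    if not effects:
        return 1.0  # nothing to pool

    weights = [1 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    tau2 = 0.0  # between-trial variance (DerSimonian-Laird estimate)
    if len(effects) > 1:
        q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
        c = sum(weights) - sum(w * w for w in weights) / sum(weights)
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    z = pooled * sqrt(sum(re_weights))
    return erfc(abs(z) / sqrt(2))


def studies_depended_on(studies, alpha=0.05):
    """Names of trials whose individual removal makes the pooled result p >= alpha.

    studies: dict mapping trial name to (events_a, n_a, events_b, n_b).
    The metafragility index is the length of the returned list.
    """
    if pooled_p_value(list(studies.values())) >= alpha:
        return []  # the full meta-analysis is already negative
    dependent = []
    for name in studies:
        rest = [s for other, s in studies.items() if other != name]
        if pooled_p_value(rest) >= alpha:
            dependent.append(name)
    return dependent


# Hypothetical trials, each weakly favoring arm A (20/100 versus 30/100 deaths)
fragile = {"Trial A": (20, 100, 30, 100), "Trial B": (20, 100, 30, 100)}
robust = dict(fragile, **{"Trial C": (20, 100, 30, 100)})

print(len(studies_depended_on(fragile)))  # 2
print(len(studies_depended_on(robust)))   # 0
```

With two weak trials, removing either one leaves a single non-significant trial, so the metafragility index is two. Adding a third identical trial makes every leave-one-out analysis significant, so the index drops to zero.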
A metafragility index greater than zero reveals hidden weakness in the meta-analysis. When we read a meta-analysis, we generally assume that “the whole is greater than the sum of the parts” – that is, the results of the meta-analysis reflect a melding of data from all of the studies. As such, we expect that the meta-analysis should be more robust than any individual study. If the metafragility index is above zero, this disproves the notion that the meta-analysis is stronger than any individual study.
“When the removal of just one or two studies has a considerable impact on the conclusions from a meta-analysis, then these conclusions must be stated with some caution” – Viechtbauer W & Cheung MWL5
If the metafragility index is over zero, the next question is which studies the meta-analysis is dependent upon. A chain is only as strong as its weakest link. Likewise, the meta-analysis is only as robust as the weakest study upon which it depends. For example, if a meta-analysis is dependent on a large, robust, multi-center RCT – that might be acceptable. Alternatively, if the meta-analysis is dependent on a small and poorly executed trial, that’s deeply problematic.
IOTA has a metafragility index of two, being dependent on NCT00414726 and Girardis 2016.6 For example, if we remove the Girardis trial, the overall meta-analysis becomes negative (p=0.14):
A nice way to visualize this data is a metafragility plot (see below). Blue lines constitute a traditional forest plot (each representing the 95% confidence interval of the relative risk based on data from that individual trial). Underneath each study is a black line which shows the results of performing a meta-analysis using all the trials except for that specific trial:
The metafragility plot helps illustrate the effect that each study has on the overall meta-analysis. Removal of any study would cause the meta-analysis to shift in the opposite direction. For example, Stub 2012 was a relatively left-lying study (figure below). Removal of this study shifts the resulting meta-analysis to the right. The amount of rightward shift when this study is removed reflects how much impact this study has on the overall meta-analysis.7
The plot also functions as a graphical display of the metafragility index. If a trial is required for the meta-analysis to be positive, then the meta-analysis performed without it (the black line) will have a 95% confidence interval that crosses one. For IOTA, which has a metafragility index of two, this occurs twice:
A study can have a high impact on a meta-analysis for one of two reasons: it may be a very large trial (with robust results and a large weight), or it may be an outlier. IOTA depends on the NCT00414726 trial, which is a small, outlying trial. NCT00414726 isn’t a robust trial, but because it is an outlier it exerts a strong pull on the meta-analysis results. This study wasn’t ever published, raising major concerns about systematic bias or other flaws. In fact, the published IOTA manuscript judged this study to be at high risk of bias (see quoted text below). The IOTA meta-analysis is quite fragile because it depends entirely on this trial which is small, outlying, unpublished, and potentially biased.
This post focuses on in-hospital mortality, because that is the analysis which the authors focused on in the paper. It also happens to be the most impressive result of the trial. Over time, the mortality benefit rapidly evaporates:
For this later mortality analysis, the metafragility index is 10, which is basically the number of positive-leaning studies. Thus, removal of nearly any positive data from the meta-analysis will cause a loss of statistical significance. This reveals profound fragility, which is consistent with a patient-level fragility index of one. Overall this result is as fragile as is statistically possible (if it were any weaker, it would become flat-out negative).
One step back: p<0.05 is actually a low bar to clear
The above analysis focuses on whether or not the meta-analysis could reach a p-value <0.05. However, it should be borne in mind that this isn’t a robust p-value cutoff. A p-value of ~0.02-0.05 is roughly equivalent to a likelihood ratio of ~3 and thus constitutes a weak level of evidence. This has led many statisticians to recommend lowering the p-value cutoff to <0.005. Although p<0.05 currently remains the conventional cutoff for “statistical significance,” meeting this criterion is no guarantee that the study results are valid.
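The post does not say how the likelihood ratio of ~3 was derived. One calibration sometimes used for this kind of conversion is the Sellke-Bayarri-Berger bound, which caps the odds in favor of a real effect at 1/(-e · p · ln p) for p < 1/e. The sketch below applies that bound purely as an illustration. It is an assumption about the type of conversion intended, not the author's stated method, and the function name is hypothetical.

```python
from math import e, log


def max_odds_for_effect(p):
    """Upper bound on the odds favoring a real effect implied by a p-value.

    Uses the bound 1 / (-e * p * ln p), which only applies for 0 < p < 1/e.
    """
    if not 0 < p < 1 / e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return 1 / (-e * p * log(p))


for p in (0.05, 0.02, 0.005):
    print(p, round(max_odds_for_effect(p), 1))
# 0.05 2.5
# 0.02 4.7
# 0.005 13.9
```

Under this particular calibration, p-values between 0.02 and 0.05 correspond to odds of at most roughly 2.5 to 4.7, in the same range as the ~3 quoted above, while p=0.005 corresponds to odds of at most about 14.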
Bigger picture: Our overblown opinion of meta-analyses
Evidence-based medicine traditionally teaches that meta-analysis is the highest form of evidence, sitting atop the pyramid of evidence. However, it’s easy to find meta-analyses which contradict each other (e.g. this pair disagrees about PE,8 9 while this pair of fresh meta-analyses disagrees about steroids in sepsis10 11). Both pairs of dueling meta-analyses were published almost simultaneously, yet reached opposite conclusions! These examples disprove the replicability of meta-analysis: if meta-analysis were a reliable tool then different groups should always obtain the same result. And of course, anything that isn’t replicable isn’t scientific.
The inconvenient truth is that meta-analyses suffer from a host of problems that are poorly appreciated (e.g. fragility, heterogeneity, obsolete studies, underpowering). The statistical complexity of meta-analysis baffles us into submission, because we lack any tools to cut below the surface of the analysis.
It’s time to stop considering meta-analyses as the highest form of evidence. The highest form of evidence is simply an RCT. We would be best served by carefully reading RCTs and then reaching an over-arching conclusion from a thoughtful integration of the RCTs themselves (based on the weaknesses, strengths, methodologies, results, and patient populations of each study). Integrating studies ourselves is harder than delegating this task to a meta-analysis, but ultimately will get us closer to the truth.
- Meta-analyses are widely assumed to be robust, without any attempt to test their fragility.
- This post describes two techniques to evaluate the fragility of a meta-analysis. These techniques do require a meta-analysis program, but this can be freely obtained via the Cochrane Community.
- The fragility index can be estimated by determining how many patient outcomes in each study need to be changed until the p-value is no longer <0.05. This number will vary somewhat between studies, but it still provides a general concept of the fragility of the meta-analysis.
- The metafragility index is the number of studies which, when individually removed from the analysis, cause the meta-analysis to lose statistical significance. Ideally, a robust meta-analysis should have a metafragility index of zero (any study could be removed and the remaining studies would still yield a positive meta-analysis). Alternatively, if the metafragility index is above zero then the overall meta-analysis is entirely dependent on one or more studies – so the conclusions should be interpreted cautiously.
- These tests are applied to the IOTA meta-analysis, a widely acclaimed meta-analysis recently published in the Lancet. IOTA has an eye-opening amount of fragility. Even the most robust outcome (in-hospital mortality) has a fragility index of <10 patients and a metafragility index of two. One of the trials which IOTA depends on is a small, outlying study which was never published.
- The dogma that meta-analyses are the most reliable form of evidence needs to be challenged.
- FOAMed on IOTA
- Related posts on methodology
- RevMan5 Download with thanks to the Cochrane Community for developing this software and making it freely available.