#### Introduction: the IOTA trial

You’ve probably heard of the IOTA trial by now.^{1} It’s a meta-analysis in Lancet comparing the effect of conservative versus liberal oxygen therapy on survival in critical illness (details here). Its conclusions are bold and exciting:

Reviews of the study were uniformly positive. The accompanying editorial argued that the study mandates immediate practice change:^{2}

I agree with the study’s results (hyperoxia is aphysiologic and probably dangerous). However, we need to recognize that this is a very fragile analysis. This should open our eyes to problems endemic in meta-analyses.

### Patient-level fragility: The fragility index

The *fragility index* is the number of outcomes which would need to be changed in order for a study to lose statistical significance (causing the *p*-value to rise above 0.05).^{3} A low fragility index indicates a lack of robustness. For example, the NINDS trial has a fragility index of three.^{4} If three outcomes had been different, the results would have lost statistical significance and the trial would have been regarded as a “negative” study.
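To make the definition concrete, here is a minimal Python sketch for a single two-arm trial with a binary endpoint. It assumes the significance test is a two-sided Fisher exact test and that outcomes are flipped (non-event to event) in the arm with fewer events, which is one common convention; other calculators may differ. The function names and the example counts are hypothetical and are not taken from NINDS or any other real trial.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x events in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Outcomes to flip (non-event -> event) in the arm with fewer events
    before the two-sided p-value is no longer below alpha.
    Returns 0 if the trial is not significant to begin with."""
    if events_1 > events_2:
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    flips = 0
    while events_1 + flips <= n_1:
        e = events_1 + flips
        p = fisher_exact_two_sided(e, n_1 - e, events_2, n_2 - events_2)
        if p >= alpha:
            return flips
        flips += 1
    return None  # significance never lost (not expected in practice)


# Hypothetical trial: 10/100 events versus 25/100 events.
fi = fragility_index(10, 100, 25, 100)
print("fragility index:", fi)
```

The loop stops at the first flip count where *p* is no longer below 0.05, so by construction one fewer flip would still leave the trial “positive.”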

The fragility index is easy to calculate for a study with a binary endpoint (e.g. using a fragility index calculator here). However, calculating the fragility index for a meta-analysis is a bit murky. To my knowledge it hasn’t been done. But yeah, we’re gonna go there.

The first step to test the fragility of a meta-analysis is to replicate the calculations involved. The IOTA trial used RevMan 5.3, which is freely available software produced by the Cochrane Community. Following methods described in the paper, I reproduced their calculations on my computer (screenshot below). Every number matches the published manuscript exactly, except for a typo in the Lancet.

Now the fun starts. Let’s suppose that *three* additional patients in Young 2014 had died. This causes the whole meta-analysis to become negative:

The fragility varies between studies. For each individual study, I increased the number of deaths in the conservative oxygen group until the *p*-value was no longer <0.05:

So overall, it would probably take about 5-9 patients with a different outcome to make the meta-analysis negative. These flipped outcomes could occur within a single study, or across multiple studies. For example, adding one bad outcome to the conservative oxygen group in the first nine studies would nullify the meta-analysis:

Different outcomes among patients in the liberal oxygen group could also affect the outcome. For example, the study could be nullified by three fewer deaths in the liberal oxygen group plus four additional deaths in the conservative oxygen group:

It’s difficult to put an exact number on the fragility index, but it’s apparently <10. Patients whose outcomes could be changed to nullify the study include patients in the conservative group who survived (7857 in total) and patients in the liberal group who died (283 total) – for a grand total of 8,142 patients. It’s easy to imagine that technical problems (e.g. protocol violations) could cause a shift of <10 outcomes among 8,142 patients.
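The procedure described above (add deaths to the conservative arm of one study, re-pool, repeat until *p* is no longer <0.05) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of the RevMan analysis: it assumes a DerSimonian-Laird random-effects model on the log relative risk with a normal-approximation *p*-value, and the trial counts in `TRIALS` are hypothetical, not IOTA data.

```python
from math import erfc, exp, log, sqrt


def log_rr(e1, n1, e2, n2):
    """Log relative risk and its variance for one trial.
    Adds 0.5 to every cell if either arm has zero events (an assumption)."""
    if e1 == 0 or e2 == 0:
        e1, e2, n1, n2 = e1 + 0.5, e2 + 0.5, n1 + 1, n2 + 1
    y = log((e1 / n1) / (e2 / n2))
    v = 1 / e1 - 1 / n1 + 1 / e2 - 1 / n2
    return y, v


def pool_random_effects(trials):
    """DerSimonian-Laird pooling.
    Returns (pooled RR, 95% CI low, 95% CI high, two-sided p)."""
    ys, vs = zip(*(log_rr(*t) for t in trials))
    w = [1 / v for v in vs]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero.
    tau2 = max(0.0, (q - (len(trials) - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in vs]
    mu = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = sqrt(1 / sum(w_re))
    p = erfc(abs(mu / se) / sqrt(2))
    return exp(mu), exp(mu - 1.96 * se), exp(mu + 1.96 * se), p


def deaths_to_nullify(trials, index, alpha=0.05):
    """Extra deaths in the first arm of trials[index] needed before the
    pooled p-value is no longer below alpha (None if never reached)."""
    trials = list(trials)
    added = 0
    while pool_random_effects(trials)[3] < alpha:
        e1, n1, e2, n2 = trials[index]
        if e1 >= n1:
            return None
        trials[index] = (e1 + 1, n1, e2, n2)
        added += 1
    return added


# Hypothetical trials: (conservative deaths, n, liberal deaths, n).
TRIALS = [(20, 200, 30, 200), (5, 50, 9, 50),
          (40, 400, 52, 400), (8, 100, 15, 100)]

for i in range(len(TRIALS)):
    print(f"study {i}: {deaths_to_nullify(TRIALS, i)} extra deaths nullify the pooled result")
```

Because each study carries a different weight, the count will generally differ from study to study, which is why the patient-level fragility of a meta-analysis is better described as a range than as a single number.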

When we read a meta-analysis containing thousands of patients, we naturally *assume* that increasing the number of patients involved will increase the robustness of the analysis. Unfortunately, this often isn’t true. If positive and negative studies mostly cancel each other out, then the fragility index of the meta-analysis remains low. Piling on more studies increases the “noise” without increasing the “signal” – with the ultimate result being a huge meta-analysis with a relatively low fragility index (low signal/noise ratio).

### Study level fragility: The metafragility index

Another way to gauge fragility is at the level of individual *studies* which compose the meta-analysis. A meta-analysis is essentially a method of weighing the combined evidence from numerous studies regarding a therapy:

A robustly positive meta-analysis should include lots of studies which are strongly positive. As such, you could remove any individual study and it wouldn’t affect the overall conclusion of the meta-analysis:

Alternatively, let’s imagine a very *weakly* positive meta-analysis shown below. Each of the five positive studies only weakly supports the intervention. Combined, these five weak studies are able to tip the balance in favor of the intervention. However, if we remove *any* of these studies, then the meta-analysis is no longer positive! The validity of the meta-analysis depends on the validity of every single study – making the meta-analysis extremely fragile.

This leads to what I will define as the *metafragility index*:

- The overall meta-analysis results are re-calculated in the absence of each individual study, to determine if the meta-analysis depends on that study to achieve *p*<0.05.
- The *metafragility index* equals the number of studies which, when removed individually, cause the meta-analysis to have a negative result (*p*>0.05). For example, in the robust meta-analysis shown above, the metafragility index is zero (any study can be removed and the meta-analysis remains positive). Alternatively, in the fragile meta-analysis shown above, the metafragility index is five (if any of five studies are removed, the *p*-value increases >0.05).
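The two steps above amount to a leave-one-out loop. A minimal Python sketch follows, again assuming DerSimonian-Laird random-effects pooling of log relative risks (RevMan offers several models, so this is not necessarily the exact calculation used here). The trial names and counts are hypothetical.

```python
from math import erfc, log, sqrt


def pooled_p(trials):
    """Two-sided p-value for the pooled log relative risk
    (DerSimonian-Laird random effects; assumes no zero-event arms)."""
    ys = [log((e1 / n1) / (e2 / n2)) for e1, n1, e2, n2 in trials]
    vs = [1 / e1 - 1 / n1 + 1 / e2 - 1 / n2 for e1, n1, e2, n2 in trials]
    w = [1 / v for v in vs]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (len(trials) - 1)) / c) if c > 0 else 0.0
    w_re = [1 / (v + tau2) for v in vs]
    mu = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = sqrt(1 / sum(w_re))
    return erfc(abs(mu / se) / sqrt(2))


def metafragility(trials, alpha=0.05):
    """Names of studies whose individual removal makes the pooled
    p-value rise to alpha or above. The index is the list's length."""
    dependent = []
    for name in trials:
        rest = [t for other, t in trials.items() if other != name]
        if pooled_p(rest) >= alpha:
            dependent.append(name)
    return dependent


# Hypothetical trials: (treatment deaths, n, control deaths, n).
TRIALS = {
    "Trial A": (20, 200, 30, 200),
    "Trial B": (5, 50, 9, 50),
    "Trial C": (40, 400, 52, 400),
}

dependent = metafragility(TRIALS)
print("metafragility index:", len(dependent))
print("meta-analysis depends on:", dependent)
```

Returning the names rather than just the count matters, because the next question is always *which* studies the pooled result leans on.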

A metafragility index greater than zero reveals hidden weakness in the meta-analysis. When we read a meta-analysis, we generally assume that “the whole is greater than the sum of the parts” – that is, the results of the meta-analysis reflect a *melding* of data from *all* of the studies. As such, we expect that the meta-analysis should be more robust than any individual study. If the metafragility index is above zero, this disproves the notion that the meta-analysis is stronger than any individual study.

“When the removal of just one or two studies has a considerable impact on the conclusions from a meta-analysis, then these conclusions must be stated with some caution” – Viechtbauer W & Cheung MWL^{5}

If the metafragility index is over zero, the next question is *which* studies the meta-analysis is dependent upon. A chain is only as strong as its weakest link. Likewise, the meta-analysis is only as robust as the *weakest* study upon which it depends. For example, if a meta-analysis is dependent on a large, robust, multi-center RCT – that might be acceptable. Alternatively, if the meta-analysis is dependent on a small and poorly executed trial, that’s deeply problematic.

IOTA has a metafragility index of *two*, being dependent on NCT00414726 and Girardis 2016.^{6} For example, if we remove the Girardis trial, the overall meta-analysis becomes negative (*p*=0.14):

A nice way to visualize this data is a metafragility plot (see below). Blue lines constitute a traditional forest plot (each representing the 95% confidence interval of the relative risk based on data from that individual trial). Underneath each study is a black line which shows the results of performing a meta-analysis using *all* the trials *except* for that specific trial:

The metafragility plot helps illustrate the effect that each study has on the overall meta-analysis. Removal of any study would cause the meta-analysis to shift in the *opposite* direction. For example, Stub 2012 was a relatively left-lying study (figure below). *Removal* of this study shifts the resulting meta-analysis to the *right*. The amount of rightward shift when this study is removed reflects how much impact this study has on the overall meta-analysis.^{7}

The plot also functions as a graphical display of the metafragility index. If a trial is required for the meta-analysis to be positive, then the meta-analysis performed without it (the black line) will have a 95% confidence interval that crosses one. For IOTA, which has a metafragility index of two, this occurs twice:

A study can have a high impact on a meta-analysis for one of two reasons: it may be a very large trial (with robust results and a large weight), or it may be an outlier. IOTA depends on the NCT00414726 trial, which is a *small*, *outlying* trial. NCT00414726 isn’t a robust trial, but because it is an outlier it exerts a strong pull on the meta-analysis results. This study wasn’t ever published, raising major concerns about systemic bias or other flaws. In fact, the published IOTA manuscript judged this study to be at *high risk of bias* (see quoted text below). The IOTA meta-analysis is quite fragile because it depends entirely on this trial which is small, outlying, unpublished, and potentially biased.

### What about longer-term mortality?

This post focuses on in-hospital mortality, because that is the analysis which the authors focused on in the paper. It also happens to be the most impressive result of the trial. Over time, the mortality benefit rapidly evaporates:

Over time the effect size decreases, causing the fragility to balloon. At longest follow-up, the mortality difference has a patient-level fragility index of merely *one patient*:

The metafragility index is 10, which is basically the number of positive-leaning studies. Thus, removal of nearly any positive data from the meta-analysis will cause a loss of statistical significance. This reveals profound fragility, which is consistent with a patient-level fragility index of one. Overall this result is as fragile as is statistically possible (if it were any weaker, it would become flat-out negative).

### One step back: *p*<0.05 is actually a low bar to clear

The above analysis focuses on whether or not the meta-analysis could reach a *p*-value <0.05. However, it should be borne in mind that this isn’t a robust *p*-value cutoff. A *p*-value of ~0.02-0.05 is roughly equivalent to a likelihood ratio of ~3 and thus constitutes a *weak* level of evidence. This has led many statisticians to recommend lowering the *p*-value cutoff to <0.005. Although *p*<0.05 currently remains the conventional cutoff for “statistical significance,” meeting this criterion is no guarantee that the study results are valid.
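One way to get a feel for these numbers is the calibration proposed by Sellke, Bayarri and Berger, which bounds the odds a given *p*-value can provide against the null hypothesis at 1/(−e·p·ln p). This is offered as an illustration only; it is an assumption on my part that this bound is a fair stand-in, and it is not necessarily the calculation behind the likelihood ratio quoted above. The function name is hypothetical.

```python
from math import e, log


def max_odds_against_null(p):
    """Upper bound on the odds favouring the alternative over the null,
    1 / (-e * p * ln p), which applies only for 0 < p < 1/e."""
    if not 0 < p < 1 / e:
        raise ValueError("bound only applies for 0 < p < 1/e")
    return 1 / (-e * p * log(p))


for p in (0.05, 0.02, 0.005):
    print(f"p = {p}: at most about {max_odds_against_null(p):.1f} to 1 against the null")
```

Under this bound, *p*=0.05 corresponds to odds of at most roughly 2.5 to 1 and *p*=0.005 to roughly 14 to 1, which is in the same spirit as the argument for a stricter cutoff.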

### Bigger picture: Our overblown opinion of meta-analyses

Evidence-based medicine traditionally teaches that meta-analysis is the highest form of evidence, sitting atop the pyramid of evidence. However, it’s easy to find meta-analyses which contradict each other (e.g. this pair disagrees about PE,^{8} ^{9} while this pair of fresh meta-analyses disagrees about steroids in sepsis^{10} ^{11}). Both pairs of *dueling meta-analyses* were published almost simultaneously, yet reached opposite conclusions! These examples disprove the *replicability* of meta-analysis: if meta-analysis were a reliable tool then different groups should always obtain the same result. And of course, anything that isn’t replicable isn’t scientific.

The inconvenient truth is that meta-analyses suffer from a host of problems that are poorly appreciated (e.g. fragility, heterogeneity, obsolete studies, underpowering). The statistical complexity of meta-analysis baffles us into submission, because we lack any tools to cut below the surface of the analysis.

It’s time to stop considering meta-analyses as the highest form of evidence. The highest form of evidence is simply an RCT. We would be best served by carefully reading RCTs and then reaching an over-arching conclusion from a thoughtful integration of the RCTs themselves (based on the weaknesses, strengths, methodologies, results, and patient populations of each study). Integrating studies ourselves is harder than delegating this task to a meta-analysis, but ultimately will get us closer to the truth.

- Meta-analyses are widely assumed to be robust, without any attempt to test their fragility.
- This post describes two techniques to evaluate the fragility of a meta-analysis. These techniques do require a meta-analysis program, but this can be freely obtained via the Cochrane Community.
- The *fragility index* can be estimated by determining how many patient outcomes in each study need to be changed until the *p*-value is no longer <0.05. This number will vary somewhat between studies, but it still provides a general concept of the fragility of the meta-analysis.
- The *metafragility index* is the number of studies which, when individually removed from the analysis, cause the meta-analysis to lose statistical significance. Ideally, a robust meta-analysis should have a metafragility index of zero (any study could be removed and the remaining studies would still yield a positive meta-analysis). Alternatively, if the metafragility index is above zero then the overall meta-analysis is entirely dependent on one or more studies – so the conclusions should be interpreted cautiously.
- These tests are applied to the IOTA meta-analysis, a widely acclaimed meta-analysis recently published in the Lancet. IOTA has an eye-opening amount of fragility. Even the most robust outcome (in-hospital mortality) has a fragility index of <10 patients and a metafragility index of two. One of the trials which IOTA depends on is a small, outlying study which was never published.
- The dogma that meta-analyses are the most reliable form of evidence needs to be challenged.

##### Links

- FOAMed on IOTA
- Related posts on methodology
- Fragility index of NINDS & ECASS-III
- .05 isn’t a good p-value cutoff
- All PulmCrit methodology posts here

- RevMan5 Download with thanks to the Cochrane Community for developing this software and making it freely available.

*Lancet*. 2018;391(10131):1693-1705. [PubMed]

*Lancet*. 2018;391(10131):1640-1642. [PubMed]

*Crit Care Med*. 2016;44(7):1278-1284. [PubMed]

*Res Synth Methods*. 2010;1(2):112-125. [PubMed]

*JAMA*. 2016;316(15):1583-1589. [PubMed]

*JAMA*. 2014;311(23):2414-2421. [PubMed]

*J Thromb Haemost*. 2014;12(7):1086-1095. [PubMed]

*Crit Care Med*. 2018;46(9):1411-1420. [PubMed]

*Intensive Care Med*. 2018;44(7):1003-1016. [PubMed]

### Josh Farkas


Great post!

Chris

Of interest :

The Fragility Index: a P-value in sheep’s clothing? Rickey E. Carter, Paul M. McKie and Curtis B. Storlie

European Heart Journal (2017) 38, 346–348. Editorial. doi:10.1093/eurheartj/ehw495

Free pdf access at :

https://goo.gl/qCSRmX

Oops, I forgot: some argue that the Fragility Index doesn’t add to the p-values and confidence intervals. But while it intuitively makes sense to me, I don’t get the mathematical point. Maybe the FI just better illustrates the vanity of a “positive” p-value or a confidence interval.

Any time we obtain *any* sort of measurement in science, two things are important: 1) the measurement itself 2) the degree of error regarding the measurement. For example, suppose that I told you that liberal oxygen increased mortality with a relative risk of 1.21. Your response would probably be “what is the 95% confidence interval?”. The measurement 1.21 by itself has no meaning – it must be paired with some sort of measurement of random error in order for us to make sense of it. The fragility index is basically a measurement of the amount of random error in the p-value. It tells us how robust the p-value is. There are lots of problems with the p-value! I’ve discussed this in the past on the blog here (https://emcrit.org/pulmcrit/demystifying-the-p-value/). The fragility index provides some insight about how reproducible the p-value is (problem #4 in that prior blog). The fragility index doesn’t solve all problems with the p-value, or even come close to doing so. Nobody ever claimed that the fragility index would solve every problem in frequentist statistics. Furthermore, nobody ever claimed that the fragility index should replace the p-value or other statistics – instead it is best to carefully interpret…

Thanks Chris!

Chris

The fragility indices present a counterfactual of sorts, it seems. Converse to what you illustrate, why not take an M-A with p<0.07 and ask what would need to shift to produce a p<0.05. Yes, a p-value is an artificial construct, but a priori if we run the analysis as proposed, the results are what they are, no?

Also, I recall seeing a paper a decade ago along the lines of Meta-Analysis, Shmeta-Analysis. I didn't know it supplanted RCTs at the top of the pyramid.

https://www.ncbi.nlm.nih.gov/pubmed/7977286

Brad

1) yes, you could absolutely use a meta-fragility index to test the robustness of a negative result (e.g. p=0.07). I think that would be a great idea. What it would likely show is that the meta-analysis, rather than being “negative” is actually “non-informative.”

2) the results are what they are, the fragility index just tells you how fragile they are. It’s not the be-all-end-all, but it gives you a concept of how reproducible the results are. Any result in science is fairly meaningless in the absence of an associated measurement of random variation/precision (e.g. fragility index, 95% confidence interval, or standard deviation).

Sorry, Josh. Got your name wrong. Grabbed wrong name above mine from the thread.

Brad

Bravo Josh. As always, great post!

It appears that the “fragility index” is basically the same as the Fail Safe N statistic, a fundamentally flawed technique that paradoxically suggests less fragility/bias the more bias there is in a meta-analysis: https://crystalprisonzone.blogspot.com/2016/07/the-failure-of-fail-safe-n.html Whether a meta-analysis will be robust to new studies or contexts is indicated by the confidence interval (how much second order sampling error is there in the mean effect—where might the mean move if we added more studies) and the credibility interval (how much heterogeneity is there in the effect—how much might the treatment effect vary across patient populations, settings, or other factors). Not all meta-analyses can draw end-all conclusions about a treatment, that’s true. A meta-analysis based on a small number of studies with few total patients and a lot of observed heterogeneity provides a noisy estimate and needs more studies before firm conclusions can be drawn. A lot of meta-analysis authors fail to adequately temper their conclusions based on the uncertainty in their results. But meta-analyses remain the single best method for integrating research findings. The best way to determine that more studies are needed is to meta-analyze the existing studies and observe wide confidence and credibility intervals. Narrative reviews of studies cannot easily…

1) Fragility & meta-fragility are quite different from fail-safe N. Meta-fragility explores the effect of *removing* data from each individual study, whereas fail-safe N evaluates the effect of *adding* null data. Meta-fragility has the ability to probe the effect of each study on the overall result, which is something that fail-safe N is entirely incapable of doing. Meta-fragility addresses a real-world question (what if this study hadn’t quite met inclusion to the meta-analysis?), whereas fail-safe N is more of a theoretical construction. 2) Confidence intervals/p-values are related to fragility, but they’re not the same. Probably the best illustration of this is Ridgeon’s paper in Critical Care Medicine 2016 (https://www.ncbi.nlm.nih.gov/pubmed/26963326). See figure 3. There is a definite relationship between fragility and p-value but for individual studies this doesn’t always work well. Thus fragility probably adds *some* additional information on top of traditional statistics (p-values and confidence intervals). Although I agree that an experienced statistician could probably make a fair guess about the fragility of a study simply based on the confidence intervals. 3) I don’t know about psychology, but in critical care the amount of high-quality evidence is generally low enough that a narrative review is fine. In order to perform…