#### Introduction

It was interesting to see a neurologist defending TPA for stroke on Emergency Medicine Literature of Note this week. As always, the discussion rapidly focused on the NINDS trial, and whether it requires replication.

Medicine continues to be plagued by poorly reproducible studies. The storyline is familiar. First, a very positive study is released in a major medical journal, with great fanfare. This leads to widespread changes in practice. Decades later, it becomes clear that the study was incorrect.

The reason for these mistakes may relate in part to misconceptions regarding the *p*-value. It is often assumed that a low *p*-value indicates that the study is reproducible (i.e. a “real” result, not a fluke). Unfortunately this isn't true.

Recently a new tool was developed to help understand the reproducibility of clinical studies: the *fragility index*. This post will analyze the NINDS trial from the perspective of its fragility index.

### Fragility index

#### Introduction to the fragility index

The fragility index is a way to understand how statistically reproducible a study is. The index is equal to the number of patients who would need to have a different outcome in order to cause the *p*-value to increase above 0.05. For example, imagine that a small study was performed with a *p*-value of 0.048. This would be accepted as “statistically significant.” However, if only *one patient* experienced a different outcome, then the *p*-value would climb above 0.05. This study would then have a *fragility index* of one.

A low fragility index may help us to identify studies which are poorly reproducible, possibly occurring as a statistical fluke. Such studies are barely sneaking beneath the *p*=0.05 cutoff. If only a few patient outcomes had turned out differently, the study would have been negative.

#### Determining the fragility index of the NINDS trial

(For those not familiar with the NINDS trial, an overview is here.)

The main results of the NINDS trial are shown below. These are analyses of patient outcomes measured using four different scales. The data is presented in a somewhat repetitive fashion, which might create the misperception that there are four confirmatory data sets. However, there was only a *single* group of patients, being analyzed simultaneously using four different metrics (1).

From the manuscript and from more precise data available on the manufacturer's website, the number of patients in each subgroup can be calculated:

The fragility index is obtained by repeatedly calculating the *p*-value using a Fisher exact test, while successively moving one patient at a time in the control group from the “poor outcome” group into the “favorable outcome” group. The fragility index is the minimal number of patient outcomes which must be changed for the *p*-value to increase above 0.05:

The fragility index using most assessment scales is three (although based on the Modified Rankin Scale, the fragility index would be four). The fragility index for the *entire study* is three, because if three outcomes were changed then the study would have been negative based on the authors own criterion (the manuscript indicates that “we required that the results of all four outcome measures be similar”).

### Instability index

#### Introduction to the instability index

I will define the *instability index* as an estimate of the imprecision in the outcome variable (2):

Instability index = (Effects from baseline imbalance) + (Post-randomization crossover) + (Loss to follow-up)

This is the sum of three components:

- Effects on the outcome due to baseline imbalance between the control and intervention groups (e.g. from uneven randomization). It is impossible to know the precise effect that baseline differences will have, so this requires estimation.
- Post-randomization crossover: the number of patients who are crossed over after randomization, or are exposed to other significant protocol violations.
- The number of patients lost to follow-up.

The fragility index may be understood in clinical context by comparing it to the instability index (equations below). The fragility index describes how robust the statistical analysis is to changes in the study outcome. The instability index describes how much error there might be in the study outcome. When error (instability index) outweighs the robustness of the statistical analysis (fragility index), then it's possible that the study is positive due to a statistical fluke:

**{**Instability index > Fragility index**}** –> suggests poor reproducibility

**{**Instability index < Fragility index**}** –> suggests good reproducibility

To illustrate this concept, consider two imaginary studies which both have a fragility index of five:

- Study #1: This study compared the peak pressures resulting from two different modes of mechanical ventilation. Each patient was exposed to both ventilator modes in random order. There were no protocol violations or losses to follow-up. There were no randomization issues (each patient served as their own “control”).
- Study #2: This was an RCT whose outcome is mortality. Ten patients crossed over after randomization and 15 patients were lost to follow-up. Based on differences in baseline APACHE scores, it is estimated that 3 more patients in the experimental group would survive compared to the control group (regardless of any intervention).

Although both of these studies have the same fragility index, study #1 is more reproducible. It has a fragility index of 5, but an instability index of zero. This study is perfectly conducted with little room for the introduction of nonrandom error. In contrast, study #2 has a fragility index of 5, but an instability index of 33. This study is considerably less reproducible, because imprecision in the results could easily overcome the fragility index and change the outcome of the study.

#### The instability index of the NINDS trial

Instability index = (Effects from baseline imbalance) + (Post-randomization crossover) + (Loss to follow-up)

__Effects from baseline imbalance:__ Patients in the tPA group had less severe infarctions at baseline (with 6% fewer large-vessel infarctions and 5% more small-vessel infarctions compared to the placebo group). The probability of a good outcome is ~30% higher for patients with a small-vessel infarctions compared to a large-vessel infarctions (Schmitz 2016). Therefore, this baseline imbalance should cause the tPA group to have ~2.5 more patients with good outcome (3):

Estimated difference in good outcomes = (30%)(5% frequency difference)(~166 sample size)

Estimated difference in good outcomes = 2.5

__Post-randomization crossover__: The compliance with the protocol was good (93% of patients received a full dose of study medication), without any clear cross-over between groups. Thus, this term may be set equal to zero.

__Loss to follow-up__: Primary outcome measures were missing for four patients.

__Calculating the instability index__:

Instability index = (Effects from baseline imbalance) + (Post-randomization crossover) + (Loss to follow-up)

Instability index = 2.5 + 0 + 4 = 6.5

### Predicted reproducibility of the NINDS trial

Analysis of the NINDS trial predicts that it may be a poorly reproducible study. It has a fragility index of three, meaning that if only three patients in the control group had a better outcome then it would have been a negative study. This low fragility index by itself is concerning with regards to reproducibility.

The fragility index of three is even worse when considering baseline differences in stroke severity between the tPA and control groups. Patients in the control group had more severe strokes at baseline. This difference could translate into an effect size of nearly three, threatening the validity of the study.

Finally, data from four patients was lost upon follow-up. Taking all sources of imprecision into account, the study has an instability index which is greater than its fragility index of three. This makes it possible that the study isn't reproducible (but, rather, that the results were a statistical anomaly due in large part to unequal randomization).

### Actual reproducibility of the NINDS trial

The NINDS trial is inconsistent with the majority of similar studies regarding thrombolytics in ischemic stroke, most of which are neutral or demonstrate harm (reviewed here). This should come as little surprise. The most logical conclusion is that the bulk of studies were correct: there was no substantial benefit from alteplase in ischemic stroke. The NINDS trial is the outlier – a fragile study with poor reproducibility, which may reflect a statistical fluke.

- The NINDS trial has a fragility index of three, meaning that if three additional patients in the control group had a good outcome then the study would have been negative.
- A low fragility index suggests that the study may be poorly reproducible.
- Baseline differences between groups and data lost upon follow-up could easily have caused a shift in the outcome of three or more patients. Since this imprecision exceeds the fragility index, it is plausible that the results of this study were falsely positive due to a statistical fluke.

##### Related:

- Demystifying the p-value (see #4, reproducibility) (PulmCrit)
- tPA debate between Dr. Swaminathan and Dr. Jagoda (EMCrit)
- Ischemic stroke treatment archive (REBEL EM)
- tPA policy with Dr. Jerome Hoffman (FOAMcast)
- Thrombolysis for stroke (theNNT)
- The secret of NINDS (Skeptics Guide to EM)

Conflicts of Interest: I don't like intracranial hemorrhages.

##### Notes:

- In addition to these four scales, there is a fifth statistic (the “Global test” listed in the table) which is a conglomerate of the four different scales. For the purpose of calculating the fragility index, I will focus on the four individual scales and ignore the “Global test” summary statistic. The global test statistic is extremely complicated and beyond my ability to calculate. However, since this is a summary statistic of the four individual scales, it should reveal information that is consistent with evaluation of the scales themselves. Indeed, if the global test statistic was to give us a
*different*result compared to evaluation of the scales themselves, then that would call into question the validity of the global test statistic itself. - Yeah, I just made this up. The concept of comparing fragility index to patients lost at follow-up has been proposed by Paul Young here. This instability index is essentially an extension of this concept. There is probably a more clever way out there to do it. Of course, this has not been prospectively validated.
- This is only one way to approach this estimation, and of course this is merely an
*estimate*. I'm sure that there are more sophisticated approaches to estimating this, but ultimately this is an*estimate*regardless of how it is calculated.

Image credits: https://en.wikipedia.org/wiki/2013_Dar_es_Salaam_building_collapse

### Josh Farkas

#### Latest posts by Josh Farkas (see all)

- PulmCrit- Epinephrine vs. atropine for bradycardic periarrest - February 13, 2017
- PulmCrit- Six myths promoted by the new surviving sepsis guidelines - January 30, 2017
- PulmCrit- How to convert a VBG into an ABG - January 16, 2017

Scott Weingart says

just brilliant!!

Drew M says

When you say “The fragility index is the minimal number of patient outcomes which must be changed for the p-value to increase above 0.05” does this mean either a better outcome in the placebo group or a worse outcome in the intervention group? Also, I would like to add that I really enjoy your work, please keep it up.

Josh Farkas says

Thanks. Studies and calculators determine the fragility index by adding events to the group with a smaller number of events (whether that may be the control or experimental group).

You could arguably do it the other way around (i.e. subtracting events from the group with a higher number of events), an approach which makes sense and is probably equivalent. However you may as well stick to the standard approach (adding to the lower group).

This is how I calculated the Fragility index for this post:

– put the data into an online Fisher Exact calculator (e.g. http://graphpad.com/quickcalcs/contingency1.cfm)

– shift events to the group with a smaller number of events one at a time, calculating the p-value each time

– the number of events that must be changed to generate p > 0.05 is the fragility index

– double-check the results with the fragility calculator http://www.fragilityindex.com

You could of course just start off with the fragility calculator, but there is something strangely satisfying about iteratively calculating the p-values with the Fisher Exact calculator.

Elliott Ridgeon says

This is really interesting, and the idea of instability vs fragility is illuminating! Glad to see fragility catching people’s attention.

About p-values: if you have a look at our fragility paper, you’ll see that fragility and p-values are correlated, but not perfectly; there are studies with superficially ‘good’ p-values well below 0.05 that turn out to be very fragile indeed.

Finally, http://www.fragilityindex.com will calculate fragility for you!

Josh Farkas says

Thanks for commenting and also thanks for an excellent paper which I recommend everyone should read (for those who haven’t, the reference is here: http://www.ncbi.nlm.nih.gov/pubmed/26963326).

I absolutely agree that there is a relationship between p-values and fragility (figure 3 in your paper), but I don’t think its as close a relationship as most people believe. The missing component is probably sample size, as illustrated in your paper (Figure 2). Thus, if you have a study with a large sample size and a very small p-value then it’s fragility index is probably high. However, if the sample size is small, or if the p-value is marginal – all bets are off.

Thanks for creating the Fragility Calculator, it’s fantastic.

(Deep down, I would love to see medical statistics transition to a Bayesian system. Bayesian statistics might eliminate many of the problems with fragility and p-values which we are plagued with. However, from a pragmatic standpoint it is unlikely that this could happen in the near-future. So, if we’re stuck with frequentist statistics, then adding a fragility index to the p-value is a big step in the right direction.)

Ryan Radecki says

Fun thought experiment!

It would seem you could think about the Fragility Index almost as a 95% CI for the p-value.

My main issue with the FI is foundational in the sense it has married itself to the p-value, and inherits some of its flaws in representation. A better FI would take into account not only the underlying statistics of the trial at issue, but also the likelihood of the result in the context of the prior evidence (i.e., let’s get Bayes-y!). Plenty of trials may be small enough their results appear “fragile”, but if they are consistent with multiple other trial results, to characterize them as such could be misleading.

Josh Farkas says

Hi Ryan,

As usual I agree with you.

I agree that just because a trial is fragile certainly doesn’t mean that it should be ignored. Every trial must be considered within the context of prior clinical evidence, laboratory data, risks, benefits, alternative therapies, clinical experience, etc. The level of evidence required to support TPA in stroke is far greater than the level of evidence required to justify choosing one fluid over another fluid, or one diuretic over another diuretic.

I am looking into a methodology to bring Bayesian analysis onto the blog (easier said than done). The popularity of the fragility index is a promising sign that the p-value may be losing its stranglehold over medical statistics and thought. My opinion has been that the primacy of frequentist statistics is one of the greatest dogmas of modern science, one which eventually must be overturned. However, achieving this will require building bridges between Bayesian statisticians and clinicians which currently don’t exist.

Christopher Zammit says

Josh,

Interesting Post. Thanks for sharing! I can see why this might make folks consider the narrow margins of absolute difference that define what we consider to be statistically significant. But wouldn’t you argue that trials are designed to have fragile results? When one designs a trial, a sample size is selected that will produce statistically significant findings based on a hypothesized absolute difference in the primary outcome between the control and experimental groups. The trialist don’t select a sample size that will exceed a significant result. The fragility index can be argued to represent the difference between what the authors hypothesized to be the difference and the excess difference that was seen. Heck, we could calculate an “inverse fragility index” (the number of outcome differences that need to occur to turn a result from “insignificant” to “significant”).

I’m not sure we should be worried about a “fragile result”. If the result is fragile, it seems to just suggest that the researcher’s hypothesized treatment effect was close to what was produced in the experiment. If the results exceed the hypothesized difference (low fragility index), it would be more reasonable to assume that the “low fragility” was spurious and the true treatment effect is likely less than what was seen in the trial, assuming the pre-trial hypothesized differences were well informed. Further, I think we have to take into account that no trial is designed with “100% power”. Most trials are designed with a 5 to 20% probability that a lack of difference between the groups will be the result of under-powering, and not a proven lack of “significant” treatment effect (i.e. a power of 80 to 95%).

Bottom line, I don’t think we should expect for efficient, well-planned, informed, and appropriately powered RCTs to have results that aren’t “fragile”. IMHO, if the results are not “fragile”, then we enrolled too many patients testing that hypothesis and those resources would be better served testing other important clinical questions.

Thanks again! I’m a huge fan of your blog. Very intriguing post!

Chris

Chris Zammit, MD

University of Rochester

Neurointensivist and Emergency Physician