PulmCrit - It’s insane to keep using mortality as a primary endpoint in critical care trials

There, I’ve said it. That’s a bit of a bold statement, but it seems to be supported by the evidence.

failure to prove mortality benefit

A post in 2018 explored the difficulty of proving mortality benefit from any intervention. To summarize, there are many barriers to proving all-cause mortality benefit:

Mortality is decreasing over time (making it increasingly difficult to recruit an adequate sample size of dying patients). A quintessential theme in RCTs is that the observed mortality is lower than the predicted mortality, leading to underpowering.
Most patients are unlikely to see any change in mortality (most patients are either likely-to-die, or likely-to-survive at baseline; only a few patients are truly hanging in the balance).
Patients die for numerous reasons (many deaths will be totally unrelated to the intervention being tested).
We are desperately trying to keep patients alive (if therapies fail to cause clinical improvement, clinicians will step in and pick up the slack).
The intervention is delivered too late to affect outcomes (early intervention is generally believed to be important, but is often impossible within the confines of an RCT).
Many conditions are too rare to study (critically ill patients are uncommon to begin with, so uncommon conditions within that cohort become impossibly difficult to study).
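The underpowering problem from the first point can be made concrete with a rough calculation. The sketch below uses a normal approximation for comparing two proportions; the trial size (500 per arm), the 20% relative risk reduction, and the mortality rates are all hypothetical numbers chosen for illustration, not figures from any real trial.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se = sqrt(p_control * (1 - p_control) / n_per_arm
              + p_treated * (1 - p_treated) / n_per_arm)
    # Chance that the observed difference clears the critical value
    # (the tiny chance of significance in the wrong direction is ignored).
    return z.cdf(abs(p_control - p_treated) / se - z_crit)


# Hypothetical trial: same size and same relative benefit throughout,
# but control-arm mortality turns out lower than the planners predicted.
n_per_arm = 500
relative_reduction = 0.20
for control_mortality in (0.40, 0.30, 0.20):
    treated = control_mortality * (1 - relative_reduction)
    power = power_two_proportions(control_mortality, treated, n_per_arm)
    print(f"control mortality {control_mortality:.0%}: power ~ {power:.0%}")
```

Under these assumptions, a trial planned around 40% mortality loses a large share of its power if the observed mortality is 20%, even though the treatment works exactly as well in relative terms.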

In that post, I listed seven pharmacologic interventions proven to have mortality benefit. That was notably very few – but it still held onto a bit of promise. A fresh study by Santacruz et al. suggests that there are actually no pharmacological interventions which have shown mortality benefit in critical care multi-center RCTs.¹ Zero. Two things explain the discrepancy:

Many of the studies listed in my prior post weren’t truly critical care studies. For example, the post included a study of clopidogrel in 45,852 patients with MI – that’s more of a cardiology trial than an ICU trial.
Santacruz et al. consider only multi-center RCTs. These studies are less likely to produce spurious results than single-center RCTs.

The difference between seven and zero in some ways is infinitely large. Let’s take a closer look at this study:

interventions that improved mortality

In the entire medical literature to date, Santacruz found only 27 multi-center RCTs of critically ill patients reporting mortality benefit.

Of these, a majority (14/27) pertained to the management of noninvasive or invasive ventilation. So, it might be that how we ventilate patients matters a whole lot. However, I have a sneaking suspicion that this reflects a lack of blinding (e.g. you can’t blind clinicians to whether the patient is prone). Lack of blinding could have affected other processes of care (e.g. the level of attention that clinicians were providing), leading to bias.


If we focus solely on blinded RCTs of pharmacologic interventions, this leaves us with only about 10 positive RCTs. The authors argued that in all of these cases, the treatment studied had been disproven or had fallen out of favor (e.g. one of these studies was Activated Protein C for sepsis – may it rest in peace). This ultimately leaves us with zero positive RCTs on pharmacologic therapies which have withstood the test of time.

I might quibble a bit about two of the studies involving hydrocortisone. Unlike most of the treatments investigated, steroid remains a useful therapy which is widely used. Steroid might have a mortality benefit which was correctly revealed in the APROCCHSS trial, yet not detectable by the ADRENAL trial as discussed here. But this is ultimately quibbling – we could all likely agree that there is no robustly reproducible mortality signal from steroid.

So that’s really quite bleak. In the entire history of critical care research, no medication has ever been found by MC-RCTs to have a durable mortality benefit. Yikes.

interventions that increased mortality

Fewer trials (16) were reported showing an increase in mortality. This lower volume of studies showing harm might suggest publication bias. Again, a majority of these trials are nonblinded. Only seven of these studies involved medications.

Using a two-sided p-value cutoff of 0.05 will yield a 2.5% false-benefit rate and a 2.5% false-harm rate, due purely to chance. Since this study involved a total of 212 trials, about 5 (212 × 0.025 ≈ 5.3) would be expected to show increased mortality due purely to bad luck. Therefore, many of the treatments which appeared to increase mortality were not actually harmful; they were just unlucky. Since these maligned therapies were subsequently abandoned, we’ll never know which ones they were.

overview of mortality endpoint in critical care MC-RCTs

The overall picture is remarkable:

80% of MC-RCTs reported no difference in mortality endpoints.
20% of MC-RCTs detected a difference in mortality (either increased or decreased). Of these studies, 58% were unblinded. This raises concern that lack of blinding might inflate the likelihood of detecting mortality differences.
5% of MC-RCTs are expected to report mortality differences due purely to chance (using a standard p-value cutoff of <0.05). These spurious studies likely constitute a quarter of studies which were believed to reveal mortality differences!
Zero medical therapies have ever been found to reduce mortality, in a robustly reproducible fashion.
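The arithmetic behind these percentages is simple enough to check directly. The sketch below uses the trial counts quoted in this post (212 trials, 27 showing benefit, 16 showing harm) and assumes a two-sided cutoff of 0.05 with no true effect in any trial, which is a deliberate worst-case simplification.

```python
# Counts quoted in this post from Santacruz et al.
total_trials = 212
benefit_trials = 27
harm_trials = 16

# Assumption: two-sided alpha of 0.05, and every trial tests a null treatment.
alpha = 0.05
expected_false_harm = total_trials * alpha / 2
expected_false_benefit = total_trials * alpha / 2

observed_with_difference = benefit_trials + harm_trials
share_with_difference = observed_with_difference / total_trials
chance_share = (expected_false_harm + expected_false_benefit) / observed_with_difference

print(f"expected false-harm trials:    {expected_false_harm:.1f}")     # 5.3
print(f"expected false-benefit trials: {expected_false_benefit:.1f}")  # 5.3
print(f"trials reporting a difference: {share_with_difference:.0%}")   # 20%
print(f"share attributable to chance:  {chance_share:.0%}")            # 25%
```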

Overall, this demonstrates that mortality endpoints are an awful tool for selecting good therapies. It’s likely that many beneficial treatments were overlooked using this insensitive endpoint. Furthermore, many treatments were falsely maligned or falsely believed to be beneficial (e.g. Activated Protein C). No “true-positive” treatment which reproducibly improves mortality ever emerged from this morass.

finding mortality benefit is getting even harder

And it gets even worse. If we consider the reasons that it’s difficult to prove mortality benefit, several of these are growing over time. For example:

Baseline mortality rate is steadily decreasing, due to ongoing improvements in critical care. The lower the baseline mortality rate is, the harder it is to demonstrate benefit from any therapy (it’s hard to improve on care which is already extremely good). This also makes it increasingly difficult to recruit enough patients to capture a substantial number of dying patients.
Patients die for numerous reasons, often unrelated to any specific disease process. Over time, the patients that we are managing are becoming increasingly complex and elderly, with numerous comorbidities. These patients are often dying from overall frailty, rather than a specific disease. Disease-oriented therapies cannot prevent these deaths.

Mechanical ventilation may illustrate this a bit. The classic ARMA trial showed a mortality benefit from using tidal volumes of 6 cc/kg compared to 12 cc/kg.² Subsequently, ventilation standards improved and mortality decreased. Today, it would be vastly harder to design any trial proving a further mortality reduction.

This raises the possibility that we will never see any MC-RCT of any medical therapy in critical care medicine which reliably improves mortality. With each passing year, the likelihood of this occurring decreases.

it’s time to give up on mortality as a primary endpoint

Mortality will always be an important endpoint to consider (regardless of whether it’s a primary or secondary endpoint). Mortality signals will arise from time to time, so we should certainly pay attention to them. However, it seems untenable to insist on finding a mortality improvement. The traditional strategy of chasing a mortality endpoint with a p-value cutoff of <0.05 has failed us for decades. In the future, it will probably fail even worse.

Over time, the likelihood of affecting mortality will decrease (thus, decreasing the “signal” observed in mortality outcomes). Meanwhile, the likelihood of observing a change in mortality due to pure chance remains fixed at 5% (“noise”). This will decrease the signal/noise ratio in mortality outcomes. We may eventually reach a point where most observed differences in mortality are due purely to chance (noise > signal).
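The signal/noise argument can be sketched numerically. The function below estimates what fraction of statistically significant mortality differences reflect a real effect, given the proportion of tested therapies that truly affect mortality. The 50% power figure and the list of prior probabilities are hypothetical values picked for illustration.

```python
def share_of_positives_that_are_real(prior, power, alpha=0.05):
    """Fraction of significant results that reflect a true effect.

    prior: proportion of tested therapies that truly change mortality
    power: chance a trial detects a true effect (hypothetical value below)
    alpha: chance a trial of a null therapy shows a difference anyway
    """
    true_positives = prior * power           # "signal"
    false_positives = (1 - prior) * alpha    # "noise"
    return true_positives / (true_positives + false_positives)


# Hypothetical: power fixed at 50%, while fewer and fewer therapies truly work.
for prior in (0.20, 0.10, 0.05, 0.02):
    share = share_of_positives_that_are_real(prior, power=0.50)
    print(f"{prior:.0%} of therapies truly work: {share:.0%} of positive trials are real")
```

With these illustrative inputs, noise overtakes signal once fewer than roughly one in ten tested therapies truly affects mortality.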

approaches to rectify the situation

#1 use non-mortality primary endpoints

The alternative to mortality endpoints is choosing an endpoint more proximal to the intervention (e.g. incidence of kidney failure, ventilator-free days, delirium rate, length of stay, etc.). Due to the higher incidence of these events, it is often easier to prove causality regarding more proximal endpoints.

There is one major problem with a non-mortality primary endpoint, however, that deserves special attention. Historically, multi-center critical care studies have often unfolded as follows:

The study is designed with a primary mortality endpoint. This requires recruiting a lot of patients – so we get a very large and well-executed study.
The primary mortality endpoint fails (no mortality difference is found). However, the study is so large that it definitively evaluates non-mortality endpoints with a high degree of statistical robustness (e.g. obtaining p-values <0.005).
Thus, although the study is technically “negative,” a large, robust study ends up shining lots of light on the secondary endpoints.

If we transition to a non-mortality endpoint and power studies against a p-value of 0.05, this would be a formula for small and statistically flimsy trials. We would get more “positive” trials, but they would be small, statistically fragile, and potentially misleading. Overall, this would probably be worse than having large, neutral trials. Thus, it’s conceivable that the best strategy could be to use a non-mortality endpoint, yet power the trial with a target p-value of <0.005 (so that we end up with adequately powered, robust studies targeting endpoints that are achievable). The concept of targeting p<0.005 has been explored previously here.
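To get a feel for what a stricter threshold costs in patients, the sketch below compares the sample size needed at a two-sided threshold of 0.005 (the level of robustness mentioned earlier for secondary endpoints) against the usual 0.05. It relies on the standard normal-approximation result that required sample size scales with the square of the sum of two z-values; the 80% power target is an assumption.

```python
from statistics import NormalDist


def relative_sample_size(alpha_new, alpha_old=0.05, power=0.80):
    """Sample size at alpha_new relative to alpha_old, for the same effect size.

    Uses the normal approximation n ~ (z_{1-alpha/2} + z_{power})^2,
    so effect size and outcome variance cancel out of the ratio.
    """
    z = NormalDist()

    def z_total(alpha):
        return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)

    return (z_total(alpha_new) / z_total(alpha_old)) ** 2


ratio = relative_sample_size(0.005)
print(f"sample size at p<0.005 relative to p<0.05: {ratio:.2f}x")
```

Under these assumptions the stricter threshold requires roughly 70% more patients, which is far less than the inflation typically needed to chase a mortality endpoint.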

#2: re-interpret multi-center RCTs that have a mortality primary endpoint

Realistically, many multi-center RCTs will continue to be designed against mortality endpoints, for a variety of reasons (e.g. funding bodies often prefer a mortality endpoint). That’s fine; these are nonetheless extremely valuable studies. We just need to learn how to interpret them a bit better. First, we need to realize that while the mortality endpoint will generally be neutral, that doesn’t mean the intervention is worthless. Second, we should realize that these studies can still be useful by looking at the secondary endpoints (I’ve previously argued that if a trial is designed with an inappropriate primary endpoint, we can draw conclusions based on secondary endpoint(s)).

non-significant mortality trends

On the topic of mortality endpoints, another phenomenon bears discussion: non-significant mortality trends. For example, let’s look at mortality in the ESETT trial (an RCT comparing different second-line antiepileptic agents).³ There was no statistically significant difference in mortality, but there was a trend towards more deaths in patients receiving levetiracetam.

These differences don’t come anywhere close to statistical significance. Using a Fisher Exact test, the comparison of mortality rate between levetiracetam vs. fosphenytoin has a p-value of 0.4 and the comparison between levetiracetam vs. valproate has a p-value of 0.2. Especially given the presence of multiple comparisons, p-values in this range are what one would expect due to random chance.
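For readers who want to run this kind of comparison themselves, here is a minimal sketch of a two-sided Fisher exact test using only the Python standard library. The counts in the usage example are made up for illustration and are not the ESETT data; the function name is likewise just a placeholder.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows are treatment groups; columns are (died, survived).
    """
    row1, col1, col2 = a + b, a + c, b + d
    n = a + b + c + d

    def weight(x):
        # Unnormalized hypergeometric probability of x deaths in group 1.
        return comb(col1, x) * comb(col2, row1 - x)

    observed = weight(a)
    lowest, highest = max(0, row1 - col2), min(row1, col1)
    # Add up every possible table that is no more likely than the observed one.
    extreme = sum(weight(x) for x in range(lowest, highest + 1)
                  if weight(x) <= observed)
    return extreme / comb(n, row1)


# Hypothetical counts (NOT from ESETT): 6 deaths of 150 vs. 3 deaths of 150.
p_value = fisher_exact_two_sided(6, 144, 3, 147)
print(f"two-sided Fisher exact p-value: {p_value:.2f}")
```

Working with integer weights and dividing only at the end keeps the comparison between tables exact, which avoids floating-point ties being misclassified.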

Nonetheless, this difference made its way into the abstract. I think it’s misleading to mention this in the abstract, because it’s overwhelmingly likely to be spurious. On Twitter, several respondents were concerned about these differences, suggesting that they would avoid levetiracetam for this reason (despite levetiracetam’s being widely regarded as one of the safest antiepileptic agents).

Statistically speaking, mortality trends are extremely likely to represent statistical noise:

1. They’re often unexpected or unexplainable. This gives them a low pre-test probability of being valid. For example, there’s no prior evidence or mechanism to explain why levetiracetam should kill people. This makes the likelihood that levetiracetam kills people considerably lower than 2.5%.
2. If we use a cutoff of p <0.05, then the rate of finding a statistically significant increase in mortality due purely to chance is 2.5%. However, the likelihood of finding a trend towards mortality is much, much higher! This depends on exactly how much of a “trend” folks get worried about, but it’s surely >>>2.5% (e.g. p<0.2 in the above example).
3. Now, we must consider the likelihood that the mortality trend represents a true increase in mortality (<<2.5%, as discussed above in #1) versus the likelihood that the mortality trend represents a spurious finding (>>>2.5% as discussed above in #2). The conclusion is that the vast majority of mortality trends will represent spurious results.

Mortality is an important endpoint, so we shouldn't ignore mortality trends entirely. However, the vast majority of these will be spurious. Thus, we should generally not change practice due to them.

In the history of critical care medicine, no medical therapy has ever reproducibly been shown to improve mortality in multi-center RCTs. Some treatments have been incorrectly found to improve mortality (e.g. Activated Protein C). Other treatments have likely been incorrectly found to increase mortality (we don’t know which, because they were promptly abandoned). The standard practice of chasing mortality endpoints with MC-RCTs and a p-value target <0.05 has failed the test of time.
There are several reasons that demonstrating mortality benefit in critical care trials is nearly impossible. Over time, mortality rates are falling, which will make it even harder to demonstrate benefit from any treatment. Since the rate of statistical noise is fixed at 5% (using p<0.05), this leads to a situation where most observed effects on mortality represent random statistical noise.
Non-significant mortality trends are even more likely to be spurious. We should always pay thoughtful attention to mortality endpoints. However, if mortality trends don’t make sense in a larger context, then they’re likely spurious and shouldn’t affect management.

references

1. Santacruz C, Pereira A, Celis E, Vincent J. Which Multicenter Randomized Controlled Trials in Critical Care Medicine Have Shown Reduced Mortality? A Systematic Review. Crit Care Med. 2019;47(12):1680-1691. doi:10.1097/CCM.0000000000004000
2. Acute Respiratory Distress Syndrome Network, Brower R, Matthay M, et al. Ventilation with lower tidal volumes as compared with traditional tidal volumes for acute lung injury and the acute respiratory distress syndrome. N Engl J Med. 2000;342(18):1301-1308. doi:10.1056/NEJM200005043421801
3. Kapur J, Elm J, Chamberlain J, et al. Randomized Trial of Three Anticonvulsant Medications for Status Epilepticus. N Engl J Med. 2019;381(22):2103-2113. doi:10.1056/NEJMoa1905795