**by Rory Spiegel, MD**

Absence of evidence is not evidence of absence.

Frequently, a p-value that fails to reach statistical significance (> 0.05) is interpreted as evidence against the benefits of the therapy in question. In actuality, all that statistical insignificance can truly indicate is a lack of evidence of benefit. The distinction is subtle but important, and conflating the two is a common logical fallacy.

Apneic oxygenation is a simple and elegant idea that was embraced by the Emergency Medicine community, despite minimal data supporting its efficacy in the ED. This was due in large part to its apparent safety, as well as its simplicity to implement. And yet it is not atypical that when we turn the harsh light of science towards a common practice, we are faced with answers that may displease us. The FELLOW Trial did just that. Published recently in *Am J Respir Crit Care Med*, the authors hoped to examine the efficacy of the revered practice of apneic oxygenation (1). Their results were universally negative, and yet so often in clinical medicine we have the tendency to confuse science for truth.

The FELLOW Trial was an open-label, pragmatically designed RCT performed in a single medical ICU. Semler et al randomized 150 critically ill adults who were undergoing emergent intubation to either standard intubation or standard intubation with the addition of apneic oxygenation (15 L/min via standard nasal cannula). Based on a negative primary endpoint, the authors concluded:

> Apneic oxygenation does not appear to increase lowest arterial oxygen saturation during endotracheal intubation of critically ill patients compared to usual care. These findings do not support routine use of apneic oxygenation during endotracheal intubation of critically ill adults (1).

### Clinical Validity

And yet the absence of evidence is not evidence of absence. A number of concerns potentially hurt the trial’s internal validity. Approximately 70% of the patients in both arms received either bag-valve mask ventilation or bi-level positive airway pressure (BPAP) during the apneic period, with only a small minority of patients (30%) truly left to their own ventilatory *devices*. This potentially diluted any benefit from apneic oxygenation, whose intent is to maintain oxygen saturation without requiring positive pressure ventilation. A number of wonderful posts have discussed the various non-random errors (biases) that may have distorted these results (see EMCrit and EMLITOFNOTE). In contrast, this post is intended as a reflection on random error and its potential effects in the FELLOW Trial.

### Statistical Validity

The question the FELLOW Trial sought to answer was, “Does apneic oxygenation prevent clinically important desaturation during the peri-intubation period?” The authors chose the lowest arterial oxygen saturation during intubation, a **continuous variable**, as their primary endpoint. In contrast to discrete variables (alive vs dead, mRS 1, 2, 3, etc), continuous data can take any value within a set range (blood pressure, age, oxygen saturation) (2). The most common way to assess continuous data is to measure the difference between means. Essentially, the mean of each cohort is obtained by adding up all the data points in the group and dividing by the number of observations. Typically, a simple t-test is used to assess the likelihood that the observed difference in means occurred through random errors of sampling. A comparison of means is a form of what is known as parametric testing. The major advantage of parametric testing is the statistical power it provides, allowing researchers to demonstrate statistically significant results even with small sample sizes (3).
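The mechanics above can be sketched in a few lines. This is a minimal illustration using invented lowest-saturation values (not the FELLOW data): compute each cohort’s mean, take the difference, and use a two-sample t-test to gauge how likely a difference that large is under sampling error alone.

```python
# Illustrative only: hand-made lowest-SpO2 values (%) for two hypothetical
# cohorts. These are NOT the FELLOW trial data.
from scipy import stats

apox = [92, 95, 88, 97, 90, 93, 96, 89, 94, 91]   # apneic oxygenation arm
usual = [90, 86, 93, 88, 85, 91, 87, 92, 84, 89]  # usual care arm

# The mean: sum of all data points divided by the number of observations.
mean_apox = sum(apox) / len(apox)
mean_usual = sum(usual) / len(usual)
diff = mean_apox - mean_usual

# Two-sample t-test: the probability that a difference in means at least
# this large arose from random sampling error alone, assuming the data are
# normally distributed (the parametric assumption).
t_stat, p_value = stats.ttest_ind(apox, usual)
print(f"difference in means = {diff:.1f}%, p = {p_value:.3f}")
```

Note the assumption baked into the test: the p-value is only trustworthy if the saturations in each arm are roughly normally distributed, which is exactly the assumption the next section examines.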

In order for parametric testing to be a valid representation of the population, a number of assumptions must be met. The data surrounding a sample’s mean must fall in a symmetrical distribution. It is not uncommon, especially in cohorts with a small sample size, that the resulting sample will not exhibit a normal distribution. In cases where the data are asymmetrically distributed, a comparison of means may fail to accurately describe the differences between the two populations of interest, and non-parametric testing is required. Non-parametric testing eliminates the need for a symmetric distribution by listing each data point in rank order; the quantitative differences between measurements become inconsequential. This allows statistical analysis to proceed without the assumption of a normal distribution. Despite these advantages, with this type of statistical manipulation the ability to quantify the specific magnitude of each data point is lost. Non-parametric testing is only capable of locating the position of each data point relative to the remainder of the cohort. With this loss of granularity comes a significant reduction in statistical power (3).
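The loss of magnitude information is easy to demonstrate. In this toy example (arbitrary numbers, not saturation data), the two “b” samples are numerically very different, but because every b value outranks every a value in both cases, the rank ordering, and therefore the Mann-Whitney U statistic, is identical.

```python
# Sketch of how rank-based (non-parametric) testing discards magnitude.
from scipy.stats import mannwhitneyu

a = [1, 2, 3, 4]
b_near = [5, 6, 7, 8]          # just above every value in a
b_far = [100, 200, 300, 400]   # far above every value in a

# Both comparisons produce the same rank structure (all of b above all of
# a), so the U statistic cannot distinguish a small shift from a huge one.
u_near = mannwhitneyu(a, b_near).statistic
u_far = mannwhitneyu(a, b_far).statistic
print(u_near == u_far)  # True
```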

Semler et al powered their study to detect a 5% difference in lowest oxygen saturation during intubation between groups. The authors found no difference in their primary endpoint, the median lowest arterial oxygen saturation during intubation: 92% in patients randomized to apneic oxygenation versus 90% in the usual care group (95% confidence interval for the difference, -1.6% to 7.4%; P = .16). Prior to data collection, the authors assumed a normal distribution and utilized a standard t-test when performing their power calculation. But in their published manuscript, the authors utilized a form of non-parametric testing, the Mann-Whitney U test, to analyze their primary endpoint, likely because on examining the data they found their results did not meet the assumptions required for a parametric analysis. In such cases even a direct comparison of the differences in median values can be misleading. Instead, the Mann-Whitney U test assessed the likelihood that the observed difference in rank distributions occurred by chance alone. Although methodologically appropriate, this shift from parametric to non-parametric analysis severely limited the trial’s ability to detect a clinically meaningful difference in its primary endpoint (4).

The question remains, “Is the lowest oxygen saturation observed during intubation truly a clinically important endpoint?” Such an endpoint inherently places value on higher oxygen saturation levels, making the assumption that an oxygen saturation of 52% is clinically preferable to a saturation of 35%. The accuracy of waveform pulse oximetry suffers once the oxygen saturation drops below 90% (5, 6). In these cases the patient’s true PaO2 can vary wildly from what is recorded on the monitor. As such it is hard to place a hierarchical value on pulse oximetry readings below 90%. To say that an oxygen saturation of 52% holds a greater clinical value than a value of 35% is inaccurate. To then further distill these data points into a rank order renders the data unusable.

A more clinically meaningful endpoint would be the rate of oxygen desaturation below a specific threshold associated with an increased likelihood of negative sequelae. The authors chose as their secondary endpoints the incidence of oxygen saturation less than 80% and less than 90%. In this case a continuous scale, oxygen saturation, is converted into a discrete, dichotomous outcome. The entirety of the data can now only fall above or below a specified cutoff. This arbitrary divide leads to a significant loss of granularity, as an oxygen saturation of 75% becomes no different from one of 32%; both simply fall below the 80% threshold. Although this binary partition may provide a more clinically relevant perspective, the conversion of a continuous variable to discrete data comes at the cost of statistical power (7).
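The power cost of dichotomisation can be sketched with a small Monte Carlo simulation. All numbers here are assumptions for illustration, not trial parameters: suppose lowest SpO2 is roughly Normal(82, 8) under usual care and Normal(85, 8) with the intervention, with 50 patients per arm and a desaturation cutoff of 80%. Analyzing the same simulated trials both ways shows the dichotomised comparison detects the true effect less often.

```python
# Monte Carlo sketch of the power lost by dichotomising a continuous
# endpoint (cf. Altman & Royston, ref 7). All distributional assumptions
# here are invented for illustration.
import random
from scipy.stats import ttest_ind, fisher_exact

random.seed(42)
N, SIMS, CUTOFF = 50, 500, 80.0

hits_cont = hits_dich = 0
for _ in range(SIMS):
    usual = [random.gauss(82, 8) for _ in range(N)]
    apox = [random.gauss(85, 8) for _ in range(N)]

    # Continuous analysis: t-test on the raw saturations.
    if ttest_ind(apox, usual).pvalue < 0.05:
        hits_cont += 1

    # Dichotomised analysis: only count desaturations below the cutoff.
    a_low = sum(x < CUTOFF for x in apox)
    u_low = sum(x < CUTOFF for x in usual)
    if fisher_exact([[a_low, N - a_low], [u_low, N - u_low]])[1] < 0.05:
        hits_dich += 1

# The fraction of simulated trials reaching significance estimates power.
print(f"power, continuous analysis:   {hits_cont / SIMS:.2f}")
print(f"power, dichotomised analysis: {hits_dich / SIMS:.2f}")
```

Under these assumed distributions the continuous analysis flags the true effect in roughly twice as many simulated trials as the dichotomised one, which is the trade the FELLOW authors accepted in exchange for clinical interpretability.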

The authors found the rate of saturation below 90% was 44.7% vs 47.2% (P = .87) in the apneic oxygenation and control groups respectively. The rate of oxygen saturation less than 80% was 15.8% vs 25.0% (P = .22) respectively. The rate of in-hospital mortality was 35.1% and 49.3% (P = .10) respectively. Essentially there was an approximately 10% absolute difference in the rate of clinically meaningful desaturation in patients randomized to apneic oxygenation vs control. Additionally there was an approximately 15% absolute reduction in in-hospital mortality in patients randomized to apneic oxygenation. Neither of these clinically important differences reached statistical significance, and they are mentioned here only to illustrate that this trial was severely underpowered to detect clinically important disparities in outcomes. Because the authors powered their study to detect a difference in a continuous variable, they concurrently hindered their ability to distinguish clinically important differences in dichotomous outcomes from statistical chance.
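A back-of-envelope calculation makes the underpowering concrete. Using the standard normal-approximation sample-size formula for comparing two proportions, with two-sided alpha of 0.05 and 80% power, detecting the observed desaturation rates of 15.8% vs 25.0% as a primary endpoint would require roughly 300 patients per arm. This is illustrative arithmetic on the published rates, not the trial's own power calculation.

```python
# Normal-approximation sample size per arm for comparing two proportions:
# n = (z_a * sqrt(2*p_bar*(1-p_bar)) + z_b * sqrt(p1*(1-p1) + p2*(1-p2)))^2
#     / (p1 - p2)^2
from math import sqrt

p1, p2 = 0.158, 0.250            # observed rates of SpO2 < 80% per arm
z_alpha, z_beta = 1.96, 0.8416   # two-sided alpha = 0.05, power = 0.80

p_bar = (p1 + p2) / 2
n_per_arm = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
             + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2)))**2 / (p1 - p2)**2

print(round(n_per_arm))  # ~300 per arm, i.e. roughly 600 patients total
```

Roughly 600 patients, against the 150 actually enrolled: a trial powered for a 5% difference in a continuous saturation cannot speak to a ~9% absolute difference in a dichotomous one.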

Trials can only answer the questions they were designed to ask. The FELLOW Trial was incapable of differentiating an approximately 9% absolute difference in the rate of desaturation below 80% from statistical chance. The authors’ use of a continuous variable as their primary endpoint and a post-hoc, non-parametric analysis severely limited the study’s statistical power. That power was damaged further by the fact that more than half the cohort received some form of positive pressure ventilatory support during the period in which they were intended to be left apneic. It is certainly fair to say that there is an absence of evidence supporting apneic oxygenation, but to hold the FELLOW Trial up as evidence of absence is premature to say the least.

#### Sources Cited:

1. Semler MW, Janz DR, Lentz RJ, et al. Randomized Trial of Apneic Oxygenation during Endotracheal Intubation of the Critically Ill. Am J Respir Crit Care Med. 2015.
2. Matthews JNS, Altman DG, Campbell MJ, Royston JP. Analysis of serial measurements in medical research. BMJ. 1990;300:230-5.
3. Sedgwick P. Parametric v non-parametric statistical tests. BMJ. 2012;344:e1753.
4. Hart A. Mann-Whitney test is not just a test of medians: differences in spread can be important. BMJ. 2001;323(7309):391-3.
5. Van de Louw A, Cracco C, Cerf C, Harf A, Duvaldestin P, Lemaire F, et al. Accuracy of pulse oximetry in the intensive care unit. Intensive Care Med. 2001;27:1606-13.
6. Jubran A, Tobin MJ. Reliability of pulse oximetry in titrating supplemental oxygen therapy in ventilator-dependent patients. Chest. 1990.
7. Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ. 2006;332(7549):1080.

### Editorial Addendum by Scott Weingart, @EMCrit

In Podcast 158, I attempted to discuss the FELLOW trial from the only perspective I am ever sure of when analyzing trials–the clinical perspective. I am not a methodologist, though I dabble in these dark arts every now and then. My methodology spidey sense was tingling when I read this paragraph in the supplement to the FELLOW trial:

> **Power Calculation**
>
> In the absence of prior data within our study population, we felt the randomized trial of non-invasive ventilation for pre-oxygenation by Baillard et al represented the best trial on which to base our power calculation. That study was powered to detect a minimum difference in lowest arterial oxygen saturation during endotracheal intubation of 5 percent.

While the author explained why this choice was made in my podcast interview, I was still concerned that strategy would in no way ensure an adequately powered trial, especially given that less than a third of the patients were actually exposed to apnea in the control arm. Rory's comments above give voice to my inchoate EBM fears.

## What do you think? Have Questions? I'll bring Rory on the show to address anything you mention below…


Awesome stuff Rory! I think your summary really helps make clear that this trial doesn’t truly answer the question of whether apneic oxygenation really works or not. It’s hard to frame it as one that does, given that only 30% of patients were truly apneic. I’d like to see a trial that truly randomizes ApOx vs none, with a standard method of PreOx for all comers. What sample size would be needed? Until then, it makes sense to keep on with an intervention that makes a lot of physiological sense with no known harms, no appreciable cost, and one that’s… Read more »

Love the critical thinking and breakdown of the trial – thanks for the time and effort. I am concerned that the loss of power due to nonparametric hypothesis testing is a bit overstated. Nonparametric tests are, by nature, more conservative – given our fear of type I error, this is a benefit. As you said, it comes at the cost of decreasing power. However, that decrease in power is not constant; usually for trials greater than 100 participants, the decreased power is negligible. Especially since all results were non-significant, I would not consider the choice of appropriate statistical test a… Read more »

Hi William, thank you so much for the kind words. I think your points on non-parametric testing are entirely valid. My intention in the post was to discuss the fact that this trial was incapable of differentiating a clinically important difference from statistical chance. Choosing a continuous variable of questionable clinical significance as their primary outcome was likely far more of a factor in why the study was underpowered than their use of a non-parametric analysis, but I thought it was important to discuss both factors in the analysis. To your second point, from my perspective the goal of apneic… Read more »

The underpinning philosophy of the P value is that the null hypothesis is correct. Therefore, I think that your opening statement is incorrect. This is explained in 1 and 2. It is also tempting to look for absolute differences that are non-significant then explain that the study is underpowered. The reader is then left with the impression that had the study been bigger then the differences would become significant. This is also incorrect and explained in 3. 1. Goodman SN. Towards evidence based medical statistics. The P value fallacy. Ann Intern Med 1999;130:995-1004. 2. Sedgwick P. Understanding P values. BMJ… Read more »

Hi Richard, Thank you so much for your response. While I completely agree with your statement, “the underpinning philosophy of the p-value is that the null hypothesis is true”, I don’t think a failure to reject the null hypothesis proves the null hypothesis. Rather it states the data was unable to demonstrate a difference. Goodman discusses this misconception in (1). Yes, I agree any retrospective power analysis in a negative trial will invariably demonstrate an underpowered trial. Rather using a confidence interval approach to determine whether a trial is capable of differentiating a clinically important outcome from statistical chance is… Read more »