This paper slipped across my twitter feed over the weekend.^{1} It was a bit disquieting to see that it was getting a lot of attention, despite being a methodological train wreck (seriously, MedTwitter, where’s the skepticism??). This post will briefly walk through some of the main flaws. There will be a bit of pharmacology, a modicum of methodology, and a lot of ranting.

In short, these authors sought to determine which antibiotics cause delirium in the ICU. They used a data set obtained from a prior study including 418 patients.^{2} Administration of various antibiotics was correlated with the occurrence of delirium. Since delirium may be affected by innumerable factors, this design suffers from considerable risk of confounding (e.g. infection causes delirium, with antibiotics merely being a *surrogate* *marker* for infection). Two multivariable regression models were used in attempts to remove the effect of confounding variables.

#### Flaw #1: Retrospective correlational study

Retrospective correlational studies are a paradox by design:

- A retrospective correlational design can never prove causality; it can only generate hypotheses.
- In reality, the vast majority of these studies are designed by someone who *already* has a hypothesis in mind which is being promoted.

The literature is chock full of these studies (heck, even I’ve published one).^{3} The great majority are uninformative or misleading. Occasionally, these studies can be useful (mine, for example). Overall, though, they must be interpreted with many, many grains of salt.

#### Flaw #2: Weird selection of variables

The authors divided antibiotics into seven groups:

- First, second, and third generation cephalosporins
- Fourth-generation cephalosporins
- Penicillins
- Carbapenems
- Fluoroquinolones
- Macrolides
- Everything else (vancomycin, antifungals, antivirals, metronidazole, aminoglycosides, linezolid, etc.)

The division of cephalosporins into Generations 1-3 versus cefepime makes little sense:

- First-generation cephalosporins (cefazolin) have poor meningeal penetration.
- Third-generation and fourth-generation cephalosporins have good meningeal penetration, potentially increasing the risk of delirium.

It would make more sense to break cephalosporins into three groups (first, third, and fourth generation). The reason this study combined the first, second, and third generation cephalosporins was probably out of necessity, in order to generate a group of patients large enough to analyze (only 13 patients received first or second-generation cephalosporins). This post-hoc conglomeration of variables to accommodate small sample size isn’t methodologically kosher.

#### Flaw #3: The sin of multiple unadjusted comparisons

This study analyzed six groups of antibiotics using two different regression models, for a total of *twelve* comparisons (without any mention of a primary analysis or primary endpoint). Using a Bonferroni correction, to correct for these twelve comparisons and maintain an overall statistical cutoff of *p*<0.05, each individual comparison should be evaluated with a lower *p*-value:

*p*-value for each comparison < 0.05/12 ≈ 0.004

Indeed, the paper hinges on only one of the comparisons having a *p*-value exactly equal to 0.004! This number might seem impressively low at first blush, but it shouldn’t. Taken together, the entire analysis is teetering on the edge of the *p*=0.05 precipice. As explored in a prior post, this is a very tenuous place to be. Some statisticians have suggested lowering the *p*-value cutoff to <0.005 for exploratory analyses, which would result in throwing this paper out on its ear.
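To make the Bonferroni arithmetic concrete, here is a minimal sketch. The function names are mine, and the family-wise error calculation assumes the twelve comparisons are independent, which they almost certainly aren't in this data set, so treat that number as a rough guide only.

```python
def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison cutoff that keeps the overall error rate at alpha."""
    return alpha / n_comparisons


def familywise_error_rate(alpha: float, n_comparisons: int) -> float:
    """Chance of at least one false positive if every comparison is tested
    at alpha with no correction (assumes independent comparisons)."""
    return 1 - (1 - alpha) ** n_comparisons


threshold = bonferroni_threshold(0.05, 12)
fwer = familywise_error_rate(0.05, 12)

print(f"Bonferroni cutoff per comparison: {threshold:.4f}")   # 0.0042
print(f"Uncorrected chance of >=1 false positive: {fwer:.2f}")  # 0.46
```

In other words, running twelve uncorrected comparisons at *p*<0.05 gives close to a coin-flip chance of at least one spurious "hit", and a reported *p*-value of 0.004 sits right at the corrected cutoff.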

It’s worth noting that the proportional odds logistic model is a more traditional approach to this sort of analysis. It is unequivocally negative here (none of the six comparisons even reaches a *p*<0.05 threshold). In comparison, the cluster-sandwich model picks up more signals as statistically significant. This is probably because the cluster-sandwich model is examining each day from within a patient’s stay separately as a discrete event, thereby increasing its sensitivity (perhaps too much).

#### Flaw #4: Signal/noise problem due to mixing of strong and weak variables

This is a common problem. For the sake of simplicity, let’s break variables into two groups:

- 1) Major influencers of delirium
  - Disease severity (e.g. SOFA score)
  - Intubation status
  - Benzodiazepine exposure
- 2) Minor influencers of delirium
  - ICU type
  - Antibiotic flavor

Major influencers will have a far greater impact on delirium outcomes than minor influencers. Ideally a multivariable model would be able to eliminate the effect of all confounding variables, so that we could isolate the effect of our variable of interest (antibiotic flavor). However, in reality multivariable models just aren’t this magical. They can *minimize* the effect of confounding variables, but they can’t *eliminate* them entirely. Thus, it’s likely that *some* effect from major influencers will seep through, overshadowing the impact of minor influencers.
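Here is a toy simulation of how a major influencer can seep through an adjustment. Everything in it is invented for illustration: the effect sizes, the noise levels, and the patient count are assumptions, and the adjustment is simple stratification on a noisy severity score rather than the regression models used in the paper. By construction the antibiotic has *zero* effect on delirium, and delirium depends only on true illness severity.

```python
import random


def simulate(n=50_000, seed=0):
    """Return (score, antibiotic, delirium) tuples for n made-up patients."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severity = rng.gauss(0, 1)                    # true severity (never observed)
        score = severity + rng.gauss(0, 1)            # noisy score we can adjust for
        antibiotic = severity + rng.gauss(0, 1) > 0   # sicker patients get the drug more often
        delirium = severity + rng.gauss(0, 1) > -0.5  # depends on severity only, not the drug
        rows.append((score, antibiotic, delirium))
    return rows


def risk(rows):
    """Fraction of patients in rows who are delirious."""
    return sum(d for _, _, d in rows) / len(rows)


def crude_difference(rows):
    """Delirium risk on the drug minus risk off the drug, no adjustment."""
    treated = [r for r in rows if r[1]]
    untreated = [r for r in rows if not r[1]]
    return risk(treated) - risk(untreated)


def adjusted_difference(rows, width=0.5):
    """Size-weighted average of within-stratum risk differences,
    stratifying on the noisy severity score in bins of the given width."""
    strata = {}
    for r in rows:
        strata.setdefault(int(r[0] // width), []).append(r)
    total, weighted = 0, 0.0
    for members in strata.values():
        treated = [r for r in members if r[1]]
        untreated = [r for r in members if not r[1]]
        if treated and untreated:  # skip strata missing one of the two groups
            weighted += len(members) * (risk(treated) - risk(untreated))
            total += len(members)
    return weighted / total


rows = simulate()
crude = crude_difference(rows)
adjusted = adjusted_difference(rows)
print(f"crude risk difference:    {crude:.3f}")
print(f"adjusted risk difference: {adjusted:.3f}")
```

Adjusting for the imperfect severity score shrinks the apparent "antibiotic effect", but it doesn't drive it to zero, even though the true effect is exactly zero. That leftover is residual confounding.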

#### Flaw #5: Two models disagree with each other.

This study analyzes the data using two regression models. They are presented on separate tables on different pages. Look what happens when we combine both results into one table:

It suddenly becomes uncomfortably clear that these two analyses disagree. Let’s compare the two models using a scatter plot:

This graph illustrates the effect of each variable on delirium, using both models. The level of agreement is extremely poor (with a correlation coefficient of only 0.2!). For example, dexmedetomidine *positively* correlated with delirium using one model, but *negatively* correlated with delirium using the other model. Poor agreement between these two models suggests that we’re chasing statistical noise here, not reproducible signals.

#### Flaw #6: Face validity: Failure to detect delirium from fluoroquinolones or cefepime

According to this analysis, neither cefepime nor fluoroquinolones cause delirium. Nonsense. There is a reasonable body of literature supporting that both fluoroquinolones and cefepime cause delirium.^{4–9} In particular, fluoroquinolones antagonize the GABA receptor, causing some folks to go completely bonkers with agitated delirium (the technical term for this is – seriously, and for good reason – *antibiomania*).^{10,11}

#### Flaw #7: Underpowering

So, why *didn’t* the study detect increased delirium due to cefepime or fluoroquinolones? This is simply a matter of underpowering. Powering a study to look for uncommon adverse events is never easy. This study is no exception. To understand this better, let’s take the case of cefepime.

Previously a study found that cefepime caused delirium in 15% of patients.^{9} Let’s imagine this is precisely true. How easy or hard would it be to detect an effect size of 15% within this study?

The study included 64 patients treated with cefepime. So if cefepime causes delirium in 15% of patients, it might be expected to cause delirium in 10 of these 64 patients. So we’re looking for a difference of 10 patients.

But wait, it’s not that simple! Overall there is a *baseline rate* of delirium of 75% among all patients, because delirium in the ICU is extremely common. So, the baseline rate of delirium among this subgroup would be ~48/64 without any cefepime at all. We can only detect the impact of cefepime among the remaining 16 non-delirious patients. So we’re actually looking for a difference of (15%)(16) ≈ 2 patients! We're looking for a needle (cefepime-induced delirium) in a haystack (delirium due to everything else).
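The back-of-the-envelope arithmetic above can be written out as a short sketch. The variable names are mine; the numbers (64 patients, 15% drug-induced delirium, 75% baseline) come straight from the text.

```python
n_cefepime = 64        # patients treated with cefepime
drug_rate = 0.15       # assumed rate of cefepime-induced delirium
baseline_rate = 0.75   # delirium rate in the ICU from all other causes

# Naive view: every cefepime patient is "available" to become delirious.
naive_cases = drug_rate * n_cefepime

# Reality: most patients are already delirious for other reasons.
baseline_cases = baseline_rate * n_cefepime
at_risk = n_cefepime - baseline_cases

# Only the not-yet-delirious patients can reveal a drug effect.
excess_cases = drug_rate * at_risk

print(f"naive expected cases:      {naive_cases:.1f}")     # 9.6
print(f"already delirious:         {baseline_cases:.0f}")  # 48
print(f"left to show a drug effect: {at_risk:.0f}")        # 16
print(f"detectable excess cases:   {excess_cases:.1f}")    # 2.4
```

So the signal we're hunting for is a couple of patients at most, buried among roughly 48 who would have been delirious anyway.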

Overall, due to the small number of patients in each subgroup and the large baseline rate of delirium, this study was woefully underpowered to detect small differences in delirium due to antibiotics. Therefore, the lack of evidence that cefepime or fluoroquinolones caused delirium doesn’t absolve these antibiotics. This result was *inevitable* by design (there was no attempt at power calculations before the study).
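For a rough sense of just how underpowered this is, here is a crude power estimate. It's a sketch under several assumptions that are mine, not the paper's: a simple two-sided two-proportion z-test (normal approximation) rather than the regression models actually used, a comparison group made up of the other 354 of the 418 patients, and a cefepime delirium rate of 75% + (15%)(25%) = 78.75%.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p_control, p_treated, n_control, n_treated, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test."""
    std_normal = NormalDist()
    # Standard error of the difference in proportions (unpooled).
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_treated * (1 - p_treated) / n_treated)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    z_effect = abs(p_treated - p_control) / se
    # Probability of landing in either rejection region.
    return std_normal.cdf(z_effect - z_crit) + std_normal.cdf(-z_effect - z_crit)


baseline = 0.75
with_cefepime = baseline + 0.15 * (1 - baseline)  # 0.7875

power = approx_power(baseline, with_cefepime, n_control=418 - 64, n_treated=64)
print(f"approximate power: {power:.0%}")
```

Under these assumptions the power comes out far below the conventional 80% target, so a "negative" result for cefepime tells us very little.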

#### Flaw #8: Dangerously bold conclusions

Many sins can be forgiven if the conclusion is sufficiently conservative. A reasonable conclusion to this study might be: *Perhaps early-generation cephalosporins are more deliriogenic than we realized; this deserves further study*.

Instead, the conclusion section doubles down:

The conclusion that neither cefepime nor fluoroquinolones cause delirium is wrong. The subsequent implication that encephalopathic patients on cefepime or fluoroquinolones should continue such medications is potentially dangerous.

- The road to hell is paved with retrospective, correlational studies. Such studies cannot prove causality, so they should only very rarely affect practice.
- If there are multiple simultaneous comparisons, the *p*-value cutoff should be reduced accordingly (especially since *p*<0.05 isn’t a very stringent cutoff to begin with).
- Multiple-regression models can neutralize some of the influence of confounding variables, but they aren’t magical and cannot remove confounders entirely.
- The results of a study should ideally have some *face validity* (e.g. agreement with prior studies).
- If a dataset is analyzed using different techniques, the results should be fairly consistent across different statistical techniques.
- Underpowering in correlational studies is potentially problematic, so inability to detect an effect doesn't necessarily prove that no effect exists.

#### Related

- P-values: .050 shades of grey
- Fake news: Correlational studies
- IBCC chapter on delirium
- Six reasons to avoid fluoroquinolones in critical illness

#### References

1. *Crit Care*. 2018;22(1). doi:10.1186/s13054-018-2262-z
2. *N Engl J Med*. 2013;369(14):1306-1316.
3. *Lung*. 2010;188(2):173-178.
4. *J Clin Med Res*. 2018;10(9):725-727.
5. *Psychosomatics*. 2018;59(3):259-266.
6. *J Anaesthesiol Clin Pharmacol*. 2015;31(3):410-411.
7. *Curr Drug Saf*. August 2013.
8. *Crit Care*. 2017;21(1):276.
9. *Crit Care*. 2013;17(6):R264.
10. *J Clin Diagn Res*. 2016;10(12):VL01.
11. *Psychosomatics*. 2007;48(4):363.

### Josh Farkas


These garbage studies only exist to pad resumes.

Josh

I don’t run with biostatisticians. However, assuming, like docs, there is a minimum competency threshold, i.e., treat K+>7.0, or accountants and their playbook (GAAP), stat folks and their ilk would look at data, hit their forehead, and say, “of course, Bonferroni.” What happens? And this is a serious question. Why?

Thanks

Brad

Articles deal with multiple comparisons in a variety of different ways, so there isn’t necessarily a single way that it should or must be done. Some articles ignore this issue entirely, others calculate specific lower p-value cutoffs (e.g. with Bonferroni), and some articles go so far as to avoid printing a lot of p-values. I’m sure it wouldn’t be difficult to find statisticians that think Bonferroni is either too strict or too lax of a cutoff as well.

So I agree that this is a problem, and it’s a very common problem. It’s probably more of an issue with the field of statistics lacking a single unified best-practice standard, rather than a failure of an individual statistician. And ultimately some of the responsibility rests also with the readers to keep track of the number of comparisons going on.
