PulmCrit - Oscar awards for the best COVID prognostic models

We are continually tasked with triaging COVID patients, a situation which will become more complex as the numbers continue to rise. This involves making educated guesses about which patients are most likely to deteriorate, and which patients may benefit most from critical care. That’s enormously difficult.

At this point, we’re quite familiar with individual risk factors for poor outcomes in COVID-19. The tough part is that there are literally dozens of them (e.g., age, comorbidities, elevated LDH, elevated CRP, D-dimer, hypoxemia, tachypnea, diffuse infiltrates). The challenge is how to integrate these risk factors in the most accurate fashion. For example, if a patient has lymphopenia and an elevated C-reactive protein, is that twice as bad as lymphopenia alone? Or are lymphopenia and C-reactive protein measuring the same underlying phenomenon – such that if we consider them separately, we’re “overcounting” a single underlying risk factor?

A well-designed and validated prognostic model solves the problem of how we can combine numerous data points in the most accurate fashion possible. It also should allow us to do this in an objective, reproducible, and fair way.
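One generic way to picture the "overcounting" problem is to fit the risk factors jointly rather than one at a time. The expression below is only a sketch, assuming a logistic regression model (a common choice for prognostic scores, but not claimed here to be the method behind any specific score discussed in this post). The symbols are illustrative and do not come from any of the cited studies.

```latex
\log\frac{p}{1-p} \;=\; \beta_0 \;+\; \beta_1\, x_{\mathrm{lymphopenia}} \;+\; \beta_2\, x_{\mathrm{CRP}}
```

Here $p$ is the predicted probability of a poor outcome and each $x$ is 1 if the abnormality is present and 0 otherwise. If the two markers largely reflect the same underlying process, the jointly fitted coefficients $\beta_1$ and $\beta_2$ will tend to be smaller than the coefficients obtained when each marker is fitted alone, which is how a multivariable model avoids counting the same risk twice.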

the problem with most COVID prognostic models

Dozens of prognostic models for COVID have been created. The vast majority are not currently suitable for clinical use.¹ Common problems with most models include the following:

Derivation from homogeneous populations of patients (which may fail to generalize accurately in more diverse contexts).
Derivation from small populations of patients. Models are often designed to predict death, an uncommon event. If the sample size is too small, the model may be subject to overfitting (see figure below).
Lack of validation: Ideally, a prognostic model should be validated by an external study of patients at another center.
Impossibly complex models: Some models require advanced computations that cannot be readily replicated.
Reliance on laboratory values which vary between laboratories: for example, D-dimer values may vary considerably depending on the assay.
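The small-sample problem above can be made concrete with a quick "events per variable" check. This is a minimal sketch: the function names are hypothetical, and the threshold of 10 events per candidate predictor is a commonly quoted rule of thumb rather than something taken from the studies cited here. The small cohort is invented for illustration; the large cohort borrows the approximate size and mortality of the 4C derivation cohort described later in this post, with an assumed 15 candidate predictors.

```python
def events_per_variable(n_patients: int, event_rate: float, n_predictors: int) -> float:
    """Expected number of outcome events (e.g., deaths) per candidate predictor."""
    if n_predictors <= 0:
        raise ValueError("n_predictors must be positive")
    expected_events = n_patients * event_rate
    return expected_events / n_predictors


def is_underpowered(epv: float, min_epv: float = 10.0) -> bool:
    """Flag a derivation cohort whose events-per-variable falls below a
    rule-of-thumb minimum (assumed to be 10 here; this is a heuristic)."""
    return epv < min_epv


# Hypothetical small single-center cohort: 200 patients, 10% mortality,
# 15 candidate predictors screened.
small_cohort_epv = events_per_variable(200, 0.10, 15)

# Large cohort: roughly 35,463 patients with ~30% mortality; the 15
# candidate predictors are an assumption for illustration.
large_cohort_epv = events_per_variable(35463, 0.30, 15)

print(f"small cohort EPV: {small_cohort_epv:.2f}, underpowered: {is_underpowered(small_cohort_epv)}")
print(f"large cohort EPV: {large_cohort_epv:.2f}, underpowered: {is_underpowered(large_cohort_epv)}")
```

With only about one event per candidate predictor, the small cohort leaves plenty of room for a model to fit noise rather than signal.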

An example of a popular model which fails these criteria is the COVID-GRAM model.² This model was created early in the pandemic and was quite appealing (based on logical, transparent construction). However, the COVID-GRAM model largely failed replication in two studies evaluating its ability to predict mortality (with an area under the receiver-operator curve of 0.64 and 0.66).³,⁴

One case report describes a 67-year-old patient with advanced squamous cell carcinoma whose risk of critical illness was predicted to be 99.3% based on the COVID-GRAM prognostic tool. Based partially on this prognostication, the patient was treated with a comfort-directed plan of care. However, the patient actually went on to recover and survive his illness. This illustrates the risk of placing excess faith in a prognostic score.⁵
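For readers less familiar with the area under the receiver-operator curve quoted above, it can be read as the probability that a randomly chosen patient who had the outcome received a higher score than a randomly chosen patient who did not. The sketch below computes it that way, by comparing every such pair. The function name and the eight-patient dataset are made up purely for illustration.

```python
def auc(scores, outcomes):
    """Area under the ROC curve via pairwise comparison.

    For every (event, non-event) pair of patients, count 1 if the patient
    with the event had the higher score, 0.5 for a tie, and 0 otherwise.
    The AUC is the average over all such pairs.
    """
    with_event = [s for s, y in zip(scores, outcomes) if y == 1]
    without_event = [s for s, y in zip(scores, outcomes) if y == 0]
    if not with_event or not without_event:
        raise ValueError("need at least one patient with and one without the event")

    concordant = 0.0
    for s_event in with_event:
        for s_none in without_event:
            if s_event > s_none:
                concordant += 1.0
            elif s_event == s_none:
                concordant += 0.5
    return concordant / (len(with_event) * len(without_event))


# Hypothetical risk scores and outcomes (1 = died, 0 = survived).
example_scores = [2, 5, 7, 9, 4, 12, 6, 15]
example_outcomes = [0, 0, 0, 1, 0, 1, 1, 1]

print(auc(example_scores, example_outcomes))  # 15 of 16 pairs are ranked correctly
```

A score that ranks patients no better than chance gives 0.5 and a perfect ranking gives 1.0, so values of 0.64 to 0.66 sit much closer to a coin flip than to perfect discrimination.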

best overall COVID prognostic score: 4C mortality score

The 4C mortality score was created by a consortium of investigators working in 260 hospitals across England, Scotland, and Wales (4C = Coronavirus Clinical Characterisation Consortium).⁶ The model was derived from a dataset of 35,463 patients between February and May, and was subsequently validated prospectively among 22,361 patients between May and June.

The model, as well as the interpretation of scores, is shown below.

The score is available at MDCalc. The ISARIC consortium also created an online risk calculator, which plots out the patient’s risk nicely:

The score performed well with an area under the receiver-operator curve of 0.79 in the derivation cohort and 0.77 in the validation cohort. Among patients in the validation cohort, the 4C model performed better than a variety of other models:

The authors also created a more sophisticated machine-learning model. The machine-learning model did perform better than the 4C model – but only by a trivial degree. This implies that although the 4C model is simplified for easy clinical use, it still performs almost identically to a more sophisticated model.

The 4C model has been validated by investigators in Italy, who found that it was the most accurate of several models (with an area under the receiver-operator curve of 0.799).⁷ Thus, the 4C model has been validated in two international patient cohorts, making it perhaps the most robustly validated COVID prognostic model. Another unique strength of the model is that Knight et al. validated it across a variety of different ethnic groups:⁶

It’s essential to appreciate that the 4C model predicts mortality (not the need for admission or for ICU care). For example, a patient who is admitted to the ICU, intubated, and eventually survives would simply be counted as a “survivor” in this model. Thus, caution is required when interpreting low scores: a low risk of death doesn’t guarantee that the need for advanced medical intervention is low.

Over the course of the pandemic, we have seen shifts in mortality rates (e.g., due to strain on the hospital system, and to the fraction of patients seeking medical attention). Therefore, the 4C model cannot be expected to predict each patient’s absolute mortality. For example, the mortality rate among patients used to create the model was ~30% – which is pretty high. This may be partially explained by the model’s being generated based solely on data from admitted patients, which will inevitably cause it to overestimate mortality (compared to the total population of infected persons, which includes many outpatients). Regardless, the model should still serve to distinguish higher-risk versus lower-risk patients.

Disposition decisions are always nuanced, depending on numerous factors (e.g., intensity of follow-up and available resources). However, the 4C score could be used as one factor to help determine patient disposition. A 4C score of 1-3 suggests low mortality, which might be consistent with disposition home or to a low-intensity monitoring facility outside of a hospital. Intermediate-risk stratification might suggest admission to a ward, whereas higher risk patients might benefit from step-down or ICU level care.
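The disposition idea above could be sketched roughly as follows. Several things here are assumptions: the band cut-offs (0-3 low, 4-8 intermediate, 9-14 high, 15 or more very high) and the 0-21 score range are taken to match the risk groups published with the 4C score and should be checked against the original paper or MDCalc, and the function and class names are hypothetical. The suggestions simply restate the paragraph above; they are one input to a decision, not a rule.

```python
from typing import NamedTuple


class RiskBand(NamedTuple):
    label: str
    suggestion: str


def four_c_band(score: int) -> RiskBand:
    """Map a 4C mortality score to a risk band and a tentative disposition.

    Cut-offs are assumed (0-3, 4-8, 9-14, >=15); verify before any real use.
    """
    if not 0 <= score <= 21:  # assumed score range
        raise ValueError("4C score is assumed to range from 0 to 21")
    if score <= 3:
        return RiskBand("low", "consider home or low-intensity monitoring facility")
    if score <= 8:
        return RiskBand("intermediate", "consider ward admission")
    if score <= 14:
        return RiskBand("high", "consider step-down or ICU-level care")
    return RiskBand("very high", "consider step-down or ICU-level care")


for example_score in (2, 6, 12, 20):
    band = four_c_band(example_score)
    print(f"4C score {example_score}: {band.label} risk, {band.suggestion}")
```

Because the score predicts death rather than the need for intervention, a "low" band here does not rule out a need for oxygen or close monitoring.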

best bedside physiologic risk score: quick COVID Severity Index (qCSI)

The 4C score integrates three types of information:

Baseline characteristics (e.g., age, comorbidities)
Acute physiological abnormalities (e.g., oxygen saturation, respiratory rate)
Laboratory abnormalities (e.g., C-reactive protein)

Bedside physiologic risk scores focus on just one of these types of information: Acute physiological abnormalities. Ignoring baseline characteristics and laboratory abnormalities will obviously make physiological risk scores less powerful in predicting overall mortality. However, focusing down on acute physiological abnormalities has the following advantages:

Bedside physiological risk scores can be easily and immediately calculated at the bedside, without delays required for laboratory testing.
Physiological risk scores can be repeated in serial fashion to determine the patient’s trajectory. By focusing purely on current physiological disarray, these scores can track a patient’s progress over time. A major use of these scores is as an early warning system, to detect deterioration.
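Using a bedside score as an early warning system amounts to watching how serial measurements change. The following is a minimal sketch, assuming a hypothetical rule that flags any rise of two or more points between consecutive measurements. That threshold is invented for illustration and has not been validated for any score.

```python
def flag_deterioration(serial_scores, rise_threshold=2):
    """Return the positions (0-based) of measurements where the score rose
    by at least `rise_threshold` points compared with the previous one.

    `rise_threshold` is a hypothetical cut-off, not a validated value.
    """
    flagged = []
    for i in range(1, len(serial_scores)):
        if serial_scores[i] - serial_scores[i - 1] >= rise_threshold:
            flagged.append(i)
    return flagged


# Hypothetical scores recorded every few hours for one patient.
patient_scores = [1, 1, 2, 5, 6, 9]
print(flag_deterioration(patient_scores))  # positions of the two sharp rises
```

In practice the trigger would be tuned locally and interpreted alongside the absolute score and the clinical picture.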

Among various early warning scores, the quick COVID Severity Index (qCSI) appears to be the best suited for COVID.⁸ qCSI was developed at Yale with the goal of predicting which patients would progress to respiratory failure within 24 hours of admission (defined as a composite of requiring >10 liters/minute oxygen, high-flow nasal cannula, noninvasive ventilation, invasive ventilation, or death). Similar to the 4C score, this score was based solely upon studying patients admitted to the hospital.

The qCSI involves three simple variables: respiratory rate, lowest pulse oximetry recorded during the initial four hours of the emergency department stay, and the patient’s oxygen flow rate. It can be easily calculated at MDCalc:

Initially the qCSI model was derived from 932 patients and validated on an independent set of 240 patients. Among the validation cohort, the area under the ROC curve was 0.81. Patients can be divided into four groups with differing risk of respiratory failure:
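As a rough sketch of how the three variables combine into a score and a risk group: the point values and group boundaries below are assumptions based on the score as it is commonly tabulated (for example on MDCalc) and should be verified against the original publication before any use. The function names are hypothetical.

```python
def qcsi(resp_rate: float, lowest_spo2: float, o2_flow_l_min: float) -> int:
    """Quick COVID Severity Index (assumed point values; verify before use).

    resp_rate      breaths per minute
    lowest_spo2    lowest pulse oximetry (%) in the first four hours
    o2_flow_l_min  supplemental oxygen flow in liters/minute (0 to 6)
    """
    if not 0 <= o2_flow_l_min <= 6:
        # The score is assumed to cover only patients on 6 L/min or less.
        raise ValueError("qCSI is assumed to apply to oxygen flow of 0-6 L/min")

    points = 0
    # Respiratory rate: <=22 -> 0, 23-28 -> 1, >28 -> 2
    points += 0 if resp_rate <= 22 else 1 if resp_rate <= 28 else 2
    # Lowest pulse oximetry: >92% -> 0, 89-92% -> 2, <=88% -> 5
    points += 0 if lowest_spo2 > 92 else 2 if lowest_spo2 > 88 else 5
    # Oxygen flow: <=2 -> 0, 3-4 -> 4, 5-6 -> 5
    points += 0 if o2_flow_l_min <= 2 else 4 if o2_flow_l_min <= 4 else 5
    return points


def qcsi_risk_group(score: int) -> str:
    """Assumed grouping: 0-3, 4-6, 7-9, 10-12."""
    if not 0 <= score <= 12:
        raise ValueError("qCSI is assumed to range from 0 to 12")
    if score <= 3:
        return "low"
    if score <= 6:
        return "low-intermediate"
    if score <= 9:
        return "high-intermediate"
    return "high"


# Hypothetical patient: RR 24, lowest SpO2 90%, on 4 L/min nasal cannula.
example = qcsi(24, 90, 4)
print(example, qcsi_risk_group(example))
```

Note how heavily the oxygen flow rate and saturation are weighted relative to the respiratory rate in this tabulation.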

These investigators also created a more complex model including additional variables, such as age and laboratory tests. However, the addition of these variables didn’t improve the model’s performance significantly. This suggests that for short-term prediction of respiratory failure, current respiratory physiology may be of paramount importance. Alternatively, it’s possible that the more complex model was overfitted, causing it to generalize poorly to subsequent cohorts of patients.

Investigators in Italy evaluated the ability of various models to predict mortality among 210 patients admitted with COVID.⁷ qCSI performed nearly as well as much more complex models (e.g., the 4C score), despite the fact that qCSI wasn’t designed for mortality prediction (table below). The performance of qCSI was essentially identical to the NEWS score (NEWS is a highly validated early-warning score that quantifies acute physiological abnormalities). This performance is overall impressive: the streamlined qCSI seems to be punching above its weight.

Investigators in Chicago evaluated the ability of three models to predict ICU admission among 313 patients admitted with COVID (qCSI; CURB65, which is a common pneumonia risk stratification tool; and the Brescia-COVID Respiratory Severity Scale).⁹ qCSI had the best performance, with an area under the ROC curve of 0.76. However, most of the patients who transferred to ICU had a qCSI score of 1-3, which is relatively low. This could relate to hospital-specific ICU transfer criteria (e.g., transfer in anticipation of respiratory failure, rather than in response to frank respiratory failure).

It’s unclear precisely how qCSI would add to expert clinical assessment. Tracking the exact respiratory rate, oxygen demands, and the clinical gestalt is probably the best way to determine how sick COVID-19 patients are. However, if you’re looking for a simple bedside physiological assessment score, qCSI is a reasonable option. Additional work to clarify precise cutoff values would help further establish its generalizability.

runner up physiological risk score: NEWS2 score

Among physiological early warning scores, NEWS2 might be the best validated. It has been widely utilized for years and validated for a variety of conditions. An Italian study by Gidari et al. found that a NEWS2 score of six or more suggested that intensive care may be required for COVID patients:¹⁰

Likewise, a study from Norway found that the NEWS2 score performed well to predict the development of more severe disease (with an area under the ROC curve of 0.882).¹¹ A NEWS2 score of 6 or more predicted more severe disease with roughly 80% sensitivity and specificity.

NEWS2 can be calculated as shown in the figure below (using the SpO2 scale #1, which is designed for hypoxemic respiratory failure). The NEWS2 score can be easily calculated using MDCalc here.

Liu et al. discovered something very interesting about the NEWS2 score in predicting mortality among COVID patients.¹² Solely evaluating the oxygen saturation subscore within the NEWS2 yielded essentially the same performance as the entire score (the oxygen saturation subscore had an area under the ROC curve of 0.875, compared to the entire composite score’s area under the ROC curve of 0.880; see below figure). The respiratory rate subscore had an area under the curve of 0.687, with the remaining subscores of NEWS2 having almost no predictive capacity (area under the curve close to 0.5). This suggests that the NEWS2 score’s performance in patients presenting with COVID is driven almost entirely by respiratory physiology variables (with other variables not adding substantially to it). This makes sense, because patients generally present with isolated respiratory failure. Gupta et al. similarly found that admission oxygen saturation on room air alone had the same performance for predicting deterioration as the entire NEWS2 score.¹³

Given the importance of respiratory physiology, it’s possible that NEWS2 doesn’t place enough emphasis on this. For example:¹⁴

Overall, the NEWS2 score does perform well and is well-validated for use in COVID to predict clinical deterioration. So if your unit is already using the NEWS2 score, that’s a fine choice. However, quantifying all of the non-respiratory variables is probably a waste of time for patients presenting early with COVID pneumonia. For such patients, a shorter score which focuses purely on respiratory physiology (e.g., qCSI) could achieve equivalent performance with much less effort.

where should decision support tools fit in clinical medicine?

We’re surrounded by decision support tools. Like the 4C score and the qCSI score, these tools are typically compared to other decision support tools, to determine which is best. The more important question is how these decision support tools compare to real-world clinical judgement. When decision support tools are compared to clinician judgement, even the best decision support tools are usually not superior to clinical judgement.¹⁵ Notably, none of the above models achieves an area under the ROC curve over 0.9 – so they are all far from perfect.

Thus, prognostic scores obviously shouldn’t replace clinician judgment (for example, when determining patient disposition). They might, however, serve as an electronic second opinion. For example, if you decided to triage a patient to the medicine ward, but then calculated their 4C score to be 20 (very high risk of mortality) – well, you might want to reconsider that decision. The ward might still be the right place for that patient, but the decision merits reconsideration.
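The "electronic second opinion" idea can be sketched as a simple discordance check between the clinician's planned disposition and the calculated score. Everything below is illustrative: the function name, the disposition labels, and the cut-off of 15 for "very high risk" are assumptions (the cut-off is taken to match the published 4C risk groups and should be verified). The check only prompts a second look and never overrides the clinician.

```python
def needs_second_look(planned_disposition: str, four_c_score: int,
                      very_high_cutoff: int = 15) -> bool:
    """Return True when a lower-intensity disposition is planned for a
    patient whose 4C score falls in the (assumed) very-high-risk range."""
    lower_intensity = {"home", "ward"}  # hypothetical disposition labels
    return planned_disposition in lower_intensity and four_c_score >= very_high_cutoff


# The scenario from the paragraph above: ward triage with a 4C score of 20.
if needs_second_look("ward", 20):
    print("4C score is very high for a ward disposition; consider re-evaluating.")
```

The ward may still be the right place for that patient; the flag simply marks the decision as one worth revisiting.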

Dozens of prognostic scoring systems have been designed for COVID-19. Most have not been adequately validated and are not currently appropriate for general clinical use.
The 4C score currently appears to be the most robust and best-validated overall score to predict COVID-related mortality. Shifts in mortality over time may prevent the 4C score from correctly predicting the absolute mortality risk – but the score may remain a useful risk-stratification tool.
qCSI may be the most effective physiological risk score. It is a simple bedside tool which helps predict the patient’s likelihood of progressing to respiratory failure within the next 24 hours.
It remains unknown how much prognostic scores might add to our clinical judgement. Like other decision support tools, these certainly cannot supplant clinical judgement – but they might be useful as an electronic second opinion.

qCSI score – RebelEM by Anand Swaminathan
Brescia-COVID respiratory severity scale – FOAMCast by Lauren Westafer
Best calculators:
- 4C score
- qCSI score
Old theoretical post about severity indices (PulmCrit)

Image Credits: By Happymeluv at Vecteezy

references

1. Wynants L, Van C, Collins G, et al. Prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal. BMJ. 2020;369:m1328. doi:10.1136/bmj.m1328
2. Liang W, Liang H, Ou L, et al. Development and Validation of a Clinical Risk Score to Predict the Occurrence of Critical Illness in Hospitalized Patients With COVID-19. JAMA Intern Med. 2020;180(8):1081-1089. doi:10.1001/jamainternmed.2020.2033
3. Al H, Cocks E, Jesani L, Lewis S, Szakmany T. Clinical Risk Prediction Scores in Coronavirus Disease 2019: Beware of Low Validity and Clinical Utility. Crit Care Explor. 2020;2(10):e0253. doi:10.1097/CCE.0000000000000253
4. Yildiz H, Yombi J, Castanares-Zapatero D. Validation of a risk score to predict patients at risk of critical illness with COVID-19. Infect Dis (Lond). Published online October 1, 2020:1-3. doi:10.1080/23744235.2020.1823469
5. Bisson E, Presswood E, Kenyon J, Shelton F, Hall T. Against the odds: unlikely COVID-19 recovery. BMJ Support Palliat Care. Published online July 3, 2020. doi:10.1136/bmjspcare-2020-002477
6. Knight S, Ho A, Pius R, et al. Risk stratification of patients admitted to hospital with covid-19 using the ISARIC WHO Clinical Characterisation Protocol: development and validation of the 4C Mortality Score. BMJ. 2020;370:m3339. doi:10.1136/bmj.m3339
7. Covino M, De M, Burzo M, et al. Predicting in-hospital mortality in COVID-19 older patients with specifically developed scores. J Am Geriatr Soc. Published online November 16, 2020. doi:10.1111/jgs.16956
8. Haimovich A, Ravindra N, Stoytchev S, et al. Development and Validation of the Quick COVID-19 Severity Index: A Prognostic Tool for Early Clinical Decompensation. Ann Emerg Med. 2020;76(4):442-453. doi:10.1016/j.annemergmed.2020.07.022
9. Rodriguez-Nava G, Yanez-Bello M, Trelles-Garcia D, Chung C, Friedman H, Hines D. Performance of the Quick COVID-19 Severity Index and the Brescia-COVID Respiratory Severity Scale in hospitalized patients with COVID-19 in a community hospital setting. Int J Infect Dis. Published online November 9, 2020. doi:10.1016/j.ijid.2020.11.003
10. Gidari A, De S, Sabbatini S, Francisci D. Predictive value of National Early Warning Score 2 (NEWS2) for intensive care unit admission in patients with SARS-CoV-2 infection. Infect Dis (Lond). 2020;52(10):698-704. doi:10.1080/23744235.2020.1784457
11. Myrstad M, Ihle-Hansen H, Tveita A, et al. National Early Warning Score 2 (NEWS2) on admission predicts severe disease and in-hospital mortality from Covid-19 – a prospective cohort study. Scand J Trauma Resusc Emerg Med. 2020;28(1):66. doi:10.1186/s13049-020-00764-3
12. Liu F, Sun X, Zhang Y, et al. Evaluation of the Risk Prediction Tools for Patients With Coronavirus Disease 2019 in Wuhan, China: A Single-Centered, Retrospective, Observational Study. Crit Care Med. 2020;48(11):e1004-e1011. doi:10.1097/CCM.0000000000004549
13. Gupta R, Marks M, Samuels T, et al. Systematic evaluation and external validation of 22 prognostic models among hospitalised adults with COVID-19: An observational cohort study. Eur Respir J. Published online September 25, 2020. doi:10.1183/13993003.03498-2020
14. Lim N, Pan D, Barker J. NEWS2 system requires modification to identify deteriorating patients with COVID-19. Clin Med (Lond). 2020;20(4):e133-e134. doi:10.7861/clinmed.Let.20.4.6
15. Schriger D, Elder J, Cooper R. Structured Clinical Decision Aids Are Seldom Compared With Subjective Physician Judgment, and Are Seldom Superior. Ann Emerg Med. 2017;70(3):338-344.e3. doi:10.1016/j.annemergmed.2016.12.004