The Adventure of the Three Students

There are few things that make emergency medicine nerds more excited than decision rules. So when the Canadian gods of decision instruments, Jeff Perry and Ian Stiell, published the results from the validation cohort of their recently derived SAH decision rules, they of course had our full attention.

On September 25th, 2013, JAMA published this very article, entitled “Clinical Decision Rules to Rule Out Subarachnoid Hemorrhage for Acute Headache”. This was a multicenter prospective cohort study attempting to validate the three rules derived by Perry et al. and published in the BMJ in 2010 (2). The knowledge that these instruments were crafted by Dr Perry and Dr Stiell ensures that they were built in the most methodologically sound fashion. Even with the highest of standards, the inherent flaws of constructing a decision rule can lead to instability of the rule and its subsequent failure when it is validated.

This statement should in no way detract from the amazing work Perry et al. did when deriving and validating their decision instruments for SAH. It is a wonderful prospective dataset, the likes of which we have never seen before, and one that may never again be duplicated. This is more a comment on the innate weakness of decision rules in general.

The Derivation…

The authors used recursive partitioning to construct the three decision instruments attempting to safely rule out SAH. Recursive partitioning is a form of multivariable analysis that many practitioners prefer to linear or logistic regression, as it is more practical and user friendly. Bayesian in nature, it uses a set of clinical variables to separate patients by their risk of a given pathology. The rules are built using what is called a “statistical tree”. Imagine you are trying to construct a rule that helps you decide who is having a myocardial infarction (Fig. 1). You take a cohort of 100 people, 10% of whom are having an infarction. You then divide your group by the presence or absence of EKG changes. Only 3 of the 10 patients having an MI in this cohort have EKG changes, and 10 patients without an MI also happen to have EKG changes. This is the first branch in your decision tree. If you stopped here your decision rule would have a sensitivity of 30% and a specificity of 89%. You would miss 70% of the cohort’s infarctions, a poor decision rule by any standard. To further increase the sensitivity of your rule, the next branch you add is positive or negative cardiac enzymes. In this branch you capture an additional 5 of the 7 remaining MIs. Unfortunately you have also included 47 patients who are not having an MI. Since you want to build a rule that is 100% sensitive, you add a further branch: age below or above 30. In this final branch you are able to capture the remaining two MIs, deriving a rule with the potential to be 100% sensitive. This rule has been built around a specific cohort, so to test its stability you must now apply it to an entirely separate group of patients.
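The arithmetic of the MI example can be sketched in a few lines of Python. This is only an illustration of how sensitivity climbs and specificity falls as branches are added, not anyone's actual rule. The names are made up for the sketch, and it assumes the 47 non-MI patients flagged by cardiac enzymes are in addition to the 10 flagged by EKG changes (the text does not say). No false-positive count is given for the age branch, so specificity is left unstated there.

```python
COHORT_SIZE = 100
WITH_MI = 10
WITHOUT_MI = COHORT_SIZE - WITH_MI

# (branch, MIs newly captured, non-MI patients newly flagged; None = not stated)
BRANCHES = [
    ("EKG changes", 3, 10),
    ("cardiac enzymes", 5, 47),  # assumption: 47 is on top of the 10 above
    ("age", 2, None),            # the example gives no false-positive count
]


def cumulative_performance(branches, diseased, healthy):
    """Return (branch, sensitivity, specificity) after each added branch."""
    rows = []
    true_pos, false_pos = 0, 0
    for name, new_tp, new_fp in branches:
        true_pos += new_tp
        # Once a false-positive count is unknown, specificity stays unknown.
        if new_fp is None or false_pos is None:
            false_pos = None
        else:
            false_pos += new_fp
        sens = true_pos / diseased
        spec = None if false_pos is None else (healthy - false_pos) / healthy
        rows.append((name, sens, spec))
    return rows


for name, sens, spec in cumulative_performance(BRANCHES, WITH_MI, WITHOUT_MI):
    spec_text = "not stated" if spec is None else f"{spec:.0%}"
    print(f"after {name}: sensitivity {sens:.0%}, specificity {spec_text}")
```

Each branch buys sensitivity at the cost of flagging more patients who do not have the disease, which is the trade-off the rest of this post keeps returning to.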

This is where most decision rules fail. A rule is built, and with each subsequent branch it becomes more and more dependent on the specific characteristics of the cohort from which it is being constructed. Variables that seem statistically important in one cohort may be the result of random chance rather than a true association with the disease process we are attempting to isolate. In such cases the rule will never perform as well when it is applied to an external population. This is what is called “overfitting”. It is more likely to occur when you derive a rule from small cohorts, use an excessive number of branches in your statistical tree, or dredge your data for opportunistic variables. Even when everything is done perfectly (as was the case in the Perry derivation set), the stability of decision rules is inherently flawed (4).
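A toy simulation makes the point about opportunistic variables concrete. It is a sketch under invented assumptions, not a model of any real dataset: every candidate predictor is pure noise (positive in a diseased patient with probability 0.5), and the cohort sizes and number of candidates are arbitrary. The "dredging" step keeps whichever noise variable looked most sensitive in a small derivation cohort, then checks that same variable in a fresh cohort.

```python
import random


def noise_sensitivity(rng, n_diseased, p_positive=0.5):
    """Sensitivity of a meaningless predictor among n_diseased patients."""
    hits = sum(rng.random() < p_positive for _ in range(n_diseased))
    return hits / n_diseased


def simulate(n_trials=500, n_candidates=20, n_derivation=10,
             n_validation=100, seed=0):
    """Average sensitivity of the 'best' noise variable, before and after."""
    rng = random.Random(seed)
    derivation_total = 0.0
    validation_total = 0.0
    for _ in range(n_trials):
        # Dredge: keep the candidate that looks best in the derivation cohort.
        best = max(noise_sensitivity(rng, n_derivation)
                   for _ in range(n_candidates))
        # In a new cohort the chosen variable is still nothing but noise.
        fresh = noise_sensitivity(rng, n_validation)
        derivation_total += best
        validation_total += fresh
    return derivation_total / n_trials, validation_total / n_trials


derived, validated = simulate()
print(f"mean sensitivity in derivation: {derived:.2f}")
print(f"mean sensitivity in validation: {validated:.2f}")
```

The selected variable looks impressive in the cohort it was chosen from and falls back toward a coin flip in the new one, even though nothing about the variable changed.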

The Validation…

“Clinical Decision Rules to Rule Out Subarachnoid Hemorrhage for Acute Headache” (1) is an attempt to validate three rules for clinical use. The authors enrolled 2131 patients with sudden-onset headaches (defined as reaching maximum intensity in under one hour). Of these, 132 (6.2%) had a SAH, defined by any one of the following: subarachnoid blood on unenhanced CT of the head; xanthochromia in the cerebrospinal fluid; or red blood cells (>1 × 10^6/L) in the final tube of cerebrospinal fluid, with an aneurysm or arteriovenous malformation on cerebral angiography.

All three rules (Fig. 2) performed relatively well considering the complexity of the diagnosis, which is a testament to the quality of their craftsmanship. Rules 1, 2 and 3 had sensitivities/specificities of 98.5%/26.7%, 95.5%/30.6%, and 97.0%/35.6% respectively. The authors felt that none of these were sensitive enough for clinical use, so in an attempt to build a rule with 100% sensitivity they did the unthinkable… They dredged their data.

Again using recursive partitioning, the authors derived a new rule which they officially called the “Ottawa SAH rule”. After all, the first step in building any good decision rule is to come up with a catchy name. The Ottawa SAH rule (OSR) retained all the variables from Rule 1 and added thunderclap headache (defined as peak onset within 1 second) and limited neck flexion on examination. Using this new rule they were able to identify a population who could be safely excluded from having a SAH. Unfortunately this dropped the rule’s specificity down to 15.3%. This “Franken-rule” was assembled from the remnants of the original three rules, which were deconstructed for not performing up to par. In a sense, the validation cohort has become the derivation cohort from which this new rule was built. In order for the OSR to be used clinically it now has to be validated in a new cohort. The authors attempt to address these concerns by performing a bootstrap analysis. Simply put, they randomly sampled patients from their original derivation cohort (n=1999) and retrospectively tested the performance of the OSR in this “new sample”. After repeating this process 1000 times they deemed the OSR stable and valid. Though bootstrapping is a valid and elegant way of testing a rule’s stability, it is prone to overestimating a rule’s performance and is not as strong a method of validation as one done on an independent sample (3).
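For readers who have not met bootstrapping before, here is a minimal sketch of the resampling idea. It is not the authors' analysis: the function name is made up, and the inputs simply reuse figures quoted in this post (132 SAH cases, with Rule 1 missing 2 of them) as illustrative data. Each resample draws the same number of patients with replacement from the original sample and recomputes sensitivity.

```python
import random


def bootstrap_sensitivities(flagged, n_resamples=1000, seed=0):
    """Return one sensitivity per bootstrap resample.

    flagged: one boolean per patient WITH the disease,
             True if the rule was positive for that patient.
    """
    rng = random.Random(seed)
    n = len(flagged)
    results = []
    for _ in range(n_resamples):
        # Draw n patients with replacement from the original sample.
        resample = [flagged[rng.randrange(n)] for _ in range(n)]
        results.append(sum(resample) / n)
    return results


# Illustrative inputs only.
rule_without_misses = [True] * 132
rule_with_two_misses = [True] * 130 + [False] * 2

perfect = bootstrap_sensitivities(rule_without_misses)
imperfect = bootstrap_sensitivities(rule_with_two_misses)

print("no misses in sample: ", min(perfect), "to", max(perfect))
print("two misses in sample:", round(min(imperfect), 3), "to", max(imperfect))
```

A rule that missed nobody in the original sample scores 100% in every resample, because a resample can never contain a miss that the original sample lacks. That is one way to see why bootstrapping tends to flatter a rule compared with testing it on genuinely new patients.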

The more important question is not how the rule performs independently but rather how it performs when compared to the clinician’s unstructured decision-making (5). There is a history of well-derived rules with excellent test characteristics that, when used clinically, paradoxically increase utilization of resources (8). In this cohort the physicians using unstructured judgment did a reasonably good job of identifying whom to work up for SAH. The rate of patients who underwent some form of diagnostic workup was 84.3%. Though their follow-up was not perfect, it does not appear that any patients not scanned or tapped returned with a ruptured SAH. Assuming this was the case, unstructured judgment had a sensitivity equal to that of the OSR and an equivalent specificity of 15.7%. Ironically, the decision rule that may be the most clinically relevant is Rule 1. Overall, Rule 1 missed 2 out of 132 SAHs. One of these two missed cases was non-aneurysmal, had no surgical management and did fine. If we agree not to concern ourselves with these types of SAH, then Rule 1’s sensitivity becomes 99.2%. We will miss less than one SAH for every 100 patients cleared with this rule, which is actually under the test threshold of 1% that David Newman proposed on his podcast, Smart EM, at ACEP 2012 and in print in the December 2012 issue of EP Monthly (6). In these various forms of media Dr. Newman elegantly describes the harm that will occur from over-diagnosis and treatment when the diagnostic pathway to exclude SAH is initiated too frequently. He concludes that once you have reached a risk of less than 1%, more patients will be harmed than helped by further testing (6). If we agree that this is an acceptable threshold, and seemingly many physicians do (7), then Rule 1 will risk stratify our patients below this test threshold and may outperform the Emergency Physician’s unstructured decision making.
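The Rule 1 arithmetic can be checked with a short sketch. The sensitivities come straight from the counts above. The miss rate among cleared patients is a rough estimate that is not stated in the paper as quoted here: it assumes the reported Rule 1 specificity of 26.7% applies to all 1999 patients without SAH (2131 enrolled minus 132 cases), and the variable names are made up for the example.

```python
TOTAL_ENROLLED = 2131
SAH_CASES = 132
NON_SAH = TOTAL_ENROLLED - SAH_CASES  # 1999 patients without SAH

RULE1_MISSES = 2
RULE1_SPECIFICITY = 0.267  # as quoted in this post


def sensitivity(missed, diseased=SAH_CASES):
    """Fraction of diseased patients the rule flags."""
    return (diseased - missed) / diseased


sens_reported = sensitivity(RULE1_MISSES)      # both misses counted
sens_adjusted = sensitivity(RULE1_MISSES - 1)  # ignoring the benign miss

# Assumption: true negatives estimated from the quoted specificity.
true_negatives = round(RULE1_SPECIFICITY * NON_SAH)
cleared = true_negatives + RULE1_MISSES
miss_rate_among_cleared = RULE1_MISSES / cleared

print(f"sensitivity, both misses counted: {sens_reported:.1%}")
print(f"sensitivity, benign miss ignored: {sens_adjusted:.1%}")
print(f"estimated miss rate among cleared patients: {miss_rate_among_cleared:.2%}")
```

Strictly speaking, the 1% test threshold is a statement about risk among patients the rule clears rather than about sensitivity, so the last figure is the one to compare against it. Under these assumptions it also comes in below 1%.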

Unfortunately, one final step is necessary before Rule 1 is ready for clinical use. In this validation cohort the authors attempted to validate three rules, and in doing so they may in fact have weakened their results. The more instruments that are examined, the greater the likelihood that one will perform well by chance alone. It is the equivalent of having three primary endpoints for a trial and claiming to have rejected the null hypothesis when one of these endpoints reaches statistical significance. It may be that in the next cohort examined Rule 2 outperforms its competitors. If we are going to use Rule 1 in the clinical arena then it should be independently validated in a separate cohort to ensure it consistently risk stratifies patients below the test threshold.
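The three-endpoint analogy can be put in numbers with a quick sketch. It assumes the comparisons are independent and uses the conventional significance level of 0.05, neither of which comes from the paper. Three rules tested on the same patients are certainly not independent, so treat this as an illustration of the direction of the effect rather than its size.

```python
def chance_of_any_false_positive(alpha, n_comparisons):
    """Probability that at least one of n independent tests is
    'significant' by chance alone when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_comparisons


for k in (1, 2, 3):
    risk = chance_of_any_false_positive(0.05, k)
    print(f"{k} endpoint(s): {risk:.1%} chance of a spurious win")
```

With three independent looks, the chance of at least one spurious win rises from 5% to roughly 14%.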

Decision rules for answering simple questions, when built for stability and utility, are wonderful tools that stand the test of time. The more complex the question one attempts to answer using a decision rule, the more complex the rule being built becomes. As the complexity of a rule grows, the inherent weakness of the derivation process becomes more and more apparent. In these cases the rule is based more on random chance than on reality and is destined to regress towards mediocrity when attempts are made to validate its efficacy. It is important, before attempting such a rule, to define our expectations. There must be an acceptable miss rate in any decision rule. Striving for perfection will only lead to complex rules that are overfitted and unwieldy, and that will never perform effectively when applied to the general population.

Sources Cited:

1. Perry JJ, Stiell IG, Sivilotti MA, et al. Clinical Decision Rules to Rule Out Subarachnoid Hemorrhage for Acute Headache. JAMA. 2013;310(12):1248-1255.
2. Perry JJ, Stiell IG, Sivilotti ML, et al. High-risk clinical characteristics for subarachnoid haemorrhage in patients with acute headache: prospective cohort study. BMJ. 2010.
3. Schriger DL. Some Thoughts on the Stability of Decision Rules. Annals of Emergency Medicine. March 2007 (Vol. 49, Issue 3, Pages 333-334).
4. Lewis RJ. An Introduction to Classification and Regression Tree (CART) Analysis. Presented at the 2000 Annual Meeting of the Society for Academic Emergency Medicine in San Francisco, California.
5. Schriger DL, Newman DH. Medical Decisionmaking: Let’s Not Forget the Physician. Annals of Emergency Medicine. March 2012 (Vol. 59, Issue 3, Pages 219-220).
6. Newman D. LP for Subarachnoid Hemorrhage: The 700 Club. EP Monthly. December 4, 2012.
7. Perry JJ, Eagles D, Clement CM, et al. An international study of emergency physicians’ practice for acute headache management and the need for a clinical decision rule. CJEM. 2009;11(6):516-522.
8. Stiell IG, et al. A prospective cluster-randomized trial to implement the Canadian CT Head Rule in emergency departments. CMAJ. October 5, 2010;182:1527-1532; published ahead of print August 23, 2010.