I'm a bit skeptical of the value of screening toxicology patients by measuring an osmolar gap. So I looked for literature about the test's performance. Lynd 2008 found that an elevated osmolar gap had a sensitivity of 90% and a specificity of 22% (1). That's about what I might have guessed – the test is fairly sensitive and very poorly specific (it can be elevated due to a panoply of substances). So, it's probably a reasonable screening test to rule-out toxic alcohol ingestion, but not as a diagnostic test to rule-in toxic alcohol ingestion.
Wrong. Based on 90% sensitivity and 22% specificity, the test has a positive likelihood ratio (+LR) of 1.15 and a negative likelihood ratio (-LR) of 0.45. These are weak likelihood ratios, of little help clinically. For example, suppose that we're trying to use osmolar gap to exclude toxic alcohol ingestion in a patient with a 10% pre-test probability. If the osmolar gap is normal, then the post-test probability decreases to 5%, which doesn't really rule this out.
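For anyone who wants to check the arithmetic, here's a quick sketch in Python using Bayes' rule in odds form (the sensitivity and specificity are the Lynd 2008 figures quoted above):

```python
# Likelihood ratios from the reported sensitivity and specificity
sens, spec = 0.90, 0.22          # osmolar gap figures from Lynd 2008
pos_lr = sens / (1 - spec)       # ~1.15
neg_lr = (1 - sens) / spec       # ~0.45

# Post-test probability after a normal osmolar gap, starting from a
# 10% pre-test probability (Bayes' rule in odds form)
pretest = 0.10
pretest_odds = pretest / (1 - pretest)
posttest_odds = pretest_odds * neg_lr
posttest = posttest_odds / (1 + posttest_odds)

print(round(pos_lr, 2), round(neg_lr, 2), round(posttest, 2))  # 1.15 0.45 0.05
```

A post-test probability of roughly 5% is exactly the "doesn't really rule this out" problem described above.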
Thus, looking at the sensitivity and specificity (90% & 22%) gives us an entirely different conceptualization of the test compared to looking at the positive and negative likelihood ratios (1.15 & 0.45).
Example #2: 90% sensitivity & 10% specificity
For the sake of argument, let's imagine that a diagnostic test had a sensitivity of 90% and a specificity of 10%. This may seem silly, but it's not that much different from the osmolar gap data above. What is the utility of this test?
You might be tempted to say that the test has a reasonable sensitivity, so it could be used to rule-out disease. Nope. In fact, this test has a positive likelihood ratio of 1 and a negative likelihood ratio of 1. The test is completely worthless.
The figure below illustrates why this happens. Let's imagine that a woman has a 50-50 pre-test probability of having a disease. The test result is negative. However, a negative test result is equally likely whether or not she has the disease. Therefore, her post-test probability is unchanged at 50%.
Taking this a step further, any test where the % sensitivity plus the % specificity adds up to 100% will be a worthless test (+LR = -LR = 1). In this situation, either test result (positive or negative) is equally likely whether or not the patient has the disease. For example, a test with 75% sensitivity and 25% specificity is worthless. We will return to this concept of bullshit tests later on.
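Here's a quick numerical check of this claim, a sketch of my own:

```python
def likelihood_ratios(sens, spec):
    """Return (+LR, -LR) for a test with the given sensitivity and specificity."""
    return sens / (1 - spec), (1 - sens) / spec

# Any sensitivity/specificity pair summing to 100% sits on the bullshit line:
# both likelihood ratios come out to 1 (floating-point rounding aside)
for sens in (0.90, 0.75, 0.50):
    spec = 1 - sens
    pos_lr, neg_lr = likelihood_ratios(sens, spec)
    print(sens, spec, round(pos_lr, 6), round(neg_lr, 6))
```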
Mythbusting SPin & SNout
We've been taught in statistics that the sensitivity of a test determines its ability to rule-out disease, whereas the specificity of a test determines its ability to rule-in disease:
This is often taught with the mnemonic SPin and SNout (for SPecificity-rule-IN, SeNsitivity-rule-OUT). This concept is so widespread that the mnemonic is popular even in countries where English isn't the primary language. Unfortunately, this is wrong.
- Sensitivity is the probability of a positive test within a group of patients who have the disease.
- Specificity is the probability of a negative test within a group of patients who don't have the disease.
Thus, sensitivity & specificity predict whether the test will be positive, given that the patient either has or doesn't have the disease. This is the exact opposite of what we're interested in. What we need to know is whether the patient has the disease, given that the test is either positive or negative. Translating information about the test into information about the patient requires a likelihood ratio:
In short, sensitivity/specificity are more test-centered, whereas likelihood ratios are more patient-centered. Fortunately, it's easy to calculate likelihood ratios from sensitivity and specificity using the following formulas:

+LR = sensitivity / (1 − specificity)

-LR = (1 − sensitivity) / specificity
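One way to sanity-check the relationship between sensitivity/specificity and likelihood ratios is with a hypothetical 2x2 table (the counts below are invented, chosen to reproduce the 90%/22% figures from above):

```python
# Hypothetical 2x2 table (invented counts, chosen to match 90% sensitivity
# and 22% specificity)
tp, fn = 90, 10     # 100 patients with disease
fp, tn = 78, 22     # 100 patients without disease

sens = tp / (tp + fn)            # 0.90
spec = tn / (fp + tn)            # 0.22

# +LR is the ratio of positive-test rates: P(T+|disease) / P(T+|no disease)
pos_lr = sens / (1 - spec)
assert abs(pos_lr - (tp / (tp + fn)) / (fp / (fp + tn))) < 1e-12

# -LR is the ratio of negative-test rates: P(T-|disease) / P(T-|no disease)
neg_lr = (1 - sens) / spec
assert abs(neg_lr - (fn / (tp + fn)) / (tn / (fp + tn))) < 1e-12

print(round(pos_lr, 2), round(neg_lr, 2))  # 1.15 0.45
```

The asserts make the point explicit: each likelihood ratio compares how often a given result occurs in the diseased versus non-diseased columns of the table.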
Likelihood ratios measure the test's true ability to rule-in (+LR) or rule-out (-LR) disease. From the equations above, it's clear that sensitivity and specificity each affect both the +LR and the -LR:
Consider the ability to rule-out disease: it depends on both sensitivity and specificity. Sensitivity matters a bit more, but not by much. Even if the sensitivity is good (say, 90%), a sufficiently bad specificity (say, 10%) sabotages the test's ability to rule-out disease.
For further proof that the SNout/SPin paradigm doesn't work, consider the following two diagnostic tests:
- Test #1: Sensitivity 90%, Specificity 70%
- Test #2: Sensitivity 30%, Specificity 90%
Which of these tests, if positive, provides stronger evidence that the patient has the disease? According to SPin, Test #2 has a higher specificity, and therefore it should be better at ruling-in disease. Wrong. Both tests have a positive likelihood ratio of 3. If either test is positive, it has exactly the same effect on the post-test probability.
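A quick sketch confirming this (function names are my own):

```python
def positive_lr(sens, spec):
    return sens / (1 - spec)

def posttest_prob(pretest, lr):
    # Bayes' rule in odds form
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

lr1 = positive_lr(0.90, 0.70)   # Test #1
lr2 = positive_lr(0.30, 0.90)   # Test #2
print(round(lr1, 6), round(lr2, 6))   # both are 3.0

# From the same pre-test probability (say 20%), a positive result on either
# test lands on the same post-test probability (~43%)
print(round(posttest_prob(0.20, lr1), 3), round(posttest_prob(0.20, lr2), 3))
```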
Gross anatomy of diagnostic tests
Likelihood ratios may be more useful clinically, but sensitivity and specificity are more widely reported. Therefore, it is useful to gain a general understanding of how sensitivity and specificity translate into likelihood ratios.
This requires defining some general cutoff values for likelihood ratios. These are admittedly arbitrary, but they will give us some rough boundaries to work with (2):
- Weak impact: +LR between 1 and 3, or -LR between 1/3 and 1
- Moderate impact: +LR between 3 and 10, or -LR between 1/10 and 1/3
- Strong impact: +LR >10, or -LR <1/10
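As a sketch, these cutoffs can be expressed as a small classifier (the function, and how it handles the exact boundary values, are my own construction):

```python
def lr_impact(lr):
    """Classify a likelihood ratio using the cutoffs above.
    -LRs (below 1) are folded onto the same scale, e.g. 1/3 -> 3."""
    if lr <= 0:
        raise ValueError("likelihood ratios must be positive")
    magnitude = lr if lr >= 1 else 1 / lr
    if magnitude > 10:
        return "strong"
    if magnitude > 3:
        return "moderate"
    if magnitude > 1:
        return "weak"
    return "worthless"   # an LR of exactly 1 doesn't shift probability at all

print(lr_impact(15))    # strong
print(lr_impact(0.45))  # weak (0.45 folds to ~2.2)
print(lr_impact(1.0))   # worthless
```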
This allows us to map out the relationship between sensitivity/specificity and likelihood ratios:
An interesting phenomenon occurs along the bullshit line. Any test lying along this line will have a positive and negative likelihood ratio of one, making it completely worthless. This phenomenon was explored earlier in example #2.
Below the bullshit line (in Zone #10), something horrifying happens: tests actually become misleading. For example, let's consider a test with a sensitivity of 80% and a specificity of 10%. This test has a positive likelihood ratio of 0.89 and a negative likelihood ratio of 2. That's right: if the test is positive, it decreases the likelihood of disease. If the test is negative, it increases the likelihood of disease. This test isn't just worthless; it's actively misleading.
How is this possible? Let's consider a man with a 50-50 pre-test probability of disease who undergoes a diagnostic test with 80% sensitivity and 10% specificity (figure below). The test comes back negative. A negative result is actually more likely if he has the disease (20%) than if he's disease-free (10%). Therefore, a negative test result will increase his probability of having disease from 50% to 67% (3).
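The same odds-form arithmetic reproduces this inversion (a sketch using the numbers above):

```python
sens, spec = 0.80, 0.10
neg_lr = (1 - sens) / spec       # = 2: a negative result doubles the odds of disease
pos_lr = sens / (1 - spec)       # ~0.89: a positive result lowers the odds

# 50-50 pre-test probability, negative result (Bayes' rule in odds form)
pretest = 0.50
odds = pretest / (1 - pretest) * neg_lr
posttest = odds / (1 + odds)
print(round(posttest, 2))        # 0.67: the negative test RAISED the probability
```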
It's eye-opening to realize how many tests with reasonable-appearing sensitivity or specificity will end up being worthless or weak. The figure below classifies tests into four groups depending on the strongest level of evidence that they can provide:
Only tests falling along the very edge of this plot (green area) are capable of providing strong evidence. Most tests will end up providing misleading or weak results, which often occurs despite having a specificity or sensitivity >80%.
- Sensitivity/specificity reveal information about test performance, whereas likelihood ratios reveal information about the significance of a test result for an individual patient. As such, likelihood ratios are a more clinically relevant and patient-centered way to understand diagnostic tests.
- It is widely believed that the sensitivity of a test drives its ability to rule-out disease, whereas the specificity of a test drives its ability to rule-in disease. This is incorrect. Both sensitivity and specificity are jointly involved in the ability of a test to rule-in (+LR) or rule-out (-LR) disease.
- Despite having a high sensitivity (e.g. 90%), a low specificity (e.g. 10%) may destroy the value of the test, rendering it completely meaningless.
- It's possible to map likelihood ratios onto a graph of sensitivity vs. specificity, creating a visual representation of how sensitivity interacts with specificity to affect test performance (figure below). This illustrates how it is possible for tests with a high sensitivity or specificity (e.g. >80%) to be non-diagnostic or even misleading.
- There's a lot more to this paper and the question of osmolar gap, but this is a post about statistics. We'll get back to the osmolar gap issue later.
- These cutoffs are based on those used by Steven McGee in his landmark book Evidence-Based Physical Diagnosis (Second Edition). He argued that a cutoff of 3 or 1/3 marks the boundary of clinical usefulness, because an LR of this size will often shift the probability of disease by ~20%. This is a bit arbitrary. Regardless, I needed some sort of cutoff for this post and didn't want to pick one out of thin air, so I used McGee's. In reality, the best way to apply likelihood ratios to an individual patient is to consider the patient's pre-test probability of the disease, the likelihood ratio, the test threshold for the disease (the probability below which further testing is nonbeneficial), and the treatment threshold of the disease (the probability above which treatment is indicated and further testing isn't likely to change management).
- In practice, a test should never fall into Zone #10. It would take a considerable oversight to allow this to happen, because the likelihood of a positive result is greater in patients without disease. If this scenario did occur, the test should either be abandoned or the meaning of a positive result reversed, which would move the test out of Zone #10 (probably into Zone #9). For example, in the case described here you could re-define the test so that a "negative" result indicated the presence of disease. Re-defining the test in this way would give it a sensitivity of 20% and a specificity of 90%, with a +LR of 2 and a -LR of 0.89.
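The threshold framework described in the footnotes above can be sketched roughly as follows; the threshold values and the function itself are hypothetical illustrations, not clinical recommendations:

```python
def next_step(pretest, lr, test_threshold=0.02, treatment_threshold=0.30):
    """Combine pre-test probability with a likelihood ratio, then compare the
    post-test probability to the test and treatment thresholds.
    The default threshold values are invented for illustration."""
    odds = pretest / (1 - pretest) * lr
    posttest = odds / (1 + odds)
    if posttest < test_threshold:
        return posttest, "below test threshold: stop testing"
    if posttest > treatment_threshold:
        return posttest, "above treatment threshold: treat"
    return posttest, "intermediate: further testing may be needed"

# e.g. 10% pre-test probability and a negative test with -LR = 0.45:
prob, action = next_step(0.10, 0.45)
print(round(prob, 2), action)   # ~5%: still above a 2% test threshold
```

With weak likelihood ratios like the osmolar gap's, the post-test probability rarely crosses either threshold, which is the whole problem.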