#### Example #1: Osmolar Gap

I'm a bit skeptical of the value of screening toxicology patients by measuring an osmolar gap. So I looked for literature about the test's performance. Lynd 2008 found that an elevated osmolar gap had a sensitivity of 90% and a specificity of 22% (1). That's about what I might have guessed – the test is fairly sensitive and very poorly specific (it can be elevated due to a panoply of substances). So, it's probably a reasonable *screening* test to rule-out toxic alcohol ingestion, but not as a *diagnostic* test to rule-in toxic alcohol ingestion.

Wrong. Based on 90% sensitivity and 22% specificity, the test has a positive likelihood ratio (+LR) of 1.15 and a negative likelihood ratio (-LR) of 0.45. These are weak likelihood ratios, of little help clinically. For example, suppose that we're trying to use osmolar gap to exclude toxic alcohol ingestion in a patient with a 10% pre-test probability. If the osmolar gap is normal, then the post-test probability decreases to 5%, which doesn't really rule this out.

Thus, looking at the sensitivity and specificity (90% & 22%) gives us an entirely different conceptualization of the test compared to looking at the positive and negative likelihood ratios (1.15 & 0.45).

#### Example #2: 90% sensitivity & 10% specificity

For the sake of argument, let's imagine that a diagnostic test had a sensitivity of 90% and a specificity of 10%. This may seem silly, but it's not that much different from the osmolar gap data above. What is the utility of this test?

You might be tempted to say that the test has a reasonable sensitivity, so it could be used to rule-out disease. Nope. In fact, this test has a positive likelihood ratio of 1 and a negative likelihood ratio of 1. The test is *completely worthless*.

The figure below illustrates why this happens. Let's imagine that a woman has a 50-50 pre-test probability of having a disease. The test result is negative. However, a negative test result is *equally likely* whether or not she has the disease. Therefore, her post-test probability is unchanged at 50%.

Taking this a step farther, any test where the *% sensitivity* plus the *% specificity* add up to 100% will be a worthless test (+LR = -LR = 1). In this situation, either test result (positive or negative) has the same probability of occurring whether or not the patient has the disease. For example, a test with 75% sensitivity and 25% specificity is worthless. We will return to this concept of bullshit tests later on.

### Mythbusting SPin & SNout

We've been taught in statistics that the *sensitivity* of a test determines its ability to *rule-out* disease, whereas the *specificity* of a test determines its ability to *rule-in* disease:

This is often taught with the mnemonic SPin and SNout (for SPecificity-rule-IN, SeNsitivity-rule-OUT). This concept is so widespread that the mnemonic is even popular in countries where English isn't the primary language. Unfortunately this is wrong.

- Sensitivity is the probability of a positive test within a group of patients who have the disease.
- Specificity is the probability of a negative test within a group of patients who don't have the disease.

Thus, sensitivity & specificity predicts whether the *test* will be positive, given that the patient either has or doesn't have the disease. This is the exact *opposite* of what we're interested in. What we need to know is whether the *patient* has the disease, given that the test is either positive or negative. Translating information from the test to information about the patient requires a likelihood ratio:

In short, sensitivity/specificity are more test-centered, whereas likelihood ratios are more patient-centered. Fortunately, it's easy to calculate likelihood ratios from sensitivity and specificity using the following formulas:

Likelihood ratios measure the test's true ability to rule-in (+LR) or rule-out (-LR) disease. From the equations above, it's clear that sensitivity and specificity *both* have effects on *both* the +LR and the -LR:

For example, the ability to rule-out disease is a function of *both* sensitivity *and* specificity. Sensitivity is a bit more important, but not by far. For example, even if the sensitivity is good (say, 90%), if the specificity is bad enough (say, 10%) then the poor specificity sabotages the ability of the test to rule-out disease.

For further proof that the SNout/SPin paradigm doesn't work, consider the following two diagnostic tests:

- Test #1: Sensitivity 90%, Specificity 70%
- Test #2: Sensitivity 30%, Specificity 90%

Which of these tests, if positive, provides stronger evidence that the patient has the disease? According to SPin, Test #2 has a higher specificity, and therefore it should be better at ruling-in disease. Wrong. Both tests have a positive likelihood ratio of 3. If either test is positive, it has *exactly the same* effect on the post-test probability.

### Gross anatomy of diagnostic tests

Likelihood ratios may be more useful clinically, but sensitivity and specificity are more widely reported. Therefore, it is useful to gain a general understanding of how sensitivity and specificity translate into likelihood ratios.

This requires defining some general cutoff values for likelihood ratios. These are admittedly arbitrary, but they will give us some rough boundaries to work with (2):

- Weak impact: +LR between 1-3 or a -LR between 1-1/3
- Moderate impact: +LR between 3-10 or a -LR between 1/3-1/10
- Strong impact: +LR >10 or a -LR <1/10

This allows us to map out the relationship between sensitivity/specificity and likelihood ratios:

An interesting phenomenon occurs along the bullshit line. Any test lying along this line will have a positive and negative likelihood ratio of one, making it completely worthless. This phenomenon was explored earlier in example #2.

Below the bullshit line (in Category #10) something horrifying happens: tests actually become *misleading*. For example, let's consider a test with a sensitivity of 80% and a specificity of 10%. This test has a *positive* likelihood ratio of 0.89 and a *negative* likelihood ratio of 2. That's right: if the test is positive, it *decreases* the likelihood of disease. If the test is negative it *increases* the likelihood of disease. This test isn't just worthless; it's actively misleading.

How is this possible? Let's consider a man with a 50-50 pre-test probability of disease who undergoes a diagnostic test with 80% sensitivity and 10% specificity (figure below). The test comes back negative. It is actually more likely to get a negative test result if he has the disease (20%) than if he's disease-free (10%). Therefore, a *negative* test result will *increase* his probability of having disease from 50% to 67% (3).

Returning to our initial example of the osmolar gap, with a sensitivity of 90% and specificity of 22% it will fall into a weak impact zone:

It's eye-opening to realize how many tests with reasonable-appearing sensitivity or specificity will end up being worthless or weak. The figure below classifies tests into four groups depending on the strongest level of evidence that they can provide:

Only tests falling along the *very edge* of this plot (green area) are capable of providing strong evidence. Most tests will end up providing misleading or weak results, which often occurs despite having a specificity or sensitivity >80%.

- Sensitivity/specificity reveal information about test performance, whereas likelihood ratios reveal information about the significance of a test result for an individual patient. As such, likelihood ratios are a more clinically relevant and patient-centered way to understand diagnostic tests.
- It is widely believed that the sensitivity of a test drives its ability to rule-out disease, whereas the specificity of a test drives its ability to rule-in disease. This is incorrect.
*Both*sensitivity*and*specificity are*jointly*involved in the ability of a test to rule-in (+LR) or rule-out (-LR) disease. - Despite having a high sensitivity (e.g. 90%), a low specificity (e.g. 10%) may destroy the value of the test, rendering it completely meaningless.
- It's possible to map likelihood ratios onto a graph of sensitivity vs. specificity, creating a visual representation of how sensitivity interacts with specificity to affect test performance (figure below). This illustrates how it is possible for tests with a high sensitivity or specificity (e.g. >80%) to be non-diagnostic or even misleading.

##### Notes

- There's a
*lot*more to this paper and the question of osmolar gap, but this is a post about statistics. We'll get back to the osmolar gap issue later. - These cutoffs are based on cutoffs which Steven McGee used in his landmark book on Evidence Based Physical Diagnosis (Second Edition). He argued that a cutoff of 3 or 1/3 was the boundary of clinical usefulness, because this would often shift the probability of disease by ~20%. This is a bit arbitrary. Regardless, I needed some sort of cutoff for this post and I didn't want to choose them out of thin air, so I used McGee's cutoffs. In reality, the best way to apply likelihood ratios with an individual patient is to consider the patient's pre-test probability of the disease, the likelihood ratio, the
*test threshold*for the disease (the probability below which further testing is nonbeneficial), and the*treatment threshold*of the disease (the probability above which treatment is indicated and further testing isn't likely to change management). - In practice, a test should never fall into Zone #10. It would take a considerable amount of oversight to allow this to happen, because the likelihood of a positive result is
*greater*in patients without disease. If this scenario did occur, the test should be abandoned or the significance of a positive test result*reversed*, which would move the test out of Zone #10 probably into Zone #9. For example, in the case described here you could re-define the test so that a “negative” test was used to reveal the presence of disease. Re-defining the test in this way would give the redefined test a sensitivity of 20% and a specificity of 90%, with a +LR of 2 and a -LR of 0.89.

### Josh Farkas

#### Latest posts by Josh Farkas (see all)

- PulmCrit:Algorithm for diagnosing ICP elevation with ocular sonography - October 9, 2017
- PulmCrit:Large-bore suction for intubation: strategies & devices - September 25, 2017
- PulmCrit:The ketamine-tolerant patient - September 11, 2017

## Comment Here

33 Comments on "PulmCrit – Mythbusting sensitivity and specificity"

It is not possible to have a sensitivity or specificity less than 50%. If the calculated specificity or sensitivity is less than 50% then the test should be interpreted in the opposite direction and the spec/sens recalculated as (100-calculated). Similarly, it is not possible to have a LR+ less than 1 nor a LR- greater than 1.

Haha exactly what i was coming to say!

If one has a Se=.01 and Sp=0.1, the test is hardly in the “bullshit zone”, its actually a phenomenal test that needs the signs flipped.

The opposite of a LR of 12 isnt a LR of .01 (or -5…), its a LR of 1.

please see footnote #3.

it is possible to have a sensitivity & specificity <50%

you have to be dumb to design a test this way, but it is mathematically possible

my next posting (july 4) will feature a commonly used physical exam sign which is useless when used as typically recommended, but can be flipped around to have an inverse significance which is useful.

The accuracy of a lots of tests depend on the point at which they are ordered during the course of the disease e.g. troponin, rheumatoid factor.

Furthermore evaluating test performance requires a comparative gold standard that ultimately distinguishes ‘disease’ from ‘non-disease’. I am not sure all tests that we routinely use have been evaluated under those standards.

Correction

My regrets. I meant to say mature neutrophils MARGINATE (adhere to the wall) and the leave the circulation better than immature neutrophils (bands).

Another important point is that the immature granulocytes from automated counters are myelocyes and more left and does not include bands and is therefore NOT a substitute for bands.

One more correction (sry)

Immature granulocytes (the IG on automated counters ) are Metamyelocytes and left.

Ack! Apologies for my mistakes:

* negative likelihood ratio -> LOW negative likelihood ratio

True positive given test: 100 patients

False positive given test: 1000 patients

True negative given test: 100,000 patients

False negative given test: 10 patients

Sensitivity = 100/(100+10) = 0.909

Specificity = 100,000/(100,000+1000) = 0.990

+LR = 0.909/(1-0.990) = 90.9

-LR = (1-0.909)/0.990 = 0.092

However:

Precision = 100/(100+1000) = 0.091 is very low

Ah, I see. Thank you for patiently and politely explaining everything for me!

Very nice. Thank you.

Here is Paulker’s seminal reference for threshold decision making (threshold science). This combined with the science of the ROC and the LR revolutionized testing,

https://www.ncbi.nlm.nih.gov/m/pubmed/7366635/?i=11&from=pauker%20%22threshold%20decision%20making%22

However, like most new enlightening scientific concepts the ROC and threshold science have been generalized as laws of nature and gained the staus of dogma. This reference shows why such concepts are too linear for many critical care and ER testing situations and become pitfalls when appied as dogma in the study or bedside care of the more complex dynamic relational conditions like sepsis

https://www.ncbi.nlm.nih.gov/m/pubmed/24834126/?i=4&from=lynn%20la

Daniel Kahnemann in Fast Thinking/Slow thinking that a necessary condition for expert intuition to become reliable is if events are sufficiently predictable such that patterns can emerge.

Are you arguing that sepsis is sufficiently unpredictable that in fact neither a simple score (SOFA) or an expert (such as yourself) can reliably respond to it in a rational fashion – nor can adequately explain the phenomena with a sufficient number of expert-derived heuristics.

‘Dynamic diagnostic relationism’

Is that just a fancy way of saying that the value of a test is dependent on where you think the course of the illness (untreated or treated) is at the time you ordered it?

I suspect most clinicians would be cognisant of that principle whether they were involved in acute or chronic medicine.

Yeah, absolutely. So the pre-test probability balances the target classes (probability of not having disease isn’t overwhelming probability of having disease) probably based on the patient’s history or symptoms or previous tests or something, which makes the -/+LR approach valid since the target classes are more balanced. As Smith pointed out, that approach is appropriate in a hospital setting where the physician determined that the patient has a decent likelihood of having the disease, but not in screening for disease in a low-prevalence population, correct?

I learned today that precision is referred to as PPV in medicine, ha ha.

For a single test in isolation, I would agree that a +/- LR is only useful if there’s a reasonable pretest probability, as described would be frequently the case in a hospital setting. However tests can also be chained together using the first test’s posttest probability for the next test’s pretest probability (this does make the assumption the tests are 100% independent of one another).

For screening, a disease (I don’t have a good example off the top of my head) might have a very low risk/cost test and a higher risk/cost test. The first could be used for screening – if negative the -LR and starting relatively low probability are good enough to perform no further testing as the disease is extremely unlikely. If positive, the +LR probably is not enough to diagnose the disease, but the probability of disease is then high enough that it is worth performing the higher risk/cost test which would then be used to diagnose or rule out disease.

Sounds reasonable. The only situation where -LR might mislead would be if the population’s disease prevalence was really high, but that seems unlikely to happen, ha ha.

Then it may very well be possible you have reached the decision threshold to treat or not treat without needing to order the test at all!

What a wild thought. Clinical medicine.

Thanks Josh,

Your graph was an elegant way of demonstrating the graphical relationship between sensitivity, specificity, likelihood ratios and the ROC curve.

I always had a sneaking suspicion that the SpIN and SnOUT mnemonic didn’t always hold true but hadn’t paid to much thought to the mathematics of it all.

However, it is now obvious from surveying your graph that the conditions to which these ‘rules’ apply is when the complementary parameter is >90%. SpIN is true if sensitivity is >90%. SnOUT is true if specificity > 90%.

It breaks down considerably when either sensitivity or specificity of the test is <75%. This and your sens+spec < 1 rule should weed out the more useless tests.

The caveat to using Likelihood ratios or Predictive Values to making clinical decision is having some feel of the original pre-test probability. I always remind juniors that before launching into a test that it is always important to start of with a good clinical evaluation and risk assessment.

GIGO

I’ve realised I made an error in reading your graph. SpIN is true (for all values of sensitivity) as long as Specificity > 90%. SnOUT is true as long (for all values of specificity) as long as Sensitivity is >90%.

Except when the patient’s pre-test probability of having the disease is low for SpIN. 😉

I think you are describing positive predictive value

Beyond the mathematics of how information interacts to provide us estimates of the probability of the presence or absence of disease, the decision to act on it is also dependent on the consequences of intervening or not e.g. missing an unstable C-spine fracture is different from missing a fractured nose.

I suspect most variability in clinical practice amongst experts isn’t due to a gross miscalculation of the probabilities but the degree to which the individual clinician perceives as an acceptable level of risk or benefit. e.g. stroke lysis

Very interesting and well said. However, I think it falls into the same reductionist trap that most of these discussions find. Almost no test has much decision-making value alone, except for some of the real decerebrate powerhouses (CT scans and the like). Most of the meaningful diagnoses we make are based upon a constellation of findings, from all the little twinkles in a history and physical exam (and surely, each time you inspect or palpate a toe, you are performing a test however small) to each row and column of the diagnostic studies. Statistically we try to model this by saying it is a chain of Bayesian probability — each test providing a pre-test probability for the next — but really it is hard to reify. (I’m not suggesting it is somehow mystical, which we sometimes want to indicate when we start using words like “gestalt,” merely that it’s rather ineffable, since our internal processors tend to run in parallel rather than in series.)

A hundred “tests,” each with poor likelihood ratios, may in the end break a useful treatment threshold, even without suggesting any synergy between them — merely by summation.

So forget the math, what does this mean at the bedside.

1. The earlier a physcian suspects sepsis and orders the tests of Sepsis 3, the more likely the tests will fail to meet threshold values and be falsely negative.. Failure to order a follow-up set of tests on such cases will be catastrophic if reliance is placed on Sepsis 3.

2. The later a physical suspects and orders a WBC to determine the presence or absence of a severe infection, the more likely it will be a false negative (normal test). Reliance on the WBC and Failure to order bands (which have the opposite sensitivity function in the time domain) can be catastophic.

Failure to know the basic sensiitivity functions of thresholds (in the time domain) renders a false sense that one is applying these metrics in a scientific way.

Same as when you try to diagnose rheumatoid arthritis with a rheumatoid factor

Hmm. ..

Well I am not sure the RF often normalizes as rheumatoid arthritis severity and duration increases as the WBC count often does in sepsis, but I could be convinced with a good reference.