Bayesian reasoning in clinical diagnostics: a primer.

We know, from the source of eternal wisdom that is Saturday Morning Breakfast Cereal, that insufficient math education is the basis of the entire Western economy.1 This makes Bayesian logic and reasoning about probabilities almost like a dark art, a well-kept secret that only a few seem to know (and it shouldn’t be… but that’s a different story). This weird-wonderful argument, reflecting a much-reiterated meme about vaccines and vaccine efficacy, is a good example:

The argument, here, in case you are not familiar with the latest in anti-vaccination fallacies, is that vaccines don’t work, and they have not reduced the incidence of vaccine-preventable diseases. Rather, if a person is vaccinated for, say, measles, then despite displaying clinical signs of measles, he will be considered to have a different disease, and therefore all disease statistics proving the efficacy of vaccines are wrong. Now, that’s clearly nonsense, but it highlights one interesting point, one that has a massive bearing on computational systems drawing conclusions from evidence: namely, the internal Bayesian logic of the diagnostic process.

Which, incidentally, is the most important thing that they didn’t teach you in school. Bayesian logic, that is. Shockingly, they don’t even teach much of it in medical school unless you do research, and even there it’s seen as a predictive method, not a tool to make sense of the analytical process. Which is a pity. The reason why idiotic arguments like the above by @Cattlechildren proliferate is that physicians have been taught how to diagnose well, but never how to explain and reason about the diagnostic process. This was true for the generations before me, and is more or less true for those still in med school today. What is often covered up with nebulous concepts like ‘clinical experience’ is in fact solid Bayesian reasoning. Knowing the mathematical fundamentals of the thought process you use day to day, and which helps you make the right decisions every day in the clinic, helps you reason about it, find its weak points and answer challenges to it. For this reason, my highest hope is that as many MDs, epidemiologists, med students, RNs, NPs and other clinical decision-makers as possible will engage with this topic, even if it’s a little long. I promise, it’s worth it.

Some basic ideas about probability

In probability, an event, usually denoted with a capital letter and customarily starting at A (I have no idea why, as it only makes things more confusing!), is any outcome or occurrence that we’re interested in – as long as it is binary, that is, it either happens or doesn’t happen, and discrete, that is, there’s a clear definition for it, so that we can decide whether it’s happened or not – no half-way events for now.2 In other words, an event can’t happen and not happen at the same time. Or, to get used to the notation of conditionality:3 p(A \mid \neg A) = 0. A thing cannot be both true and false.

Now, we may be interested in how likely it is for an event to happen if another event happens: how likely is A if B holds true? This is denoted as p(A|B), and for now, the most important thing to keep in mind about it is that it is not necessarily the same as p(B|A)!4

Bayesian logic deals with the notion of conditional probabilities – in other words, the probability of one event, given another.5 It is one of the most widely misunderstood parts of probability, yet understanding it is crucial to understanding the way we reason about things.
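To make the asymmetry between p(A|B) and p(B|A) concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they come from no real data set), and the helper name `conditional` is just a label chosen for this example.

```python
from fractions import Fraction

# Hypothetical joint counts over 1,000 observations (invented for illustration):
# A = "the pavement is wet", B = "it is raining".
counts = {
    (True, True): 90,     # A and B
    (True, False): 110,   # A and not-B
    (False, True): 10,    # not-A and B
    (False, False): 790,  # neither
}

def conditional(counts, target, given):
    """Return p(target | given); index 0 refers to event A, index 1 to event B."""
    given_total = sum(n for key, n in counts.items() if key[given])
    both = sum(n for key, n in counts.items() if key[given] and key[target])
    return Fraction(both, given_total)

p_a_given_b = conditional(counts, target=0, given=1)  # p(A | B)
p_b_given_a = conditional(counts, target=1, given=0)  # p(B | A)

print(f"p(A|B) = {float(p_a_given_b):.2f}")  # 90 / 100
print(f"p(B|A) = {float(p_b_given_a):.2f}")  # 90 / 200
```

The same 90 observations sit in the numerator both times; only the denominator changes, because wet pavements are twice as common as rain in this made-up table. That difference in how common each event is on its own is all it takes to make the two conditional probabilities diverge.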

Just to understand how important this is, let us consider a classic example.

Case study 1: speed cameras

Your local authority is broke. And so, it does what local authorities do when they’re broke: no, not play poker with the borough credit card, but set up a bunch of speed cameras and fine drivers. Over this particular stretch of road, the speed limit is 60mph.

According to the manufacturer, the speed cameras are very sensitive, but not very specific. In other words, they never falsely indicate that a driver was below the speed limit, but they may falsely indicate that the driver was above it, in about 3% of the cases (the false positive rate).

One morning, you’re greeted by a message in your postbox, notifying you that you’ve driven too fast and fining you a rather respectable amount of cash. What is the probability that you have indeed driven too fast?

You may feel inclined to blurt out 97%. That, in fact, is wrong.


It’s rather counter-intuitive at first to understand why, until we consider the problem in formal terms. We know the probability p(A \mid \neg B), that is, the probability of being snapped (A) even though you were not speeding (\neg B). But what the question asks for is the likelihood that you were, in fact, speeding (B), given that you were snapped (A), that is, p(B \mid A). And as we have learned, the conditional probability operator is not commutative, that is, p(A \mid B) is not necessarily the same as p(B \mid A).

Why is that the case? Because base rates matter. In other words, the probabilities of A and B, in and of themselves, are material. Consider, for a moment, the unlikely scenario of living in that mythical wonderland of law-abiding citizens where nobody speeds. Then, it does not matter how many drivers are snapped – all of them are false positives, and thus p(B|A), the probability of speeding (B) given that one got snapped by a speed camera (A), is actually zero.

In other words, if we want to reverse the conditional operator, we need to make allowances for the ‘base frequency’, the ordinary frequency with which each event occurs on its own. To overcome base frequency neglect,6 we have a mathematical tool, courtesy of the good Revd. Thomas Bayes, who sayeth that, verily,

$p(B \mid A) = \frac{p(A \mid B) \, p(B)}{p(A)}$

Or, in words: if you want to reverse the conditional probability, you will have to take the base frequencies of each event into account. What we know is the likelihood of being snapped if you were not speeding; what we’re interested in is the likelihood that someone who got snapped was indeed speeding. To get from one to the other, we’ll need to know a few more things.

Case study 1: Speed cameras – continued

  • We know that the speed cameras have a Type II (false negative) error rate of zero – in other words, if you are speeding (B), you are guaranteed to get snapped (A) – thus, $p(A \mid B)$ is 1.
  • We also know from the Highway Authority, who were using a different and more accurate measurement system, that approximately one in 1,000 drivers is speeding (p(B) = 0.001).
  • Finally, we know that of 1,000 drivers, about 31 will be snapped – the one speeder, plus roughly 30 false positives (3% of the 999 drivers who were not speeding) – yielding p(A) = 0.031.

Putting that into our equation,

$p(B \mid A) = \frac{p(A \mid B) \, p(B)}{p(A)} = \frac{1 \cdot 0.001}{0.031} \approx 0.032$

In other words, the likelihood that we indeed did exceed the speed limit is just barely north of 3%. That’s a far cry from the ‘intuitive’ answer of 97% (quite accidentally, it’s almost exactly the complement).
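The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation: the function name `posterior` and its parameter names are my own choices, and the denominator p(A) is expanded with the law of total probability instead of being typed in as 0.031.

```python
def posterior(prior, p_evidence_given_true, p_evidence_given_false):
    """Bayes' theorem for a binary hypothesis.

    The denominator p(evidence) is expanded with the law of total probability:
    p(A) = p(A | B) p(B) + p(A | not-B) p(not-B).
    """
    p_evidence = (p_evidence_given_true * prior
                  + p_evidence_given_false * (1 - prior))
    if p_evidence == 0:
        raise ValueError("the evidence has probability zero under these inputs")
    return p_evidence_given_true * prior / p_evidence

# Figures from the case study.
p_speeding = 0.001        # p(B): one driver in 1,000 speeds
p_snap_if_speeding = 1.0  # p(A | B): no false negatives
p_snap_if_not = 0.03      # p(A | not-B): the 3% false positive rate

p_speeding_if_snapped = posterior(p_speeding, p_snap_if_speeding, p_snap_if_not)
print(f"p(B|A) = {p_speeding_if_snapped:.3f}")

# The law-abiding wonderland: nobody speeds, so every snap is a false positive.
print(posterior(0.0, p_snap_if_speeding, p_snap_if_not))
```

Computed this way, p(A) comes out at 0.03097 rather than the rounded 0.031, which makes no difference to the answer at three decimal places. Setting the prior to zero reproduces the thought experiment from earlier: however many drivers get snapped, the probability that any of them was speeding stays at zero.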

Diagnostics, probabilities and Bayesian logic

The procedure of medical diagnostics is ultimately a relatively simple algorithm:

  1. create a list of possibilities, however remote (the process of differential diagnostics),
  2. order them in order of likelihood,
  3. update priors as you run tests.7

From a statistical perspective, this is implemented as follows.

  1. We begin by running a number of tests, specifically m of them. It is assumed that the tests are independent from each other, i.e. the value of one does not affect the value of another. Let R_j denote the results of test $j \leq m$.
    1. For each test, we need to iterate over all our differentials $D_1, \ldots, D_n$, and determine the probability of each in light of the new evidence, i.e. $p(D_i \mid R_j)$.
    2. So, let’s take the results of test j that yielded the results R_j, and the putative diagnosis D_i. What we’re interested in is p(D_i \mid R_j), that is, the probability of the putative diagnosis given the new evidence. Or, to use Bayesian lingo, we are updating our prior: we had a previous probability assigned to D_i, which may have been a uniform probability or some other probability, and we are now updating it – seeing how likely it is given the new evidence, getting what is referred to as a posterior.8
    3. To calculate the posterior p(D_i \mid R_j), we need to know three things – the sensitivity and specificity of the test j (I’ll call these S^+_j and S^-_j, respectively), the overall incidence of D_i,9 and the overall incidence of the particular result R_j.
    4. Plugging these variables into our beloved Bayesian formula, we get p(D_i \mid R_j) = \frac{p(R_j \mid D_i) p(D_i)}{p(R_j)}.
    5. We know that p(R_j \mid D_i), that is, the probability that someone will test a particular way if they do have the condition D_i, is connected to sensitivity and specificity: if test j is supposed to be positive if the patient has D_i, then p(R_j \mid D_i) = S^+_j (sensitivity), whereas if the test is supposed to be negative if the patient has D_i, then p(R_j \mid D_i) = S^-_j (specificity).
    6. We also know, or are supposed to know, the overall incidence of D_i and the probability of a particular outcome, R_j. With that, we can update our prior for D_i \mid R_j.
  2. We iterate over each of the tests, updating the priors every time new evidence comes in.
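The loop described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a clinical tool: the differentials are treated as mutually exclusive and exhaustive (so p(R_j) can be obtained by summing over them), the tests are treated as independent given the diagnosis, and every number below is made up. The names `update`, `beliefs` and `observed` are hypothetical.

```python
def update(priors, likelihoods):
    """One round of Bayesian updating over a differential.

    priors:      {diagnosis: p(D_i)}
    likelihoods: {diagnosis: p(R_j | D_i)} for the result actually observed
    Returns      {diagnosis: p(D_i | R_j)}
    """
    # p(R_j), by the law of total probability over the differential
    # (assumes the listed diagnoses are mutually exclusive and exhaustive).
    p_result = sum(likelihoods[d] * priors[d] for d in priors)
    return {d: likelihoods[d] * priors[d] / p_result for d in priors}

# Hypothetical differential with made-up priors (they sum to 1).
beliefs = {"D_1": 0.70, "D_2": 0.25, "D_3": 0.05}

# Hypothetical p(R_j | D_i) for the observed results of two tests.
observed = [
    {"D_1": 0.10, "D_2": 0.80, "D_3": 0.60},  # result R_1
    {"D_1": 0.50, "D_2": 0.90, "D_3": 0.20},  # result R_2
]

for j, likelihoods in enumerate(observed, start=1):
    beliefs = update(beliefs, likelihoods)  # the posterior becomes the next prior
    ranked = sorted(beliefs.items(), key=lambda item: item[1], reverse=True)
    print(f"after R_{j}:", ", ".join(f"{d}={p:.3f}" for d, p in ranked))
```

In this toy run, D_1 starts out as the front-runner but is overtaken by D_2 once the first result comes in, and the second result widens the gap. The likelihood values stand in for the sensitivity or specificity figures one would look up for a real test.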

This may sound daunting and highly mathematical, but in fact most physicians have this down to an innate skill, so much so that when I explained this to a group of FY2 doctors, they couldn’t believe it – until they thought about how they thought. And that’s a key issue here: thinking about the way we arrive at results is important, because it is the bedrock of what we need to make those results intelligible to others.

Case study 2: ATA testing for coeliac disease

For a worked example of this in the diagnosis of coeliac disease, check Notebook 1: ATA case study. It puts things in the context of sensitivity and specificity in medical testing, and is in many ways quite similar to the above example, except here, we’re working with a real-world test with real-world uncertainties.

There are several ways of testing for coeliac disease, an autoimmune disorder in which the body responds to gluten proteins (gliadins and glutenins) in wheats, wheat hybrids, barley, oats and rye. One diagnostic approach looks at genetic markers in the HLA-DQ (Human Leukocyte Antigen type DQ), part of the MHC (Major Histocompatibility Complex) Class II receptor system. Genetic testing for a particular haplotype of the HLA-DQ2 gene, called DQ2.5, can lead to a diagnosis in most patients. Unfortunately, it’s slow and expensive. Another test, an endoscopic biopsy of the small intestine, looks at the intestinal villi, short protrusions (about 1mm long) into the intestine, for tell-tale damage – but this test is unpleasant, possibly painful and costly.

So, a more frequent way is by looking for evidence of an autoantibody called anti-tissue transglutaminase antibody (ATA) – unrelated to this gene, sadly. ATA testing is cheap and cheerful, and relatively good, with a sensitivity (S^+_{ATA}) of 85% and specificity (S^-_{ATA}) of 97%.10 We also know the rough probability of a sample being from someone who actually has coeliac disease – for a referral lab, it’s about 1%.

Let’s consider the following case study. A patient gets tested for coeliac disease using the ATA test described above. Depending on whether the test is positive or negative, what are the chances she has coeliac disease?

Figure: Sensitivity and specificity trade-off for an ATA test given various values of true coeliac disease prevalence in the population.

If you’ve read the notebook, you know by now that the probability of having coeliac disease if testing positive is around 22%, or a little better than one-fifth. And from the visualisation, you can see that small incremental improvements in specificity would yield a much larger increase in accuracy (marginal accuracy gain) than the same improvements in sensitivity.
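For readers without the notebook to hand, the calculation can be approximated with the short sketch below. It uses only the sensitivity, specificity and referral-lab baseline quoted in the text; the function names are my own, and this is not the notebook's code.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """p(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def negative_predictive_value(sensitivity, specificity, prevalence):
    """p(no disease | negative test)."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

SENS, SPEC, PREV = 0.85, 0.97, 0.01  # figures quoted in the text

ppv = positive_predictive_value(SENS, SPEC, PREV)
npv = negative_predictive_value(SENS, SPEC, PREV)
print(f"p(coeliac | positive) = {ppv:.3f}")
print(f"p(coeliac | negative) = {1 - npv:.4f}")

# Marginal gain from one extra percentage point of each property.
gain_sens = positive_predictive_value(SENS + 0.01, SPEC, PREV) - ppv
gain_spec = positive_predictive_value(SENS, SPEC + 0.01, PREV) - ppv
print(f"+1 point sensitivity: {gain_sens:+.4f}")
print(f"+1 point specificity: {gain_spec:+.4f}")
```

With these inputs, a positive result lands at roughly 22%, while a negative result leaves well under a 1% chance of coeliac disease. The last two lines show why specificity matters so much at low prevalence: nearly everyone tested is disease-free, so shaving a point off the false positive rate removes far more wrong calls than an extra point of sensitivity adds right ones.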

While quite simple, this is a good case study because it emphasises a few essential things about Bayesian reasoning:

  • Always know your baselines. In this case, we took a baseline of 1%, even though the average incidence of coeliac disease in the population is closer to about 0.25%. Why? Because we don’t spot-test people for coeliac disease. People who do get tested get tested because they exhibit symptoms that may or may not be coeliac disease, and by definition they have a higher prevalence11 of coeliac disease. The factor between the two is, of course, entirely imaginary – you would, normally, need to know or have a way to figure out the true baseline values.
  • Use independent baselines. It is absolutely crucial to make sure that you do not get the baselines from your own measurement process. In this case, for instance, the incidence of coeliac disease should not be calculated by reference to your own lab’s number of positive tests divided by total tests. This merely allows for further proliferation of false positives and negatives, however minuscule their effect. A good way is to do follow-up studies, checking how many of the patients tested positive or negative for ATA were further tested using other methodologies, many of which may be more reliable, and calculate the proportion of actual cases coming through your door by reference to that.
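How much the baseline matters can be seen by re-running the ATA calculation with different priors. In the sketch below, the 1% figure is the referral-lab baseline from the text and 0.25% is taken as the general-population figure (treat that value as an assumption); the 5% and 20% rows are hypothetical and are there only to show the trend. The function name is my own.

```python
def p_disease_given_positive(prior, sensitivity=0.85, specificity=0.97):
    """Posterior probability of disease after a positive test."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

# 0.25%: assumed general-population baseline; 1%: referral lab (from the text);
# 5% and 20%: hypothetical, added only to show the trend.
for prior in (0.0025, 0.01, 0.05, 0.20):
    print(f"baseline {prior:7.2%} -> p(coeliac | positive) = "
          f"{p_disease_given_positive(prior):.1%}")
```

The test itself never changes in this loop, yet the meaning of a positive result swings from a long shot to a near-certainty. That is the practical reason for insisting on a baseline that reflects who actually walks through your door, and for estimating it independently of the test being evaluated.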

Case study 3: Vaccines in differential diagnosis

This case is slightly different, as we are going to compare two different scenarios. Both concern D_{VPD}, a somewhat contrived vaccine-preventable illness. D_{VPD} produces a very particular symptom or symptom set, S, and produces this symptom or symptom set in every case, without fail.12 The question is – how does the vaccination status affect the differential diagnosis of two identical patients,13 presenting with the same symptoms S, one of whom is unvaccinated?

No. That’s not how this works. That’s not how ANY of this works. Nnnnope.

It has been a regrettably enduring trope of the anti-vaccination movement that because doctors believe vaccines work, they will not diagnose a patient with a vaccine-preventable disease (VPD), simply striking it off the differential diagnosis or substituting a different diagnosis for it.14 The reality is explored in this notebook, which compares two scenarios, of the same condition, with two persons with the sole difference of vaccination status. That difference makes a massive – about 7,800x – difference between the likelihood of the vaccinated and the unvaccinated person having the disease. The result is that a 7,800 times less likely outcome slides down the differential. As NZ paediatrician Dr Greenhouse (@greenhousemd) noted in the tweet, “it’s good medical care”. In the words of British economist John Maynard Keynes,15 “when the facts change, I change my mind”. And so do diagnosticians.

Quite simply put: it’s not an exclusion, or fudging data, or in any sensible way proof that “no vaccine in history has ever worked”. It’s a reflection of the reality that if in a population a condition is almost 8,000 times less likely, then, yes, other more frequent conditions push ahead.
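The mechanism can be illustrated with a toy differential. To be clear, this is not the notebook's model and does not reproduce its figure: every prior and likelihood below is hypothetical, the reduction factor of 1,000 is an arbitrary stand-in, and the listed diagnoses are assumed to be the only explanations for S. The one thing taken from the text is that the VPD produces S in every case.

```python
def rank_differential(priors, p_symptoms_given):
    """Return diagnoses sorted by posterior probability given the symptoms."""
    joint = {d: p_symptoms_given[d] * priors[d] for d in priors}
    total = sum(joint.values())  # normalise over the listed differentials
    posterior = {d: joint[d] / total for d in joint}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)

# p(S | D): the VPD always produces S (as stipulated in the text); the two
# alternatives and their values are hypothetical.
p_symptoms_given = {"VPD": 1.0, "alternative_1": 0.30, "alternative_2": 0.05}

# Hypothetical priors. The only difference between the two patients is the
# prior for the VPD, which vaccination lowers by an assumed factor of 1,000.
ALTERNATIVES = {"alternative_1": 0.002, "alternative_2": 0.010}
unvaccinated = {"VPD": 0.005, **ALTERNATIVES}
vaccinated = {"VPD": 0.005 / 1000, **ALTERNATIVES}

for label, priors in (("unvaccinated", unvaccinated), ("vaccinated", vaccinated)):
    ranking = rank_differential(priors, p_symptoms_given)
    print(label, [(d, round(p, 4)) for d, p in ranking])
```

In the unvaccinated patient the VPD tops the list; in the vaccinated patient it drops to the bottom, but it never leaves the list. Nothing has been struck off or relabelled. The same rule has been applied to a different prior, and the ordering follows from that.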

Lessons learned

Bayesian analysis of the diagnostic procedure does more than increase clarity about what one is doing as a clinician. It also allows the full panoply of tools available to mathematical and logical reasoning to be used to investigate claims, objections and contentions – and, as in the case of the alleged non-diagnosis of vaccine-preventable diseases, discard them.

The most powerful tool available to anyone who utilises any process of structured reasoning – be it clinical reasoning in diagnostics, algorithmic analysis, detective work or intelligence analysis – is the ability to reason formally about one’s own toolkit of structured processes. It is my hope that if you’ve never thought about your clinical diagnostic process in these terms, you will now be able to see a new facet of it.

References

1. The basis of non-Western economies tends to be worse. That’s about as much as Western economies have going for them. See: Venezuela and the DPRK.
2. There’s a whole branch of probability that deals with continuous probabilities, but discrete probabilities are crazy enough for the time being.
3. Read: The probability of A given not-A is zero. A being any arbitrary event: the stock market crashing, the temperature tomorrow exceeding 30°C, etc.
4. In other words, it may be the same, but that’s pure accident. Mathematically, they’re almost always different.
5. It’s tempting to assume that this implies causation, or that the second event must temporally succeed the first, but neither of those is implied, and in fact both only serve to confuse things more.
6. You will also hear this referred to as ‘base rate neglect’ or ‘base rate fallacy’. For us epidemiologists, ‘rate’ has a specific meaning – it generally means events over a span of time. It’s not a rate unless it’s over time. I know, we’re pedantic like that.
7. This presupposes that these tests are independent of each other, like observations of a random variable. They generally aren’t – for instance, we run the acute phase protein CRP, W/ESR (another acute phase marker) and a WBC count, but these are typically not independent from each other. In such cases, it’s legitimate to use B = B_1 \cap B_2 \cap \ldots \cap B_n or, as my preferred notation goes, B = \bigcap_{k=1}^{n} B_k. I know ‘updating’ is the core mantra of Bayesianism, but knowing what to update and knowing where to simply calculate the conjoint probability is what experts in Bayesian reasoning rake in the big bucks for.
8. Note that a posterior from this step can, upon more new evidence, become the prior in the next round – the prior for j may be the inferred probability p(D_i), but the prior for j + 1 is p(D_i \mid R_j), and so on. More about multiple observations later.
9. It’s important to note that this is not necessarily the population incidence. For instance, the overall incidence and thus the relevant D for EBOV (D_{EBOV}) is going to be different for a haemorrhagic fever referral lab in Kinshasa and a county hospital microbiology lab in Michigan.
10. Lock, R.J. et al. (1999). IgA anti-tissue transglutaminase as a diagnostic marker of gluten sensitive enteropathy. J Clin Pathol 52(4):274-7.
11. More epidemiopedantry: ‘incidence’ refers to new cases over time, ‘prevalence’ refers to cases at a moment in time.
12. This is, of course, unrealistic. I will do a walkthrough of an example of multiple symptoms that each have an association with the illness in a later post.
13. It’s assumed gender is irrelevant to this disease.
14. Presumably hoping that refusing to diagnose a patient with diphtheria and instead diagnosing them with a throat staph infection will somehow get the patient okay enough that nobody will notice the insanely prominent pseudomembrane…
15. Or not…