Here is an incredibly clear, but unfortunately gruesome, example of a variation of Bayes’ Law. A good teachable point.
Consider the recent CDC article “Community and Close Contact Exposures Associated with COVID-19 Among Symptomatic Adults ≥18 Years in 11 Outpatient Health Care Facilities.” It states:
Adults with positive SARS-CoV-2 test results were approximately twice as likely to have reported dining at a restaurant than were those with negative SARS-CoV-2 test results.
If we take as working assumptions a causal link from going out to infection, that the study is representative, that testing positive for SARS-CoV-2 is the same as having COVID-19, and that the self-reporting is accurate, then this tells us how much dining out elevates one’s risk. This connection is so strong that any experienced scientific researcher immediately sees it, so they don’t have to “state the obvious.” But let’s derive it here as an example for learning Bayes’ Law (it is a theorem, but I was taught it as a “law”).
Bayes’ Law is easy to derive. Let’s introduce some terms:
- P[c] is the probability of having COVID-19. This is called the prior.
- P[r] is the probability of having dined at a restaurant in the recent past.
- P[c and r] is the probability of both having COVID-19 and having dined at a restaurant in the recent past. This is also equal to P[r and c].
- P[c|r], called “probability of c given r”, is the probability of having COVID-19, given one has dined at a restaurant in the recent past. It is defined as P[c and r]/P[r].
- P[r|not c], called “probability of r given not c”, is the probability of having dined at a restaurant in the recent past, given one does not have COVID-19. It is defined as P[r and not c]/P[not c].
- P[r|c], called “probability of r given c”, is the probability of having dined at a restaurant in the recent past, given one has COVID-19. It is defined as P[r and c]/P[c].
All of these probabilities are frequencies or rates over a population being examined. We are also assuming all of these quantities are above zero.
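To make the definitions concrete, here is a minimal Python sketch that computes each quantity from a 2x2 table of counts. The counts are hypothetical and chosen only for illustration (they are not the CDC study's data); they happen to give P[r|c]/P[r|not c] = 2.

```python
from fractions import Fraction

# Hypothetical counts for illustration only (NOT taken from the CDC study).
# First key: has COVID-19 ("c") or not; second key: dined out ("r") or not.
counts = {
    ("c", "r"): 40,
    ("c", "not r"): 60,
    ("not c", "r"): 180,
    ("not c", "not r"): 720,
}

total = sum(counts.values())

# Marginal and joint probabilities, as frequencies over the population.
p_c = Fraction(counts[("c", "r")] + counts[("c", "not r")], total)
p_r = Fraction(counts[("c", "r")] + counts[("not c", "r")], total)
p_c_and_r = Fraction(counts[("c", "r")], total)
p_r_and_not_c = Fraction(counts[("not c", "r")], total)

# Conditional probabilities, straight from the definitions above.
p_c_given_r = p_c_and_r / p_r
p_r_given_c = p_c_and_r / p_c
p_r_given_not_c = p_r_and_not_c / (1 - p_c)

print("P[c]       =", p_c)
print("P[r]       =", p_r)
print("P[c|r]     =", p_c_given_r)
print("P[r|c]     =", p_r_given_c)
print("P[r|not c] =", p_r_given_not_c)
```

Using `Fraction` keeps the arithmetic exact, so the ratios can be compared without floating point noise.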
We are interested in P[c|r]/P[c], the rate at which restaurant diners in a (possibly new) target population have COVID-19 relative to the rate at which that target population as a whole has COVID-19. We use this as a measure of elevated risk, and if we are similar to the target population this risk may apply to us.
Now the article reported P[r|c]/P[r|not c], not P[c|r]/P[c]. We are going to assume that P[r|not c] is nearly equal to P[r]. This would be true in a population that largely does not have COVID-19 (small P[c]). This lets us pretend the study reported P[r|c]/P[r] for a study population and try to use it to estimate P[c|r]/P[c] with less algebra.
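A small sketch of how good this approximation is. The conditional rates below are hypothetical (picked so that P[r|c]/P[r|not c] = 2), and P[r] is computed by the law of total probability; the ratio P[r|not c]/P[r] should move toward 1 as P[c] shrinks.

```python
def p_r(p_c, p_r_given_c, p_r_given_not_c):
    """P[r] by the law of total probability."""
    return p_c * p_r_given_c + (1 - p_c) * p_r_given_not_c


# Hypothetical conditional rates, chosen so that P[r|c]/P[r|not c] = 2.
P_R_GIVEN_C = 0.4
P_R_GIVEN_NOT_C = 0.2

# Compare P[r|not c] to P[r] at a high, moderate, and low prevalence.
for p_c in (0.5, 0.1, 0.01):
    ratio = P_R_GIVEN_NOT_C / p_r(p_c, P_R_GIVEN_C, P_R_GIVEN_NOT_C)
    print(f"P[c] = {p_c:<5} P[r|not c]/P[r] = {ratio:.3f}")
```

At a prevalence of 50% the two quantities differ noticeably; at 1% they are within about one percent of each other.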
The scientific reader knows from experience that (for a single population) P[r|c]/P[r] equals P[c|r]/P[c]. So informally an experienced scientific reader interprets the original quote as:
Eating out at restaurants appears to approximately double the risk of contracting COVID-19!
(Note: we are not saying any one visit doubles your risk, but that the behavior associated with eating out at restaurants may double one’s risk. Also, we are assuming a causal direction: that COVID-19 doesn’t make you dine out. This doesn’t come from the data, and we will discuss this at the end of the note.)
Let’s derive the equation P[c|r]/P[c] = P[r|c]/P[r]. All calculations are done in a single population (either the study population, or a new target population which may have different probabilities).
P[c|r]/P[c] = (P[c and r]/P[r]) / P[c]   # substituting in the definition
            = P[c and r]/(P[c] P[r])     # algebra

P[r|c]/P[r] = (P[r and c]/P[c]) / P[r]   # substituting in the definition
            = P[r and c]/(P[c] P[r])     # algebra
            = P[c and r]/(P[c] P[r])     # as P[r and c] = P[c and r]
So P[c|r]/P[c] = P[r|c]/P[r] as claimed (as they both equal P[c and r]/(P[c] P[r])).
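As a sanity check, the identity can be confirmed numerically. This is a minimal sketch with hypothetical probabilities (any values with P[c], P[r], and P[c and r] above zero would do); the function name `lift_both_ways` is made up for this example.

```python
from fractions import Fraction


def lift_both_ways(p_c, p_r, p_c_and_r):
    """Return (P[c|r]/P[c], P[r|c]/P[r]) computed from the definitions."""
    p_c_given_r = p_c_and_r / p_r
    p_r_given_c = p_c_and_r / p_c
    return p_c_given_r / p_c, p_r_given_c / p_r


# Hypothetical single-population probabilities, for illustration only.
left, right = lift_both_ways(
    p_c=Fraction(1, 10),
    p_r=Fraction(11, 50),
    p_c_and_r=Fraction(1, 25),
)

print(left, right)  # both sides are the same exact fraction
assert left == right
```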
Bayes’ Law is usually stated as P[c|r] = P[c] P[r|c] / P[r], which is just a different arrangement of the same equation. The P[c and r]/(P[c] P[r]) form relates how much P[c and r] does or does not look like the independent product P[c] P[r].
The P[c|r]/P[c] = P[r|c]/P[r] form is what converts what was written to what we were interested in. This is an important law of probability to remember: the observed relative change in the rate of an outcome given specific evidence is the same as the observed relative change in the rate of that evidence given the outcome. Sick people being twice as likely to have recently dined in a restaurant is the same as seeing people having recently dined in a restaurant being twice as likely to be sick.
(Important caveat: from the data alone we can’t determine cause. One has to use either prior domain knowledge or a controlled experiment to form a useful opinion if dining out causes sickness, sickness causes dining out, or if there is a third hidden factor causing both. Retrospective data always only shows association, not cause and effect.)
The equation in terms of the exact reported P[r|c]/P[r|not c] is P[c|r]/P[c] = 1/(P[c] + (1 – P[c]) P[r|not c] / P[r|c]). To actually make a prediction for a given population we would hope that the quantity P[r|c]/P[r|not c] is conserved between different populations. So we would take the P[r|c]/P[r|not c] from the study and then plug in an estimate for the P[c] from the population we are interested in (not the P[c] from the study!).
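For readers who want the intermediate steps, the exact formula follows from the identity P[c|r]/P[c] = P[r|c]/P[r] plus the law of total probability for P[r]. A short derivation, written in LaTeX with the same symbols:

```latex
\begin{align*}
P[r] &= P[c]\,P[r \mid c] + (1 - P[c])\,P[r \mid \text{not } c] \\[4pt]
\frac{P[c \mid r]}{P[c]}
  &= \frac{P[r \mid c]}{P[r]} \\
  &= \frac{P[r \mid c]}{P[c]\,P[r \mid c] + (1 - P[c])\,P[r \mid \text{not } c]} \\
  &= \frac{1}{P[c] + (1 - P[c])\,\dfrac{P[r \mid \text{not } c]}{P[r \mid c]}}
\end{align*}
```

The last step divides numerator and denominator by P[r|c], which is allowed because we assumed all of these quantities are above zero.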
A natural follow-up question is: why is P[r|c]/P[r|not c] what is reported, instead of P[c|r], P[c|r]/P[c], or even P[r|c]/P[r]? The answer is: P[r|c]/P[r|not c] is the only one of these quantities that doesn’t encode the disease prevalence from the study data set. So P[r|c]/P[r|not c] is hopefully less sensitive to the rate of disease in the study population, and allows for different study designs (such as deliberately recruiting sick subjects). Not all statistics and probabilities are preserved when we switch populations; we are hoping P[r|c]/P[r|not c] is one that is.
To make predictions, we take the P[r|c]/P[r|not c] from the study and plug our own P[c] into the last formula we gave to estimate our P[c|r]/P[c]. For example, if we think we are part of a population with a 5% prevalence, then P[c] = 0.05 and our estimate is P[c|r]/P[c] = 1/(P[c] + (1 – P[c]) P[r|not c] / P[r|c]) = 1/(0.05 + (1 – 0.05)*0.5), or about 1.9. This is not very far from the earlier 2.
Note: the study population was built by a stratified sampling of sick and control subjects. So P[c] for it was 154/(154+160), or about 0.49. 1/(0.49 + (1 – 0.49)*0.5) is about 1.34, or a smaller 34% relative increase. It would have been a mistake to quote this increase from the study, as it only applies to high prevalence (very sick) populations.
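The two calculations above can be reproduced with a few lines of Python. This is a sketch, not anything from the study itself: the function name `relative_risk` is made up for this example, and the ratio 2.0 is the rounded "approximately twice as likely" figure from the quote.

```python
def relative_risk(likelihood_ratio, prevalence):
    """Estimate P[c|r]/P[c] from P[r|c]/P[r|not c] and the prevalence P[c].

    Implements 1/(P[c] + (1 - P[c]) * P[r|not c]/P[r|c]).
    """
    if likelihood_ratio <= 0:
        raise ValueError("likelihood_ratio must be above zero")
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    return 1.0 / (prevalence + (1.0 - prevalence) / likelihood_ratio)


# Rounded figure from the quoted sentence, used here as an assumption.
STUDY_RATIO = 2.0

populations = [
    ("5% prevalence", 0.05),
    ("study population", 154 / (154 + 160)),
]

for label, p_c in populations:
    estimate = relative_risk(STUDY_RATIO, p_c)
    print(f"{label}: P[c] = {p_c:.2f}, P[c|r]/P[c] = {estimate:.2f}")
```

As the prevalence goes to zero the estimate approaches the reported ratio itself, which is why the informal reading of "about double" works for low prevalence populations.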
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
“from experience that P[r|c]/P[r|c] equals P[c|r]/P[c].” The first term as written is 1. Probably not what you intended, maybe it should be P[r|c]/P[r] ?
Nice post by the way. Thank you.
Thanks! I introduced that typo when I added the more correct discussion of P[r|not c]. It should be fixed now.