I am planning a new example-based series of articles using what I am calling probability model homotopy. This is notation I am introducing to slow down and clarify discussions of how probability models perform on different populations.
As motivation take “A Gruesome Example of Bayes’ Law”, where we have statistics collected from a population that, by design, had a positive SARS-CoV-2 test result prevalence of 49%, but that are intended to be applied to populations with much lower prevalences. The usual way of moving statistics between populations of different prevalences is to report only statistics that are independent of the population prevalence, such as the odds ratio of evidence conditioned on outcome. This is a standard, sound, and effective technique. We will use the probability model homotopy tool to show it is not the only admissible method.
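For example, a test’s sensitivity and specificity are independent of population prevalence, so they can be carried to a new population and recombined with that population’s prevalence via Bayes’ Law. A minimal sketch in Python (the 0.9 / 0.95 test characteristics below are made-up illustration values, not figures from the cited article):

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' Law."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


print(ppv(0.9, 0.95, 0.49))  # ~0.95 in a 49%-prevalence study population
print(ppv(0.9, 0.95, 0.01))  # ~0.15 in a 1%-prevalence target population
```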
A probability model homotopy is defined as follows.

- Let X be a space of explanatory vectors, or probability model inputs. “Vector” is used in the computer science sense, meaning a fixed-length list of scalars (the entries need not all be numbers, or have any algebraic structure).
- By prevalence we will mean the expected value of the categorical dependent variable (the quantity to be predicted), or the expected value of the probability prediction (even though this is a real number in the range [0, 1]).
- A probability model m : X ➝ [0, 1] is a map from X to predicted probabilities in the interval of real numbers [0, 1].
- A probability model homotopy H : X ✕ [0, 1] ➝ [0, 1] is a map from pairs of explanatory vectors and posited prevalences (numbers drawn from [0, 1]) to predicted probabilities in the interval of real numbers [0, 1].
- Hp : X ➝ [0, 1] is defined as the probability model that takes x to H(x, p).
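In code terms, a probability model homotopy is just a two-argument function, and Hp is the one-argument function obtained by fixing its second argument. A minimal Python sketch (the type aliases and the helper name at_prevalence are ours, purely for illustration):

```python
from typing import Callable, Sequence

# m : X -> [0, 1], a probability model
ProbabilityModel = Callable[[Sequence[float]], float]

# H : X x [0, 1] -> [0, 1], a probability model homotopy
ModelHomotopy = Callable[[Sequence[float], float], float]


def at_prevalence(H: ModelHomotopy, p: float) -> ProbabilityModel:
    """Hp : X -> [0, 1], the probability model taking x to H(x, p)."""
    return lambda x: H(x, p)
```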
With this notation defined we can discuss a few model homotopies.
- The oblivious (improper) model homotopy, defined as O : X ✕ [0, 1] ➝ [0, 1] where O(x, p) = m(x) for some probability model m. This is essentially the procedure commonly used by data scientists, as their models are typically applied to situations that have the same expected prevalence as the training data. This means this procedure is largely used in the narrow situations where it is in fact correct. This is the model homotopy one is implicitly using if one says “I don’t need this adjustment/homotopy stuff, I’ll just use my model.”
- The balanced (improper) model homotopy, defined as B : X ✕ [0, 1] ➝ [0, 1] where B(x, p) = m(x) for a probability model m trained on a re-weighted or re-sampled training set made to have an apparent prevalence of 0.5. This incorrect procedure is, unfortunately, very popular in data science as a purported improvement when working with rare outcome prevalences. Its popularity likely comes from a need to work around the additional common mis-practice of using classification rules where probability models would be more appropriate. For some discussion of this issue please see “Don’t Use Classification Rules for Classification Problems”.
- The point-wise shift model homotopy, defined as P : X ✕ [0, 1] ➝ [0, 1] where P(x, p) = sigmoid(logit(p) + logit(m(x)) - logit(q)) for some probability model m, where q is the outcome prevalence the model m was trained on. This adjustment is essentially a point-wise application of Bayes’ Law. We can show this transform can be biased in the sense that the expected value of Pp is not always p (see the code sketch after this list).
- The unbiased shift model homotopy, defined as U : X ✕ [0, 1] ➝ [0, 1] where U(x, p) = sigmoid(logit(m(x)) + a(p)) for some probability model m, where a(p) is a scalar picked so that the expected value of Up is p on an appropriate sample space. We will show U is not always the same model homotopy as P, even for models as simple as logistic regression.
- The tailored model homotopy, defined as T : X ✕ [0, 1] ➝ [0, 1] where T(x, p) = mp(x), where probability model mp is an unbiased probability model fit on a re-weighted training set with outcome prevalence p. The tailored model can be thought of mathematically as an uncountable lookup table with one model for every possible prevalence. This model homotopy can be realized a number of different ways, one obvious one being retaining the training data set and re-training the model on a re-weighted version of it for whatever p is required. We believe this model homotopy is a very favorable one, and not in general equivalent to any of the model homotopies mentioned up to now. One of our points is that many projects act as if they are using T, when they are in fact using one of the other model homotopies we have discussed.
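To make the distinctions concrete, here is a minimal Python sketch of P, U, and T under stated assumptions: scikit-learn style models, a bisection solve for a(p) (one of many possible implementations), and helper names that are ours, not standard.

```python
import numpy as np
from scipy.special import expit, logit  # expit is the sigmoid function
from sklearn.linear_model import LogisticRegression


def P(m, q):
    """Point-wise shift: P(x, p) = sigmoid(logit(p) + logit(m(x)) - logit(q))."""
    return lambda x, p: expit(logit(p) + logit(m(x)) - logit(q))


def U(m, X_sample):
    """Unbiased shift: U(x, p) = sigmoid(logit(m(x)) + a(p)), with a(p)
    chosen by bisection so the expected value of Up over X_sample is p."""
    base = logit(np.array([m(x) for x in X_sample]))

    def H(x, p):
        lo, hi = -30.0, 30.0  # mean(expit(base + a)) is increasing in a
        for _ in range(60):
            a = (lo + hi) / 2
            if np.mean(expit(base + a)) < p:
                lo = a
            else:
                hi = a
        return float(expit(logit(m(x)) + a))

    return H


def T(X_train, y_train):
    """Tailored: re-fit an (L2 regularized) logistic regression on a
    re-weighting of the training set that has outcome prevalence p."""
    q = np.mean(y_train)  # training prevalence

    def H(x, p):
        # weights chosen so the weighted outcome prevalence is exactly p
        w = np.where(y_train == 1, p / q, (1 - p) / (1 - q))
        model = LogisticRegression().fit(X_train, y_train, sample_weight=w)
        return model.predict_proba(np.asarray(x).reshape(1, -1))[0, 1]

    return H
```

The per-call re-fit in T is deliberately naive: it realizes the “one model for every possible prevalence” lookup table directly, at the cost of re-training on each evaluation.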
The purpose of the series is to exhibit concrete reasonable examples proving the above claims.
In general, probability model homotopies are not homotopies in the full mathematical sense, as they may not meet all of the continuity conditions. We are going to primarily study probability model homotopies in the context of (L2 regularized) logistic regression, where we will have enough continuity to establish a true homotopy from the model that always predicts 0 to the model that always predicts 1.
My feeling is: model homotopies are already around, and data scientists regularly work with them. Families of models that differ only by prevalence or by a regularization parameter are model homotopies. There is some clumsiness in looking only at models one at a time, when in fact we are often interested in related families of models.
My thesis is: predictive models encode the prevalence of their training set, which can be counter-productive if the model is to be used in a situation with a different prevalence. The usual fixes are to ignore the situation, to try to strip off the prevalence information, or to try to adjust it. We feel that with a richer data structure than a single model or conditional evidence densities, additional correct procedures become possible, realizable, and obvious. Setting up these definitions may seem laborious, but it is in fact less effort than having no clear way to differentiate between O, B, P, U, or T in discussion.