Introduction
Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression.
The issue is as follows: for logistic regression, omitted variables cause parameter estimation bias. This is true even when the explanatory variables are independent, which is not the case for the more familiar linear regression.
This is a known problem with known mitigations:
- Rhian Daniel, Jingjing Zhang, Daniel Farewell, “Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets”, Biometrical Journal, (2020), DOI: 10.1002/bimj.201900297.
- John M. Neuhaus, Nicholas P. Jewell, “A Geometric Approach to Assess Bias Due to Omitted Covariates in Generalized Linear Models”, Biometrika, Vol. 80, No. 4 (Dec. 1993), pp. 807-815.
- Zhiwei Zhang, “Estimating a Marginal Causal Odds Ratio Subject to Confounding”, Communications in Statistics – Theory and Methods, 38:3 (2009), 309-321.
(Thank you, Tom Palmer and Robert Horton for the references!)
For this note, let’s work out how to directly overcome the omitted variable bias by solving for the hidden or unobserved detailed data. We will work our example in R. We will derive some deep results out of a simple set-up, showing how to “un-marginalize” or “un-summarize” data.
Our Example
For an example let’s set up a logistic regression on two explanatory variables `X1` and `X2`. For simplicity we will take the case where `X1` and `X2` only take on the values `0` and `1`.
Our data is then keyed by the values of these explanatory variables and the dependent or outcome variable `Y`, which takes on only the values `FALSE` and `TRUE`. The keying looks like the following.
x1 | x2 | y |
---|---|---|
0 | 0 | FALSE |
1 | 0 | FALSE |
0 | 1 | FALSE |
1 | 1 | FALSE |
0 | 0 | TRUE |
1 | 0 | TRUE |
0 | 1 | TRUE |
1 | 1 | TRUE |
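The `detailed_frame` data frame used in the code below is keyed exactly this way. Its construction is not shown in this excerpt; a minimal version consistent with the table would be the following (an assumed reconstruction, not necessarily the original code).

# minimal construction of the keyed frame (assumed; not shown in the original excerpt)
detailed_frame <- expand.grid(
  x1 = c(0, 1),
  x2 = c(0, 1),
  y = c(FALSE, TRUE))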
Note: we are using upper case names for random variables and lower case names for corresponding values of these variables.
The Example Data
Let’s specify the joint probability distribution of our two explanatory variables. We choose them as independent with the following expected values.
# specify explanatory variable distribution
`P(X1=1)` <- 0.3
`P(X2=1)` <- 0.8
`P(X1=0)` <- 1 - `P(X1=1)`
`P(X2=0)` <- 1 - `P(X2=1)`
Our data set can then be completely described by the above explanatory variable distribution and the conditional probability of the dependent outcomes. For our logistic regression problem we set up our outcome conditioning as `P(Y=TRUE) ~ sigmoid(c0 + b1 * x1 + b2 * x2)`. Our example coefficients are as follows.
# 0.5772
(c0 <- -digamma(1))
## [1] 0.5772157
# 3.1415
(b1 <- pi)
## [1] 3.141593
# -8.1548
(b2 <- -3 * exp(1))
## [1] -8.154845
Please remember these coefficients in this order for later.
# show constants in an order we will see again
c(c0, b1, b2)
## [1] 0.5772157 3.1415927 -8.1548455
Using the methodology of Replicating a Linear Model we can build an example data set that obeys the specified explanatory variable distribution and has the specified outcome probabilities. This is just us building a data set matching an assumed known answer. Our data distribution is determined by `P(X1=1)`, `P(X2=1)`, and `P(Y=TRUE) ~ sigmoid(c0 + b1 * x1 + b2 * x2)`. Our inference task is to recover the parameters `P(X1=1)`, `P(X2=1)`, `c0`, `b1`, and `b2` from data, even in the situation where observers have omitted variable issues.
The complete detailed data is generated as follows. The `P(X1=x1, X2=x2, Y=y)` column is what proportion of a data set drawn from this specified distribution matches the row keys `x1`, `x2`, `y`; in other words it is the joint probability of a given row type. We can derive all the detailed probabilities as follows.
# get joint distribution of explanatory variables
detailed_frame["P(X1=x1, X2=x2)"] <- (
  ifelse(detailed_frame$x1 == 1, `P(X1=1)`, `P(X1=0)`)
  * ifelse(detailed_frame$x2 == 1, `P(X2=1)`, `P(X2=0)`)
)

# converting "links" to probabilities
sigmoid <- function(x) {1 / (1 + exp(-x))}

# get conditional probability of observed outcome
y_probability <- sigmoid(
  c0 + b1 * detailed_frame$x1 + b2 * detailed_frame$x2)

# record probability of observation
detailed_frame[["P(Y=y | X1=x1, X2=x2)"]] <- ifelse(
  detailed_frame$y,
  y_probability, 1 - y_probability)

# compute joint explanatory plus outcome probability of each row
detailed_frame[["P(X1=x1, X2=x2, Y=y)"]] <- (
  detailed_frame[["P(X1=x1, X2=x2)"]]
  * detailed_frame[["P(Y=y | X1=x1, X2=x2)"]])
The following table relates the `x1`, `x2`, `y` value combinations to the `P(X1=x1, X2=x2, Y=y)` column (which shows how common each such row is).
x1 | x2 | y | P(X1=x1, X2=x2) | P(Y=y | X1=x1, X2=x2) | P(X1=x1, X2=x2, Y=y) |
---|---|---|---|---|---|
0 | 0 | FALSE | 0.14 | 0.3595735 | 0.0503403 |
1 | 0 | FALSE | 0.06 | 0.0236881 | 0.0014213 |
0 | 1 | FALSE | 0.56 | 0.9994885 | 0.5597136 |
1 | 1 | FALSE | 0.24 | 0.9882958 | 0.2371910 |
0 | 0 | TRUE | 0.14 | 0.6404265 | 0.0896597 |
1 | 0 | TRUE | 0.06 | 0.9763119 | 0.0585787 |
0 | 1 | TRUE | 0.56 | 0.0005115 | 0.0002864 |
1 | 1 | TRUE | 0.24 | 0.0117042 | 0.0028090 |
For a logistic regression problem, the relation between `X1`, `X2` and `Y` is encoded in the `P(X1=x1, X2=x2, Y=y)` distribution, which gives the expected joint frequency of each possible data row in a drawn sample.
Inferring From Fully Observed Data
We can confirm this data set encodes the expected logistic relationship by recovering the coefficients through fitting.
# suppressWarnings() to avoid fractional data weight complaint
correct_coef <- suppressWarnings(
  glm(
    y ~ x1 + x2,
    data = detailed_frame,
    weights = detailed_frame[["P(X1=x1, X2=x2, Y=y)"]],
    family = binomial()
  )$coef
)

correct_coef
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
Notice we recover the coefficients of the `c0 + b1 * x1 + b2 * x2` form.
The Nonlinear Invariant
There is an interesting non-linear invariant that the `P(X1=x1, X2=x2, Y=y)` column obeys. We will use this invariant later, so it is worth establishing now. The principle is: the logarithm of our solution is orthogonal to a certain test vector, which will help us re-identify the solution later.
Consider the following test vector.
test_vec <- (
  (-1)^detailed_frame$x1
  * (-1)^detailed_frame$x2
  * (-1)^detailed_frame$y)

test_vec
## [1] 1 -1 -1 1 -1 1 1 -1
`sum(test_vec * log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]))` is always zero when `detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]` is the vector of row probabilities from a logistic model of the form we have been working with. In other words, `log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])` is orthogonal to `test_vec`. We confirm this for our example here, and derive it in the appendix.
p_vec <- test_vec * log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])

stopifnot( # abort render if claim is not true
  abs(sum(p_vec)) < 1e-8)

sum(p_vec)
## [1] -2.553513e-15
Roughly: this is one check that the data is consistent with the distributions a logistic regression with independent explanatory variables can produce.
The Problem
Now let’s get to our issue. Suppose we have two experimenters, each of whom observes only one of the explanatory variables. As we saw in Omitted Variable Effects in Logistic Regression, each of these experimenters will in fact estimate coefficients that are biased towards zero, due to the non-collapsibility of the modeling set-up. This differs from linear regression, where for independent explanatory variables (as we have here) we would expect each experimenter to be able to get an unbiased estimate of the coefficient for the explanatory variable available to them!
The “Unobserved to Observed” Linear Mapping
Let’s build a linear operator that computes the margins the experimenters actually observe. We or the experimenters can specify this mapping, and its output. We just don’t (yet) have complete information on the pre-image of this mapping.
knitr::kable(margin_transform, format = "html") |>
  kableExtra::kable_styling(font_size = 10)
P(X1=0, X2=0, Y=FALSE) | P(X1=1, X2=0, Y=FALSE) | P(X1=0, X2=1, Y=FALSE) | P(X1=1, X2=1, Y=FALSE) | P(X1=0, X2=0, Y=TRUE) | P(X1=1, X2=0, Y=TRUE) | P(X1=0, X2=1, Y=TRUE) | P(X1=1, X2=1, Y=TRUE) | |
---|---|---|---|---|---|---|---|---|
P(X1=0, X2=*, Y=FALSE) | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P(X1=1, X2=*, Y=FALSE) | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
P(X1=0, X2=*, Y=TRUE) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
P(X1=1, X2=*, Y=TRUE) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
P(X1=*, X2=0, Y=FALSE) | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
P(X1=*, X2=1, Y=FALSE) | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
P(X1=*, X2=0, Y=TRUE) | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
P(X1=*, X2=1, Y=TRUE) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
P(X1=0, X2=0, Y=*) | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
P(X1=1, X2=0, Y=*) | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
P(X1=0, X2=1, Y=*) | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
P(X1=1, X2=1, Y=*) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
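The `margin_transform` matrix itself is not constructed in the code shown here. One way it could be built, guided by the row and column labels in the table above (an assumed reconstruction, not necessarily the original code):

# sketch: one row of margin_transform per observable roll-up;
# a 1 marks each detailed_frame row that contributes to that roll-up
row_defs <- rbind(
  expand.grid(x1 = c(0, 1), y = c(FALSE, TRUE), x2 = NA_real_),  # X2 summed out
  expand.grid(x2 = c(0, 1), y = c(FALSE, TRUE), x1 = NA_real_),  # X1 summed out
  expand.grid(x1 = c(0, 1), x2 = c(0, 1), y = NA))               # Y summed out
margin_transform <- matrix(0, nrow = nrow(row_defs), ncol = nrow(detailed_frame))
for (i in seq_len(nrow(row_defs))) {
  matches <- rep(TRUE, nrow(detailed_frame))
  for (v in c("x1", "x2", "y")) {
    if (!is.na(row_defs[[v]][i])) {
      matches <- matches & (detailed_frame[[v]] == row_defs[[v]][i])
    }
  }
  margin_transform[i, ] <- as.numeric(matches)
}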
The above matrix linearly maps our earlier `P(X1=x1, X2=x2, Y=y)` column to various interesting roll-ups or aggregations. Put differently, it is 12 linear checks we expect our 8 unobserved distribution parameters to obey. Unfortunately the rank of this linear transform is only 7, so there is redundancy among the checks and the linear relations do not fully specify the unobserved distribution parameters. This is why we need additional criteria to drive our solution.
# apply the linear operator to compute marginalized observations
actual_margins <- margin_transform %*% detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=0, X2=*, Y=FALSE) | 0 | * | FALSE | 0.6100538 |
P(X1=1, X2=*, Y=FALSE) | 1 | * | FALSE | 0.2386123 |
P(X1=0, X2=*, Y=TRUE) | 0 | * | TRUE | 0.0899462 |
P(X1=1, X2=*, Y=TRUE) | 1 | * | TRUE | 0.0613877 |
P(X1=*, X2=0, Y=FALSE) | * | 0 | FALSE | 0.0517616 |
P(X1=*, X2=1, Y=FALSE) | * | 1 | FALSE | 0.7969046 |
P(X1=*, X2=0, Y=TRUE) | * | 0 | TRUE | 0.1482384 |
P(X1=*, X2=1, Y=TRUE) | * | 1 | TRUE | 0.0030954 |
P(X1=0, X2=0, Y=*) | 0 | 0 | * | 0.1400000 |
P(X1=1, X2=0, Y=*) | 1 | 0 | * | 0.0600000 |
P(X1=0, X2=1, Y=*) | 0 | 1 | * | 0.5600000 |
P(X1=1, X2=1, Y=*) | 1 | 1 | * | 0.2400000 |
The above margin frame describes how the detailed experiment is marginalized or censored down to what the different experimenters see. In our set-up experimenter 1 sees only the first four rows, and experimenter 2 sees only the next four rows. We consider the rest of the data “unobserved”.
A Blind Spot
We also note that `margin_transform` is blind to variation in the direction of our earlier `test_vec`. This can be confirmed as follows.
test_map <- margin_transform %*% test_vec

stopifnot(
  max(abs(test_map)) < 1e-8)
We know `log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])` is orthogonal to `test_vec`, but we don’t have an obvious linear relation between `detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]` and `test_vec`.
Fortunately we can show (in an appendix) that the logistic regression is also blind in this direction, so all of the indistinguishable data pre-images give us the same logistic regression solution. Also, we can use a maximum entropy principle to correctly recover the single actual data distribution specified.
Experimenter 1’s view
Let’s see what happens when an experimenter tries to perform inference on their fraction of the data.
# select data available to d1
d1 <- margin_frame[
  margin_frame$x2 == asterisk_symbol, , drop = FALSE]

knitr::kable(d1)
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=0, X2=*, Y=FALSE) | 0 | * | FALSE | 0.6100538 |
P(X1=1, X2=*, Y=FALSE) | 1 | * | FALSE | 0.2386123 |
P(X1=0, X2=*, Y=TRUE) | 0 | * | TRUE | 0.0899462 |
P(X1=1, X2=*, Y=TRUE) | 1 | * | TRUE | 0.0613877 |
# solve from d1's point of view
d1_est <- suppressWarnings(
  glm(
    y ~ x1,
    data = d1,
    weights = d1$actual_margins,
    family = binomial()
  )$coef
)

d1_est
## (Intercept) x1
## -1.9143360 0.5567057
Notice experimenter 1 got a much too small estimate of the `X1` coefficient: 0.5567057, whereas the correct value is 3.1415927. From experimenter 1’s point of view, the effect of the omitted variable `X2` makes `X1` hard to correctly infer.
Experimenter 2’s view
Experimenter 2 has the following portion of data, which also is not enough to get an unbiased coefficient estimate.
# select data available to d2
d2 <- margin_frame[
  margin_frame$x1 == asterisk_symbol, , drop = FALSE]

knitr::kable(d2)
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=*, X2=0, Y=FALSE) | * | 0 | FALSE | 0.0517616 |
P(X1=*, X2=1, Y=FALSE) | * | 1 | FALSE | 0.7969046 |
P(X1=*, X2=0, Y=TRUE) | * | 0 | TRUE | 0.1482384 |
P(X1=*, X2=1, Y=TRUE) | * | 1 | TRUE | 0.0030954 |
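For symmetry, experimenter 2’s fit would look like the following sketch (mirroring experimenter 1’s code; we don’t quote the resulting numbers here, but the `x2` estimate is likewise biased toward zero relative to the true value).

# solve from d2's point of view (a sketch mirroring experimenter 1's fit)
d2_est <- suppressWarnings(
  glm(
    y ~ x2,
    data = d2,
    weights = d2$actual_margins,
    family = binomial()
  )$coef
)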
Our Point
From the original data set’s point of view: both experimenters have wrong estimates of their respective coefficients. They do have correct estimates for their limited view of columns, but this is not what we are looking for when trying to infer causal effects. The question then is: if the experimenters pool their effort can they infer the correct coefficients?
A Solution Strategy
Each experimenter knows a lot about the data. They know the distribution of their explanatory variable, and even the joint distribution of their explanatory variable and the dependent or outcome data. Assuming the two explanatory variables are independent, the experimenters can cooperate to estimate the joint distribution of the explanatory variables. We will show how to use their combined observations to estimate the hidden data elements. This data can then be used for standard detailed analysis, like we showed on the original full data set.
This isn’t the first time we have proposed a “guess at the original data, as it wasn’t shared” as we played with this in Checking claims in published statistics papers.
Solution Steps
Our solution strategy is as follows:
- Estimate the joint distribution of `X1` and `X2` from the observed marginal distributions of `X1` and `X2` plus an assumption of independence.
- Plug the above and other details into the inverse of `margin_transform` to get a family of estimates of the original hidden data.
- Use the maximum entropy principle to pick a distinguished pre-image as the least surprising.
- Perform inference on this data to get coefficient estimates.
Note this strategy biases the data recovery to data sets that match our modeling assumptions. If the original data met our modeling assumptions this is in fact a useful inductive bias. If the original data did not match the modeling assumptions, then this will (unfortunately) hide issues.
Estimating the X1 and X2 joint distribution
Neither experimenter observed the following part of the marginal frame:
# show x1 x2 distribution portion of margin_frame
dx <- margin_frame[
  margin_frame$y == asterisk_symbol, , drop = FALSE]

knitr::kable(dx)
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=0, X2=0, Y=*) | 0 | 0 | * | 0.14 |
P(X1=1, X2=0, Y=*) | 1 | 0 | * | 0.06 |
P(X1=0, X2=1, Y=*) | 0 | 1 | * | 0.56 |
P(X1=1, X2=1, Y=*) | 1 | 1 | * | 0.24 |
However, under the independence assumption they can estimate it from their pooled observations as follows.
# estimate x1 x2 distribution from d1 and d2
d1a <- aggregate(actual_margins ~ x1, data = d1, sum)
d2a <- aggregate(actual_margins ~ x2, data = d2, sum)
dxe <- merge(d1a, d2a, by = c())
dxe["estimated_margins"] <- (
  dxe$actual_margins.x * dxe$actual_margins.y)
dxe$actual_margins.x <- NULL
dxe$actual_margins.y <- NULL
dxe <- dxe[order(dxe$x2, dxe$x1), , drop = FALSE]

knitr::kable(dxe)
x1 | x2 | estimated_margins |
---|---|---|
0 | 0 | 0.14 |
1 | 0 | 0.06 |
0 | 1 | 0.56 |
1 | 1 | 0.24 |
Notice `dxe` is built only from `d1` and `d2` (plus the assumed independence of `X1` and `X2`). At this point we have inferred the `P(X1=x1, X2=x2)` parameters from the observed data.
Combining Observations
We now combine all of our known data to get an estimate of the (unobserved) summaries produced by `margin_transform`.
# put together experimenter 1 and 2's joint estimate of marginal proportions
# from data they have in their sub-experiments.
estimated_margins <- c(
  d1$actual_margins,
  d2$actual_margins,
  dxe$estimated_margins
)

estimated_margins
## [1] 0.610053847 0.238612287 0.089946153 0.061387713 0.051761580 0.796904554
## [7] 0.148238420 0.003095446 0.140000000 0.060000000 0.560000000 0.240000000
We see that the two experimenters together have estimated the output of the `margin_transform` operator. As they know both that output and the `margin_transform` operator itself, they can try to estimate the pre-image or input. This pre-image is the detailed distribution of data they are actually interested in.
Solving For the Full Joint Distribution
We use linear algebra to pull `estimated_margins` back through the inverse of `margin_transform` to get a linear estimate of the unobserved original data.
# typical solution (in the linear sense, signs not enforced)
# remember: estimated_margins = margin_transform %*% v
v <- solve(
  qr(margin_transform, LAPACK = TRUE),
  estimated_margins)

v
## P(X1=0, X2=0, Y=FALSE) P(X1=1, X2=0, Y=FALSE) P(X1=0, X2=1, Y=FALSE)
## 0.047964126 0.003797454 0.562089720
## P(X1=1, X2=1, Y=FALSE) P(X1=0, X2=0, Y=TRUE) P(X1=1, X2=0, Y=TRUE)
## 0.234814833 0.092035874 0.056202546
## P(X1=0, X2=1, Y=TRUE) P(X1=1, X2=1, Y=TRUE)
## -0.002089720 0.005185167
Note this estimate has negative entries, so it is not yet a sequence of valid frequencies or probabilities. We will correct this by adding elements that do not change the forward mapping under `margin_transform`. This means we need a linear algebra basis for `margin_transform`'s null space, which is computed as follows. The null space calculation is the systematic way of finding the blind spots of the linear transform, without requiring prior domain knowledge.
# our degree of freedom between solutions
ns <- MASS::Null(t(margin_transform)) # also uses QR decomposition, could combine

stopifnot( # abort render if this claim is not true
  ncol(ns) == 1
)

# ns is invariant under scaling, pick first coordinate to be 1 for presentation
ns <- ns / ns[[1]]

ns
## [1] 1 -1 -1 1 -1 1 1 -1
In our case the null space was one dimensional, i.e. spanned by a single vector. This means all valid solutions are of the form `v + z * ns` for scalars `z`. In fact all valid (non-negative) solutions lie in an interval of `z` values, and we can solve for this interval.
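A minimal sketch of how that interval, and hence the two extreme `recovered_distribution_*` columns shown in the table below, could be computed (which endpoint gets which label is just a presentation choice):

# feasible z keep every coordinate of v + z * ns non-negative
ratios <- -v / ns
z_low  <- max(ratios[ns > 0])  # binding constraints where ns is positive
z_high <- min(ratios[ns < 0])  # binding constraints where ns is negative
extreme_a <- v + z_low  * ns   # one end of the solution interval
extreme_b <- v + z_high * ns   # the other end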
Note, we have seen the direction we are varying (`ns`) before: it is `test_vec`!
The range of recovered solutions to the (unknown to either experimenter!) original data distribution details can be seen below as the `recovered_distribution_*` columns.
x1 | x2 | y | P(X1=x1, X2=x2, Y=y) | recovered_distribution_1 | recovered_distribution_2 |
---|---|---|---|---|---|
0 | 0 | FALSE | 0.0503403 | 0.0500538 | 0.0517616 |
1 | 0 | FALSE | 0.0014213 | 0.0017077 | 0.0000000 |
0 | 1 | FALSE | 0.5597136 | 0.5600000 | 0.5582923 |
1 | 1 | FALSE | 0.2371910 | 0.2369046 | 0.2386123 |
0 | 0 | TRUE | 0.0896597 | 0.0899462 | 0.0882384 |
1 | 0 | TRUE | 0.0585787 | 0.0582923 | 0.0600000 |
0 | 1 | TRUE | 0.0002864 | 0.0000000 | 0.0017077 |
1 | 1 | TRUE | 0.0028090 | 0.0030954 | 0.0013877 |
The actual solution is in the convex hull of the two extreme solutions. And the logistic regression is blind to changes in the `test_vec` direction (shown in the appendix). So we can recover the correct logistic regression coefficients from any of these solutions.
# soln_names: the recovered_distribution_* column names
for (soln_name in soln_names) {
  print(soln_name)
  soln_i <- suppressWarnings(
    glm(
      y ~ x1 + x2,
      data = detailed_frame,
      weights = detailed_frame[[soln_name]],
      family = binomial()
    )$coef
  )
  print(soln_i)
  stopifnot( # abort render if this claim is not true
    max(abs(correct_coef - soln_i)) < 1e-6)
}
## [1] "recovered_distribution_1"
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
## [1] "recovered_distribution_2"
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
We see that all recovered data distributions give the same correct estimates of the logistic regression coefficients.
Picking a point-estimate
The standard trick with an under-specified system is to add an objective. A great choice is: maximize the entropy of (or flatness of) the distribution we are solving for.
This works as follows.
entropy <- function(v) {
  v <- v[v > 0]
  if (length(v) < 2) {
    return(0)
  }
  v <- v / sum(v)
  -sum(v * log2(v))
}

# brute force solve for maximum entropy mix
# obviously this can be done a bit slicker
opt_soln <- optimize(
  function(z) {
    entropy(
      z * detailed_frame$recovered_distribution_1 +
      (1 - z) * detailed_frame$recovered_distribution_2)},
  c(0, 1),
  maximum = TRUE)

z_opt <- opt_soln$maximum

detailed_frame["maxent_distribution"] <- (
  z_opt * detailed_frame$recovered_distribution_1 +
  (1 - z_opt) * detailed_frame$recovered_distribution_2)
The recovered `maxent_distribution` obeys the additional non-linear check to a high degree.
log(detailed_frame[["maxent_distribution"]]) %*% test_vec
## [,1]
## [1,] 3.395224e-05
In fact, the recovered `maxent_distribution` matches the original unobserved `P(X1=x1, X2=x2, Y=y)` distribution to many digits.
x1 | x2 | y | P(X1=x1, X2=x2, Y=y) | maxent_distribution |
---|---|---|---|---|
0 | 0 | FALSE | 0.0503403 | 0.0503403 |
1 | 0 | FALSE | 0.0014213 | 0.0014213 |
0 | 1 | FALSE | 0.5597136 | 0.5597135 |
1 | 1 | FALSE | 0.2371910 | 0.2371910 |
0 | 0 | TRUE | 0.0896597 | 0.0896597 |
1 | 0 | TRUE | 0.0585787 | 0.0585787 |
0 | 1 | TRUE | 0.0002864 | 0.0002865 |
1 | 1 | TRUE | 0.0028090 | 0.0028090 |
And these are our estimated coefficients.
recovered_coef <- suppressWarnings(
  glm(
    y ~ x1 + x2,
    data = detailed_frame,
    weights = detailed_frame[["maxent_distribution"]],
    family = binomial()
  )$coef
)

recovered_coef
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
This matches the correct values (c0 = 0.5772, b1 = 3.1416, b2 = -8.1548). We have correctly inferred the actual coefficient values from the observed data, removing the omitted variable bias.
Why the Maximum Entropy Solution is So Good
Some calculus (in the appendix) shows that the entropy function for this problem is maximized exactly where the logarithm of the joint distribution is orthogonal to `ns`, that is, to `test_vec`. So the maximum entropy condition enforces the extra non-linear invariant we know must hold under our assumed problem structure.
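A quick numeric cross-check one could run (output omitted here): the blended `maxent_distribution` should have entropy at least as large as either extreme pre-image, using the `entropy()` function defined earlier.

# compare entropies of the maximum entropy blend and the two extremes
entropy(detailed_frame$maxent_distribution)
entropy(detailed_frame$recovered_distribution_1)
entropy(detailed_frame$recovered_distribution_2)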
The funny thing is, we don’t have to know exactly what the maximum entropy objective is doing to benefit from it; it tends to be a helpful objective in modeling. In practice we don’t usually derive `test_vec`, we just impose the maximum entropy objective and trust that it will help.
Conclusion
By pooling observations we can recover a good estimate of a joint analysis on data that was not available to us. The strategy is: try to estimate plausible pre-images of the data that formed the observations, and then analyze that. This gives us a method to invert the bias introduced by the omitted variables in logistic regression.
In machine learning the maximum entropy principle plays the role that the stationary-action principle plays in classical mechanics. While nature isn’t forced to put equal probabilities on different states, deterministic models must put equal probabilities on model-indistinguishable states. Maximum entropy pushes solutions toward such symmetries, unless there are variables to support differences. And maximum entropy modeling is closely related to logistic regression modeling.
There is, however, a danger. A naive over-reliance on the principle of indifference can lead to incorrect modeling. Nature may be able to distinguish between states that a given set of experimental variables cannot. Also, the general applicability of maximum entropy techniques isn’t an excuse to not look for problem-specific reasons why such an objective helps. This is what we did in this note when developing the non-linear orthogonality condition. That condition is a consequence of the logit-linear form that we, as the experimenters, imposed on the data. At some point we are observing the regularity of our assumptions, not of the original unobserved data.
In the real world we would at best be looking at marginalizations of different draws of related data. So we would not have exact matches we can invert; instead we would have to estimate low-discrepancy pre-images of the data. And, as we are now introducing a lot of unobserved parameters, we could go to Bayesian graphical model methods to sum this all out (instead of proposing a specific point-wise method as we did here).
We have some notes on how this method applies in a more general case here.
Thank you to Dr. Nina Zumel for help and comments.
Appendices
Appendix: The Relation to Entropy
The maximum likelihood solution to a logistic regression problem is equivalent to picking a parameterized distribution `q` close to the target distribution `p` by minimizing the cross entropy below.
- sum_i p_i * log(q_i)
When `q` gets close to `p` this looks a lot like the standard entropy below.
- sum_i p_i * log(p_i)
So we do expect entropy calculations to be relevant to logistic regression structure. We will back up this claim with detailed calculation.
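As a small illustration of the connection (a toy example with made-up distributions `p` and `q`, not from the article): by Gibbs' inequality the cross entropy is minimized over `q` exactly at `q = p`, where it equals the entropy of `p`.

# toy check of Gibbs' inequality: -sum(p * log(q)) >= -sum(p * log(p))
cross_entropy <- function(p, q) { -sum(p * log(q)) }
p <- c(0.2, 0.3, 0.5)
q <- c(0.25, 0.25, 0.5)
cross_entropy(p, q) >= cross_entropy(p, p)  # TRUE for any valid q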
Appendix: test_vec is an Orthogonal Test
To show `sum(test_vec * log(P(X1=x1, X2=x2, Y=y))) = 0` when `P(X1=x1, X2=x2, Y=y)` is the vector of row probabilities matching a logistic model, write `sum(test_vec * log(P(X1=x1, X2=x2, Y=y)))` as:
sum_{x1=0,1} sum_{x2=0,1} sum_{y=F,T} (
    (-1)^x1 * (-1)^x2 * (-1)^y * log(P(X1=x1, X2=x2) * P(Y=y | x1, x2)) )

  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^x1 * (-1)^x2 * (
        log(P(X1=x1, X2=x2) * (1 - 1 / (1 + exp(c0 + b1 * x1 + b2 * x2))))
      - log(P(X1=x1, X2=x2) * 1 / (1 + exp(c0 + b1 * x1 + b2 * x2))) ))

  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^x1 * (-1)^x2 * (
        log(P(X1=x1, X2=x2) * exp(c0 + b1 * x1 + b2 * x2) / (1 + exp(c0 + b1 * x1 + b2 * x2)))
      - log(P(X1=x1, X2=x2) * 1 / (1 + exp(c0 + b1 * x1 + b2 * x2))) ))

  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^x1 * (-1)^x2 * log(exp(c0 + b1 * x1 + b2 * x2)) )

  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^x1 * (-1)^x2 * (c0 + b1 * x1 + b2 * x2) )

  = 0
This establishes that `sum(test_vec * log(P(X1=x1, X2=x2, Y=y))) = 0` for any logistic regression solution, not just the optimal one. The condition holds for our data set, as we designed it to have the structure of a logistic regression. It also shows that logistic regression cannot tell `P(X1=x1, X2=x2, Y=y) + z * test_vec` from `P(X1=x1, X2=x2, Y=y)`, as it is blind to changes in that direction. This is why all our data pre-images yield the same logistic regression coefficients.
Appendix: the Entropy Gradient Goes to Zero at our Check Position
We can show the entropy gradient is zero at our check-gradient position. So, maximizing entropy picks the position where we meet our non-linear orthogonal check condition.
To establish this, consider the entropy function we are maximizing: `f(z) = -sum_i (p_i + z * test_vec_i) * log(p_i + z * test_vec_i)`. We expect our maximum to occur where `f(z)` has a zero derivative.
(d/dz) f(z) [evaluated at z = 0]
  = (d/dz) -sum_i (p_i + z * test_vec_i) * log(p_i + z * test_vec_i) [evaluated at z = 0]
  = -sum_i test_vec_i * (log(p_i + z * test_vec_i) + 1) [evaluated at z = 0]
  = -sum_i test_vec_i * (log(p_i) + 1)
  = -sum_i test_vec_i * log(p_i)   [using sum_i test_vec_i = 0]
And this is zero exactly where the non-linear orthogonal check condition is zero.
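One can also confirm this numerically with a central difference (a sketch reusing the `entropy()`, `test_vec`, and `maxent_distribution` objects defined above; `entropy()` uses `log2()`, which only rescales the gradient and does not move its zero):

# numeric check: the entropy is stationary at the maximum entropy blend
f <- function(z) {
  entropy(detailed_frame$maxent_distribution + z * test_vec)
}
(f(1e-6) - f(-1e-6)) / 2e-6  # approximately zero, up to the optimizer's tolerance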
Appendix: Links
The source code of this article is available here (plus render here).