Menu Home

Solving for Hidden Data

Introduction

Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression.

The issue is as follows. For logistic regression, omitted variables cause parameter estimation bias. This is true even for independent variables, which is not the case for more familiar linear regression.

This is a known problem with known mitigations:

  • Rhian Daniel, Jingjing Zhang, Daniel Farewell, “Making apples from oranges: Comparing noncollapsible effect estimators and their standard errors after adjustment for different covariate sets”, Biometrical Journal, (2020), DOI: 10.1002/bimj.201900297.
  • John M. Neuhaus, Nicholas P. Jewell, “A Geometric Approach to Assess Bias Due to Omitted Covariates in Generalized Linear Models”, Biometrika, Vol. 80, No. 4 (Dec. 1993), pp. 807-815.
  • Zhang, Zhiwei, “Estimating a Marginal Causal Odds Ratio Subject to Confounding”, Communications in Statistics – Theory and Methods, 38:3, (2009), 309-321.

(Thank you, Tom Palmer and Robert Horton for the references!)

For this note, let’s work out how to directly try and overcome the omitted variable bias by solving for the hidden or unobserved detailed data. We will work our example in R. We will derive some deep results out of a simple set-up. We show how to “un-marginalize” or “un-summarize” data.

Our Example

For an example let’s set up a logistic regression on two explanatory variables X1 and X2 . For simplicity we will take the case where X1 and X2 only take on the values 0 and 1.

Our data is then keyed by the values of these explanatory variables and the dependent or outcome variable Y, which takes on only the values FALSE and TRUE. The keying looks like the following.

x1 x2 y
0 0 FALSE
1 0 FALSE
0 1 FALSE
1 1 FALSE
0 0 TRUE
1 0 TRUE
0 1 TRUE
1 1 TRUE

Note: we are using upper case names for random variables and lower case names for coresponding values of these variables.

The Example Data

Let’s specify the joint probability distribution of our two explanatory variables. We choose them as independent with the following expected values.

# specify explanatory variable distribution 
`P(X1=1)` <- 0.3
`P(X2=1)` <- 0.8
`P(X1=0)` <- 1 - `P(X1=1)`
`P(X2=0)` <- 1 - `P(X2=1)`

Our data set can then be completely described by above explanatory variable distribution and the conditional probability of the dependent outcomes. For our logistic regression problem we set up our outcome conditioning as P(Y=TRUE) ~ sigmoid(c0 + b1 * x1 + b2 * x2). Our example coefficients are as follows.

# 0.5772
(c0 <- -digamma(1))
## [1] 0.5772157
# 3.1415
(b1 <- pi)
## [1] 3.141593
# 27.182
(b2 <- -3 * exp(1))
## [1] -8.154845

Please remember these coefficients in this order for later.

# show constants in an order will see again
c(c0, b1, b2)
## [1]  0.5772157  3.1415927 -8.1548455

Using the methodology of Replicating a Linear Model we can build an example data set that obeys the specified explanatory variable distribution and has specified outcome probabilities. This is just us building a data set matching an assumed known answer. Our data distribution is going to be determined by P(X1=1), P(X2=1), and P(Y=TRUE) ~ sigmoid(c0 + b1 * x1 + b2 * x2). Our inference task is to recover the parameters P(X1=1), P(X2=1), c0, b1, and b2 from data, even in the situation where observers have omitted variable issues.

The complete detailed data is generated as follows. The P(X1=x1, X2=c2, Y=y) column is what proportion of a data set drawn from this specified distribution matches the row keys x1, x2, y, or is the joint probability of a given row type. We can derive all the detailed probabilities as follows.

# get joint distribution of explanatory variables
detailed_frame["P(X1=x1, X2=x2)"] <- (
  ifelse(detailed_frame$x1 == 1, `P(X1=1)`, `P(X1=0)`)
  * ifelse(detailed_frame$x2 == 1, `P(X2=1)`, `P(X2=0)`)
)

# converting "links" to probabilities
sigmoid <- function(x) {1 / (1 + exp(-x))}

# get conditional probability of observed outcome
y_probability <- sigmoid(
  c0 + b1 * detailed_frame$x1 + b2 * detailed_frame$x2)

# record probability of observation
detailed_frame[["P(Y=y | X1=x1, X2=x2)"]] <- ifelse(
  detailed_frame$y, 
  y_probability, 
  1 - y_probability)

# compute joint explanatory plus outcome probability of each row
detailed_frame[["P(X1=x1, X2=x2, Y=y)"]] <- (
  detailed_frame[["P(X1=x1, X2=x2)"]] 
  * detailed_frame[["P(Y=y | X1=x1, X2=x2)"]])

The following table relates x1, x2, y value combinations to the P(X1=x1, X2=c2, Y=y) column (which shows how common each such row is).

x1 x2 y P(X1=x1, X2=x2) P(Y=y | X1=x1, X2=x2) P(X1=x1, X2=x2, Y=y)
0 0 FALSE 0.14 0.3595735 0.0503403
1 0 FALSE 0.06 0.0236881 0.0014213
0 1 FALSE 0.56 0.9994885 0.5597136
1 1 FALSE 0.24 0.9882958 0.2371910
0 0 TRUE 0.14 0.6404265 0.0896597
1 0 TRUE 0.06 0.9763119 0.0585787
0 1 TRUE 0.56 0.0005115 0.0002864
1 1 TRUE 0.24 0.0117042 0.0028090

For a logistic regression problem, the relation between X1, X2 and Y is encoded in the P(X1=x1, X2=c2, Y=y) distribution that gives the joint expected frequency of each possible data row in a drawn sample.

Inferring From Fully Observed Data

We can confirm this data set encodes the expected logistic relationship by recovering the coefficients through fitting.

# suppressWarnings() to avoid fractional data weight complaint
correct_coef <- suppressWarnings(
  glm(
    y ~ x1 + x2,
    data = detailed_frame,
    weights = detailed_frame[["P(X1=x1, X2=x2, Y=y)"]],
    family = binomial()
  )$coef
)

correct_coef
## (Intercept)          x1          x2 
##   0.5772157   3.1415927  -8.1548455

Notice we recover the c0 + b1 * detailed_frame$x1 + b2 * detailed_frame$x2 form.

The Nonlinear Invariant

There is an interesting non-linear invariant the P(X1=x1, X2=c2, Y=y) column obeys. We will use this invariant later, so it is worth establishing. The principle is: our solution disappears with respect to certain test-vectors, which will help us re-identify it later.

Consider the following test vector.

test_vec <- (
  (-1)^detailed_frame$x1 
  * (-1)^detailed_frame$x2 
  * (-1)^detailed_frame$y)

test_vec
## [1]  1 -1 -1  1 -1  1  1 -1

sum(test_vec * log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])) is always zero when detailed_frame[["P(X1=x1, X2=x2, Y=y)"]) is the row probabilities from a logistic model of the form we have been working with. Or log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]) is orthogonal to test_vec. We can confirm this in our case, and derive this in the appendix.

p_vec <- test_vec * log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])
stopifnot(  # abort render if claim is not true
  abs(sum(p_vec)) < 1e-8)

sum(p_vec)
## [1] -2.553513e-15

Roughly: this is one check that the data is consistent with the distributions a logistic regression with independent explanatory variables can produce.

The Problem

Now let’s get to our issue. Suppose we have two experimenters, each of which only observes one of the explanatory variables. As we saw in Omitted Variable Effects in Logistic Regression each of these experimenters will in fact estimate coefficients that are biased towards zero, due to the non-collapsibility of the modeling set up. This differs from linear regression, where for independent explanatory variables (as we have here) we would expect each experimenter to be able to get an unbiased estimate of the coefficient for the explanatory variable available to them!

The “Unobserved to Observed” Linear Mapping

Let’s build a linear operator that computes the margins the experimenters actually observe. We or the experimenters can specify this mapping, and its output. We just don’t (yet) have complete inforation on the pre-image of this mapping.

knitr::kable(margin_transform, format = "html") |>
  kableExtra::kable_styling(font_size = 10)
P(X1=0, X2=0, Y=FALSE) P(X1=1, X2=0, Y=FALSE) P(X1=0, X2=1, Y=FALSE) P(X1=1, X2=1, Y=FALSE) P(X1=0, X2=0, Y=TRUE) P(X1=1, X2=0, Y=TRUE) P(X1=0, X2=1, Y=TRUE) P(X1=1, X2=1, Y=TRUE)
P(X1=0, X2=*, Y=FALSE) 1 0 1 0 0 0 0 0
P(X1=1, X2=*, Y=FALSE) 0 1 0 1 0 0 0 0
P(X1=0, X2=*, Y=TRUE) 0 0 0 0 1 0 1 0
P(X1=1, X2=*, Y=TRUE) 0 0 0 0 0 1 0 1
P(X1=*, X2=0, Y=FALSE) 1 1 0 0 0 0 0 0
P(X1=*, X2=1, Y=FALSE) 0 0 1 1 0 0 0 0
P(X1=*, X2=0, Y=TRUE) 0 0 0 0 1 1 0 0
P(X1=*, X2=1, Y=TRUE) 0 0 0 0 0 0 1 1
P(X1=0, X2=0, Y=*) 1 0 0 0 1 0 0 0
P(X1=1, X2=0, Y=*) 0 1 0 0 0 1 0 0
P(X1=0, X2=1, Y=*) 0 0 1 0 0 0 1 0
P(X1=1, X2=1, Y=*) 0 0 0 1 0 0 0 1

The above matrix linearly maps our earlier P(X1=x1, X2=c2, Y=y) columns to various interesting roll-ups or aggregations. Or, it is 12 linear checks we expect our 8 unobserved distribution parameters to obey. Unfortunately the rank of this linear transform is only 7, so there is redundancy among the checks and the linear relations do not fully specify the unobserved distribution parameters. This is why we need additional criteria to drive our solution.

# apply the linear operator to compute marginalized observations
actual_margins <- margin_transform %*% detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]
x1 x2 y actual_margins
P(X1=0, X2=*, Y=FALSE) 0 * FALSE 0.6100538
P(X1=1, X2=*, Y=FALSE) 1 * FALSE 0.2386123
P(X1=0, X2=*, Y=TRUE) 0 * TRUE 0.0899462
P(X1=1, X2=*, Y=TRUE) 1 * TRUE 0.0613877
P(X1=*, X2=0, Y=FALSE) * 0 FALSE 0.0517616
P(X1=*, X2=1, Y=FALSE) * 1 FALSE 0.7969046
P(X1=*, X2=0, Y=TRUE) * 0 TRUE 0.1482384
P(X1=*, X2=1, Y=TRUE) * 1 TRUE 0.0030954
P(X1=0, X2=0, Y=*) 0 0 * 0.1400000
P(X1=1, X2=0, Y=*) 1 0 * 0.0600000
P(X1=0, X2=1, Y=*) 0 1 * 0.5600000
P(X1=1, X2=1, Y=*) 1 1 * 0.2400000

The above margin frame describes how the detailed experiment is marginalized or censored down to what different experimenters see. In our set-up experimenter 1 sees only the first four rows, and experimenter 2 sees only the next 4 rows. We consider the rest of the data “unobserved”.

A Blind Spot

We also note that margin_transform is blind to variation in the direction of our earlier test_vec. This can be confirmed as follows.

test_map <- margin_transform %*% test_vec

stopifnot(
  max(abs(test_map)) < 1e-8)

We knowlog(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]) is orthogonal to test_vec, but we don’t have an obvious linear relation between detailed_frame[["P(X1=x1, X2=x2, Y=y)"]) and test_vec.

Fortunately we can show (in an appendix) that the logistic regression is also blind in this direction, so all of the indistinguishable data pre-images give us the same logistic regression solution. Also, we can use a maximum entropy principle to correctly recover the single actual data distribution specified.

Experimenter 1’s view

Let’s see what happens when an experimenter tries to perform inference on their fraction of the data.

# select data available to d1
d1 <- margin_frame[
  margin_frame$x2 == asterisk_symbol, , drop = FALSE]

knitr::kable(d1)
x1 x2 y actual_margins
P(X1=0, X2=*, Y=FALSE) 0 * FALSE 0.6100538
P(X1=1, X2=*, Y=FALSE) 1 * FALSE 0.2386123
P(X1=0, X2=*, Y=TRUE) 0 * TRUE 0.0899462
P(X1=1, X2=*, Y=TRUE) 1 * TRUE 0.0613877
# solve from d1's point of view
d1_est <- suppressWarnings(
  glm(
    y ~ x1,
    data = d1,
    weights = d1$actual_margins,
    family = binomial()
  )$coef
)

d1_est
## (Intercept)          x1 
##  -1.9143360   0.5567057

Notice experimenter 1 got a much too small estimate of the X1 coefficient of 0.5567057, whereas the correct value is 3.1415927. From experimenter 1’s point of view, the effect of the omitted variable X2 is making X1 hard to correctly infer.

Experimenter 2’s view

Experimenter 2 has the following portion of data, which also is not enough to get an unbiased coefficient estimate.

# select data available to d2
d2 <- margin_frame[
  margin_frame$x1 == asterisk_symbol, , drop = FALSE]

knitr::kable(d2)
x1 x2 y actual_margins
P(X1=*, X2=0, Y=FALSE) * 0 FALSE 0.0517616
P(X1=*, X2=1, Y=FALSE) * 1 FALSE 0.7969046
P(X1=*, X2=0, Y=TRUE) * 0 TRUE 0.1482384
P(X1=*, X2=1, Y=TRUE) * 1 TRUE 0.0030954

Our Point

From the original data set’s point of view: both experimenters have wrong estimates of their respective coefficients. They do have correct estimates for their limited view of columns, but this is not what we are looking for when trying to infer causal effects. The question then is: if the experimenters pool their effort can they infer the correct coefficients?

A Solution Strategy

Each experimenter knows a lot about the data. They known the distribution of their explanatory variable, and even the joint distribution of their explanatory and the dependent or outcome data. Assuming the two explanatory variables are independent, the experimenters can cooperate to estimate the joint distribution of the explanatory variables. We will show how to use their combined observations to estimate the hidden data elements. This data can then be used for standard detailed analysis, like we showed on the original full data set.

This isn’t the first time we have proposed a “guess at the original data, as it wasn’t shared” as we played with this in Checking claims in published statistics papers.

Solution Steps

Our solutions strategy is as follows:

  • Estimate the joint distribution of X1 and X2 from the observed marginal distributions of X1 and X2 plus an assumption of independence.
  • Plug the above and other details in to the inverse of margin_transform to get a family of estimates of the original hidden data.
  • Use the maximum entropy principle to pick a distinguished pre-image as the least surprising.
  • Perform inference on this data to get coefficient estimates.

Note this strategy biases the data recovery to data sets that match our modeling assumptions. If the original data met our modeling assumptions this is in fact a useful inductive bias. If the original data did not match the modeling assumptions, then this will (unfortunately) hide issues.

Estimating the X1 and X2 joint distribution

Neither experimenter observed the following part of the marginal frame:

# show x1 x2 distribution poriton of margin_frame
dx <- margin_frame[
  margin_frame$y == asterisk_symbol, , drop = FALSE]

knitr::kable(dx)
x1 x2 y actual_margins
P(X1=0, X2=0, Y=*) 0 0 * 0.14
P(X1=1, X2=0, Y=*) 1 0 * 0.06
P(X1=0, X2=1, Y=*) 0 1 * 0.56
P(X1=1, X2=1, Y=*) 1 1 * 0.24

However, under the independence assumption they can estimate it from their pooled observations as follows.

# estimate x1 x2 distribution from d1 and d2
d1a <- aggregate(actual_margins ~ x1, data = d1, sum)
d2a <- aggregate(actual_margins ~ x2, data = d2, sum)
dxe <- merge(d1a, d2a, by = c())
dxe["estimated_margins"] <- (
  dxe$actual_margins.x * dxe$actual_margins.y)
dxe$actual_margins.x <- NULL
dxe$actual_margins.y <- NULL
dxe <- dxe[order(dxe$x2, dxe$x1), , drop = FALSE]

knitr::kable(dxe)
x1 x2 estimated_margins
0 0 0.14
1 0 0.06
0 1 0.56
1 1 0.24

Notice dxe is build only from dx1 and dx2 (plus the assumed independence of X1 and X2). At this point we have inferred the P(X1=x1, X2=x2) parameters from the observed data.

Combining Observations

We now combine all of our known data to get an estimate of the (unobserved) summaries produced by margin_transform.

# put together experimenter 1 and 2's joint estimate of marginal proportions
# from data they have in their sub-experiments.
estimated_margins <- c(
  d1$actual_margins,
  d2$actual_margins,
  dxe$estimated_margins
)

estimated_margins
##  [1] 0.610053847 0.238612287 0.089946153 0.061387713 0.051761580 0.796904554
##  [7] 0.148238420 0.003095446 0.140000000 0.060000000 0.560000000 0.240000000

We see that the two experimenters have estimated the output of the margin_frame transform. As they know the margin_frame output and the margin_frame operator itself, they can try to estimate the pre-image or input. This pre-image is the detailed distribution of data they are actually interested in.

Solving For the Full Joint Distribution

We use linear algebra to pull estimated_margins back through margin_transform inverse to get a linear estimate of the unobserved original data.

# typical solution (in the linear sense, signs not enforced)
# remember: estimated_margins = margin_transform %*% v
v <- solve(
  qr(margin_transform, LAPACK = TRUE), 
  estimated_margins)

v
## P(X1=0, X2=0, Y=FALSE) P(X1=1, X2=0, Y=FALSE) P(X1=0, X2=1, Y=FALSE) 
##            0.047964126            0.003797454            0.562089720 
## P(X1=1, X2=1, Y=FALSE)  P(X1=0, X2=0, Y=TRUE)  P(X1=1, X2=0, Y=TRUE) 
##            0.234814833            0.092035874            0.056202546 
##  P(X1=0, X2=1, Y=TRUE)  P(X1=1, X2=1, Y=TRUE) 
##           -0.002089720            0.005185167

Note this estimate has negative entries, so is not yet a sequence of valid frequencies or probabilities. We will correct this by adding elements that don’t change the forward mapping under margin_transform. This means we need a linear algebra basis for margin_transform‘s “null space.” This is gotten as follows. The null space calculation is the systematic way of finding blind-spots in the linear transform, without requiring prior domain knowledge.

# our degree of freedom between solutions
ns <- MASS::Null(t(margin_transform))  # also uses QR decomposition, could combine
stopifnot(  # abort render if this claim is not true
  ncol(ns) == 1
)
# ns is invariant under scaling, pick first coordinate to be 1 for presentation
ns <- ns / ns[[1]]

ns
## [1]  1 -1 -1  1 -1  1  1 -1

In our case the null space was one dimensional, or spanned by a single vector. This means all valid solutions are of the form v + z * ns for scalars z. In fact all solutions are in an interval of z values. We can solve for this interval.

Note, we have seen the direction we are varying (ns) before: it is test_vec!

The range of recovered solutions to the (unknown to either experimenter!) original data distribution details can be seen below as the recovered_distribution_* columns.

x1 x2 y P(X1=x1, X2=x2, Y=y) recovered_distribution_1 recovered_distribution_2
0 0 FALSE 0.0503403 0.0500538 0.0517616
1 0 FALSE 0.0014213 0.0017077 0.0000000
0 1 FALSE 0.5597136 0.5600000 0.5582923
1 1 FALSE 0.2371910 0.2369046 0.2386123
0 0 TRUE 0.0896597 0.0899462 0.0882384
1 0 TRUE 0.0585787 0.0582923 0.0600000
0 1 TRUE 0.0002864 0.0000000 0.0017077
1 1 TRUE 0.0028090 0.0030954 0.0013877

The actual solution is in the convex hull of the two extreme solutions. And the logistic regression is blind to changes in the test_vec direction (shown in appendix). So we can recover the correct logistic regression coefficients from any of these solutions.

for (soln_name in soln_names) {
  print(soln_name)
  suppressWarnings(
    soln_i <- glm(
      y ~ x1 + x2,
      data = detailed_frame,
      weights = detailed_frame[[soln_name]],
      family = binomial()
    )$coef
  )
  print(soln_i)
  stopifnot(  # abort render if this claim is not true
    max(abs(correct_coef - soln_i)) < 1e-6)
}
## [1] "recovered_distribution_1"
## (Intercept)          x1          x2 
##   0.5772157   3.1415927  -8.1548455 
## [1] "recovered_distribution_2"
## (Intercept)          x1          x2 
##   0.5772157   3.1415927  -8.1548455

We see, all recovered data distributions give the same correct estimates of the logistic regression coefficients.

Picking a point-estimate

The standard trick with an under-specified system is to add an objective. A great choice is: maximize the entropy of (or flatness of) the distribution we are solving for.

This works as follows.

entropy <- function(v) {
  v <- v[v > 0]
  if (length(v) < 2) {
    return(0)
  }
  v <- v / sum(v)
  -sum(v * log2(v))
}
# brute force solve for maximum entropy mix
# obviously this can be done a bit slicker
opt_soln <- optimize(
  function(z) {
    entropy(
      z * detailed_frame$recovered_distribution_1 + 
        (1 - z) * detailed_frame$recovered_distribution_2)},
  c(0, 1),
  maximum = TRUE)

z_opt <- opt_soln$maximum
detailed_frame["maxent_distribution"] <- (
  z_opt * detailed_frame$recovered_distribution_1 + 
    (1 - z_opt) * detailed_frame$recovered_distribution_2)

The recovered maxent_distribution obeys the additional non-linear check to a high degree.

log(detailed_frame[["maxent_distribution"]]) %*% test_vec
##              [,1]
## [1,] 3.395224e-05

In fact, the recovered maxent_distribution is the original unobserved original P(X1=x1, X2=x2, Y=y) to many digits.

x1 x2 y P(X1=x1, X2=x2, Y=y) maxent_distribution
0 0 FALSE 0.0503403 0.0503403
1 0 FALSE 0.0014213 0.0014213
0 1 FALSE 0.5597136 0.5597135
1 1 FALSE 0.2371910 0.2371910
0 0 TRUE 0.0896597 0.0896597
1 0 TRUE 0.0585787 0.0585787
0 1 TRUE 0.0002864 0.0002865
1 1 TRUE 0.0028090 0.0028090

And these are our estimated coefficients.

recovered_coef <- suppressWarnings(
  glm(
    y ~ x1 + x2,
    data = detailed_frame,
    weights = detailed_frame[["maxent_distribution"]],
    family = binomial()
  )$coef
)

recovered_coef
## (Intercept)          x1          x2 
##   0.5772157   3.1415927  -8.1548455

This matches the correct (c0=0.5772, b1=3.1416, b2=-8.1548). We have correctly inferred the actual coefficient values from the observed data. We have removed the bias.

Why the Maximum Entropy Solution is So Good

Some calculus (in appendix) shows that the entropy function for this problem is maximized where the logarithm of the joint distribution is orthogonal to ns or test_vec. So the maximum entropy condition will enforce the extra non-linear invariant we know from our assumed problem structure.

The funny thing is, we don’t have to know exactly what the maximum entropy objective was doing to actually benefit from it. It tends to be a helpful objective in modeling. In practice we don’t usually derive test_vec but just impose the maximum entropy objective and trust that it will help.

Conclusion

By pooling observations we can recover a good estimate of a joint analysis on data that was not available to us. The strategy is: try to estimate plausible pre-images of the data that formed the observations, and then analyze that. This gives us a method to invert the bias introduced by the omitted variables in logistic regression.

In machine learning the maximum entropy principle plays the role that the stationary-action principle action plays in classic mechanics. While nature isn’t forced to put equal probabilities on different states, deterministic models must put equal probabilities on model indistinguishable states. Maximum entropy pushes solutions to such symmetries, unless there are variables to support differences. And, maximum entropy modeling is very related to logistic regression modeling.

There is, however, a danger. A naive over-reliance on the principle of indifference can lead to incorrect modeling. Nature may be able to distinguish between states that a given set of experimental variables can not. Also, the general applicability of maximum entropy techniques isn’t an excuse to not look for problem specific reasons why such an objective helps. This is what we did in this note when developing the non-linear orthogonality condition. This condition is a consequence of the fact that the logit-linear form of the logistic regression we, as the experimenter, imposed on the data. At some point we are observing the regularity of our assumptions, not of the original unobserved data.

In the real world we would at best be looking at marginalizations of different draws of related data. So we would not have exact matches we can invert- but instead would have to estimate low-discrepancy pre-images of the data. And, as we are now introducing a lot of unobserved parameters, we could go to Bayesian graphical model methods to sum this all out (instead of proposing a specific point-wise method as we did here).

We have some notes on how this method applies in a more general case
here.

Thank you to Dr. Nina Zumel for help and comments.

Appendices

Appendix: The Relation to Entropy

The maximum likelihood solution to a logistic regression problem is equivalent picking a paramaterized distribution q close to the target distribution p by minimizing the cross entropy below.

  - sumi pi log qi

When q gets close to p this looks a lot like the standard entropy below.

  - sumi pi log pi

So we do expect entropy calculations to be relevant to logistic regression structure. We will back up this claim with detailed calculation.

Appendix: test_vec is an Orthogonal Test

To show sum(test_vec * log(P(X1=x1, X2=x2, Y=y)) = 0 when P(X1=x1, X2=x2, Y=y) is the row probabilities matching a logistic model, write sum(test_vec * log(P(X1=x1, X2=x2, Y=y))) as:

sumx1=0,1 sumx2=0,1 sumy=F,T (
   (-1)x1 * (-1)x2 * (-1)y 
     * log(P(X1=x1, X2=x2) * p(Y=y | x1, x2))
   )

 = sumx1=0,1 sumx2=0,1 (
    (-1)x1 * (-1)x2 * ( 
       log(P(X1=x1, X2=x2) * 
          (1 - 1 / (1 + exp(c0 + b1 * x1 + b2 * x2))))
      - log(P(X1=x1, X2=x2) * 
          1 / (1 + exp(c0 + b1 * x1 + b2 * x2)))
    ))
    

 = sumx1=0,1 sumx2=0,1 (
    (-1)x1 * (-1)x2 * ( 
        log(P(X1=x1, X2=x2) * 
           (exp(c0 + b1 * x1 + b2 * x2) / (1 + exp(c0 + b1 * x1 + b2 * x2))))
      - log(P(X1=x1, X2=x2) * 
           1 / (1 + exp(c0 + b1 * x1 + b2 * x2)))
    ))

 = sumx1=0,1 sumx2=0,1 (
    (-1)x1 * (-1)x2 * log(exp(c0 + b1 * x1 + b2 * x2)))

 = sumx1=0,1 sumx2=0,1 (
    (-1)x1 * (-1)x2 * (c0 + b1 * x1 + b2 * x2))

 = 0

This establishes that sum(test_vec * log(P(X1=x1, X2=x2, Y=y))) = 0 for any logistic regression solution, not just the optimal one. This condition is true for our data set, as we designed it to have the structure of a logistic regression. And this shows logistic regression can not tell P(X1=x1, X2=x2, Y=y) + z * test_vec from P(X1=x1, X2=x2, Y=y), as it is blind to changes in that direction. This is why all our data pre-images yield the same logistic regression coefficients.

Appendix: the Entropy Gradient Goes to Zero at our Check Position

We can show the entropy gradient is zero at our check-gradient position. So, maximizing entropy picks the position where we meet our non-linear orthogonal check condition.

To establish this, consider the entropy function we are maximizing f(z) = -sumi (pi + z * test_veci) log(pi + z * test_veci). We expect our maximum occurs where f(z) has a zero derivative.

(d / d z) f(z) [evaluated at z = 0]

 = (d / d z) -sumi (
    pi + z * test_veci) 
      * log(pi + z * test_veci) 
    [evaluated at z = 0]
 
 = -sumi test_veci (
    log(pi + z * test_veci) + 1) 
    [evaluated at z = 0]
 
 = -sumi test_veci (log(pi) + 1)
 
 = -sumi test_veci log(pi)  
    [using -sumi test_veci = 0]

And this is zero exactly where the non-linear orthogonal check condition is zero.

The source code of this article is available here (plus render here).

Categories: Exciting Techniques Tutorials

Tagged as:

John Mount

Leave a Reply

%d bloggers like this: