For an article on A/B testing that I am preparing, I asked my partner Dr. Nina Zumel if she could do me a favor and write some code to produce the diagrams. She prepared an excellent parameterized diagram generator. However, being the author of the book Practical Data Science with R, she built it in R using ggplot2. This would be great, except the A/B testing article is being developed in Python, as it targets programmers familiar with Python.
As the production of the diagrams is not part of the proposed article, I decided to use the rpy2 package to integrate the R diagrams directly into the new worksheet. Alternatively, I could translate her code into Python using one of: Seaborn objects, plotnine, ggpy, or others. The large number of options is evidence of how influential Leland Wilkinson’s grammar of graphics (gg) is.
Let’s try the rpy2 approach.
For our example we import our modules, including a small adaptor I wrote called r_tools.
# import our modules
import numpy as np
from IPython.display import Code, display, Image
from rpy2 import robjects
from r_tools import get_ggplot_fn_by_name
import pandas as pd
Now all we have to do to use her code is:
1) Source her .R file to load the function.
2) Get a reference to the diagram-producing function.
This is done as follows.
# read the .R file into the R interpreter environment
robjects.r("source('significance_power_visuals.R')")
# get a Python reference to the sig_pow_visuals R function
sig_pow_visuals = get_ggplot_fn_by_name("sig_pow_visuals")
Now we can use the diagram code. What the A/B testing diagram is, and how to pick the arguments, will be the content of our later article. The thing to notice now is: we generate the diagram from the R code, while working in Python.
# make the example diagram
n = 557
r = 0.1
t = 0.061576
power = 0.9
significance = 0.02
display(sig_pow_visuals(
    stdev=np.sqrt(0.5 / n),
    effect_size=r,
    threshold=t,
    title=f"(correct) A/B test with size={n}, decision threshold = {t:.4f} (vertical line)",
    subtitle=f"Notice, 1-power = {1-power:.2f} = green area, significance = {significance:.2f} = orange area"
))
For the above to work, one must have an installed R environment (with the appropriate packages) and a properly installed and configured rpy2.
The above example is a bit unusual, in that it is a plot that doesn’t take an incoming data frame as an argument. More often we will want to pass a Pandas data frame into R. This is also quite easy.
To see this, consider another .R file that defines the following R function.
# show the contents of a .R file
display(Code("plot_frame.R", language="R"))
library(ggplot2)

plt_frame <- function(d) {
  ggplot(data = d, mapping = aes(x = x, y = y)) +
    geom_point()
}
We can source this .R file and get a reference to the desired function as before.
robjects.r("source('plot_frame.R')")
plt_frame = get_ggplot_fn_by_name(
    "plt_frame",
    # ggplot2::ggsave() arguments
    width=3,
    height=2,
    units="in",
)
With the function reference in hand, we can now plot.
display(plt_frame(
    pd.DataFrame({
        "x": [1.0, 2.0, 3.0],
        "y": [1.0, -1.0, 2.0],
    })))
The adapter also adds some minimal function help (source file name and names of arguments).
help(plt_frame)
Help on function plt_frame in module r_tools:

plt_frame(*args, **kwargs) -> IPython.core.display.Image
    imported R function plt_frame() (assumed to return a ggplot)
    wrapped fn returns IPython.display.Image
    R source file: plot_frame.R
    R definition environment: <environment: R_GlobalEnv>
    R arguments: $d
And that is one method to use R graphing in Python or mixed-language data science projects.
All of the code in this example can be found here.
However, a common missing component remains: a general “Pythonic” data schema definition, documentation, and invariant enforcement mechanism.
It turns out it is quite simple to add such functionality using Python decorators. This isn’t particularly useful for generic functions (such as pd.merge()), where the function is supposed to support arbitrary data schemas. However, it can be very useful in adding checks and safety to specific applications and analysis workflows built on top of such generic functions. In fact, it is a good way to copy schema details from external data sources such as databases or CSV files into enforced application invariants. Application code that transforms fixed tables into expected exported results can benefit greatly from such schema documentation and enforcement.
I propose the following simple check criteria, applying to both function signatures and data frames, and to both inputs and outputs:
1) Declared data frame arguments and return values must contain at least the specified columns.
2) Specified columns must contain no non-null values of types other than those declared.
In this note I will demonstrate how to add such schema documentation and enforcement to Python functions working over data frames using Python decorators.
Let’s import our modules.
# import modules
from pprint import pprint
import numpy as np
import pandas as pd
import polars as pl
import data_algebra as da
from data_algebra.data_schema import SchemaCheckSwitch
These two covariant constraints are what we need to ensure we can write the operations over columns (which we need to know exist), and to not get unexpected results (from unexpected types). Instead of getting downstream signalling or non-signalling errors during column operations, we get useful exceptions on columns and values. This can be particularly useful for data science code near external data sources such as databases or CSV (comma separated values) files. Many of these sources themselves have data schemas and schema documentation that one can copy into the application.
We also want to be able to turn enforcement on or off across an entire code base easily. To do this we define an indirect importer called schema_check.py. Its code looks like the following:
from data_algebra.data_schema import SchemaCheckSwitch
# from data_algebra.data_schema import SchemaMock as SchemaCheck
from data_algebra.data_schema import SchemaRaises as SchemaCheck
SchemaCheckSwitch().on()
Isolating these lines in a shared import lets all other code switch behavior by only editing this file.
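Such a switch can be as small as a shared singleton flag. Below is a minimal sketch of the shape such a switch could take (my illustration, not data_algebra’s actual implementation):

```python
class SchemaCheckSwitch:
    """Process-wide on/off flag, shared by making the class a singleton."""
    _instance = None

    def __new__(cls):
        # every call returns the same instance, so all call sites
        # see the same enabled/disabled state
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.is_on = True
        return cls._instance

    def on(self):
        self.is_on = True

    def off(self):
        self.is_on = False
```

A checking decorator would consult SchemaCheckSwitch().is_on before doing any work, so flipping the switch in one module changes behavior everywhere.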
Let’s go ahead and import that code.
# use an indirect import, so entire package behavior
# can be changed globally all at once
import schema_check

# standard definition of a function
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d
Now we define the same function again, this time with a SchemaCheck decoration. The details of this decorator are documented here.
# same function definition, now with schema decorator
@schema_check.SchemaCheck({
        'a': int,
        'b': {int, float},
        'c': {'x': int},
    },
    return_spec={'z': float})
def fn(a, /, b, *, c, d=None):
    """doc"""
    return d
We are deliberately concentrating on data frames, and not on the inspection of arbitrary composite Python types. This is because we want to enforce data frame or table schemas, and not inflict an arbitrary runtime type system on Python. Schemas over tables of atomic types remain a sweet spot for data definitions.
Our decorator documentation declares that fn() expects at least:

1) a of type int.
2) b of type int or float.
3) c that is a data frame (implied by the dictionary argument), and that data frame contains a column x that has no non-null elements of type other than int.

It also declares that fn() returns a data frame with a column z that contains no non-null elements of type other than float.

This gives us some enforceable invariants that can improve our code.
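To make the checking idea concrete, here is a toy sketch of such a decorator. This is not the data_algebra implementation; the names (schema_check, _frame_issues, fn2) are hypothetical, and plain dicts of column lists stand in for data frames:

```python
import functools

def _frame_issues(name, frame, col_types):
    # a "frame" here is just a dict mapping column name -> list of values
    if not isinstance(frame, dict):
        return ["arg %s expected a frame, had %s" % (name, type(frame).__name__)]
    issues = []
    for col, typ in col_types.items():
        if col not in frame:
            issues.append("arg %s missing required column '%s'" % (name, col))
        elif any(v is not None and not isinstance(v, typ) for v in frame[col]):
            issues.append("arg %s column '%s' has wrongly typed entries" % (name, col))
    return issues

def schema_check(arg_specs):
    """Decorator: check keyword arguments against declared schemas."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapped(**kwargs):  # for simplicity, keyword arguments only
            issues = []
            for arg, spec in arg_specs.items():
                if arg not in kwargs:
                    issues.append("expected arg %s missing" % arg)
                elif isinstance(spec, dict):  # a data frame schema
                    issues.extend(_frame_issues(arg, kwargs[arg], spec))
                elif not isinstance(kwargs[arg], spec):  # a plain type spec
                    issues.append("arg %s has wrong type" % arg)
            if issues:
                raise TypeError(
                    "function %s(), issues: %s" % (fn.__name__, "; ".join(issues)))
            return fn(**kwargs)
        return wrapped
    return deco

@schema_check({'a': int, 'c': {'x': int}})
def fn2(a=None, c=None):
    return c

fn2(a=1, c={'x': [1, None, 3]})  # passes the checks
```

A call such as fn2(a=1, c={'z': [1]}) would raise a TypeError naming the missing column x, in the spirit of the examples that follow.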
We can see this repeated back in the decorator-altered help().
# show altered help text
help(fn)
Help on function fn in module __main__:

fn(a, /, b, *, c, d=None)
    arg specifications
    {'a': <class 'int'>,
     'b': {<class 'float'>, <class 'int'>},
     'c': {'x': <class 'int'>}}
    return specification:
    {'z': <class 'float'>}
    doc
Let’s see it catch an error. We show what happens if we call fn() with none of the expected arguments.
# catch schema mismatch
threw = False
try:
    fn()
except TypeError as e:
    print(e)
    threw = True
assert threw
function fn(), issues: expected arg a missing expected arg b missing expected arg c missing
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=3)
except TypeError as e:
    print(e)
    threw = True
assert threw
function fn(), issues: arg c expected a Pandas or Polars data frame, had int
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw
function fn(), issues: arg c missing required column 'x'
# catch schema mismatch
threw = False
try:
    fn(1, 2, c=pd.DataFrame({'x': [3.0]}))
except TypeError as e:
    print(e)
    threw = True
assert threw
function fn(), issues: arg c column 'x' expected type int, found type float
# catch schema mismatch
rv = None
threw = False
try:
    fn(
        1,
        2,
        c=pd.DataFrame({'x': [30], "z": [17.2]}),
        d=pd.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

# the return value is available for inspection
rv
fn() return value: missing required column 'z'

|   | q   |
|---|-----|
| 0 | 7.0 |

The offending return value is attached to the TypeError to help with diagnosis and debugging.
Again, these sorts of checks are not for generic utility methods (such as pd.merge()), which are designed to work over a large variety of schemas. However, they are very useful near client interfaces, APIs, and database tables. This technique and data algebra processing may naturally live near data sources. There is an under-appreciated design principle that package code should be generic, and application code should be specific (even in the same project).
Let’s show a successful call.
fn(
    1,
    b=2,
    c=pd.DataFrame({'x': [3]}),
    d=pd.DataFrame({'z': [7.0]}))

|   | z   |
|---|-----|
| 0 | 7.0 |
# turn off checking globally
SchemaCheckSwitch().off()

# show wrong return value is now allowed
fn(
    1,
    2,
    c=pd.DataFrame({'x': [30], "z": [17.2]}),
    d=pd.DataFrame({'q': [7.0]}))

|   | q   |
|---|-----|
| 0 | 7.0 |

The return value is missing the declared z column, but with checks off the function is not interfered with.
When checks are on, failures are detected much closer to their causes, making debugging and diagnosis much easier. Also, the decorations are an easy way to document, in human readable form, some basics of the expected input and output schemas.
And, the input and output schema are attached to the function as objects.
# show argument schema specifications
pprint(fn.data_schema.arg_specs)
{'a': <class 'int'>, 'b': {<class 'float'>, <class 'int'>}, 'c': {'x': <class 'int'>}}
# show return value schema
pprint(fn.data_schema.return_spec)
{'z': <class 'float'>}
A downside is that the technique can run into what I call “the first rule of meta-programming”: meta-programming only works as long as it doesn’t run into other meta-programming (also called the “it’s only funny when I do it” theorem). That being said, I feel these decorators can be very valuable in Python data science projects.
This documentation and demo can be found here.
# turn back on checking globally
SchemaCheckSwitch().on()
# failing example in Polars
threw = False
try:
    fn(1, 2, c=pl.DataFrame({'z': [7]}))
except TypeError as e:
    print(e)
    threw = True
assert threw
function fn(), issues: arg c missing required column 'x'
# failing example in Polars
rv = None
threw = False
try:
    fn(
        1,
        2,
        c=pl.DataFrame({'x': [30], "z": [17.2]}),
        d=pl.DataFrame({'q': [7.0]}))
except TypeError as e:
    print(e.args[0])
    rv = e.args[1]
    threw = True
assert threw

# the return value is available for inspection
rv
fn() return value: missing required column 'z'
shape: (1, 1)
q |
---|
f64 |
7.0 |
# good example in Polars
fn(
    1,
    b=2,
    c=pl.DataFrame({'x': [3]}),
    d=pl.DataFrame({'z': [7.0]}))
shape: (1, 1)
z |
---|
f64 |
7.0 |
The SchemaCheck decoration is a simple and effective tool to add schema documentation and enforcement to your analytics projects.
# show some relevant versions
pprint({
    'pd': pd.__version__,
    'pl': pl.__version__,
    'np': np.__version__,
    'da': da.__version__})
{'da': '1.6.10', 'np': '1.25.2', 'pd': '2.0.3', 'pl': '0.19.2'}
Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression.
The issue is as follows. For logistic regression, omitted variables cause parameter estimation bias. This is true even for independent variables, which is not the case for more familiar linear regression.
This is a known problem with known mitigations:
(Thank you, Tom Palmer and Robert Horton for the references!)
For this note, let’s work out how to directly try to overcome the omitted variable bias by solving for the hidden or unobserved detailed data. We will work our example in R. We will derive some deep results from a simple set-up. We show how to “un-marginalize” or “un-summarize” data.
For an example, let’s set up a logistic regression on two explanatory variables X1 and X2. For simplicity we will take the case where X1 and X2 only take on the values 0 and 1.
Our data is then keyed by the values of these explanatory variables and the dependent or outcome variable Y, which takes on only the values FALSE and TRUE. The keying looks like the following.
x1 | x2 | y |
---|---|---|
0 | 0 | FALSE |
1 | 0 | FALSE |
0 | 1 | FALSE |
1 | 1 | FALSE |
0 | 0 | TRUE |
1 | 0 | TRUE |
0 | 1 | TRUE |
1 | 1 | TRUE |
Note: we are using upper case names for random variables and lower case names for corresponding values of these variables.
Let’s specify the joint probability distribution of our two explanatory variables. We choose them as independent with the following expected values.
# specify explanatory variable distribution
`P(X1=1)` <- 0.3
`P(X2=1)` <- 0.8
`P(X1=0)` <- 1 - `P(X1=1)`
`P(X2=0)` <- 1 - `P(X2=1)`
Our data set can then be completely described by the above explanatory variable distribution and the conditional probability of the dependent outcomes. For our logistic regression problem we set up our outcome conditioning as P(Y=TRUE) ~ sigmoid(c0 + b1 * x1 + b2 * x2). Our example coefficients are as follows.
# 0.5772
(c0 <- -digamma(1))

## [1] 0.5772157

# 3.1415
(b1 <- pi)

## [1] 3.141593

# -3 * 2.7182
(b2 <- -3 * exp(1))

## [1] -8.154845
Please remember these coefficients in this order for later.
# show constants in an order we will see again
c(c0, b1, b2)
## [1] 0.5772157 3.1415927 -8.1548455
Using the methodology of Replicating a Linear Model, we can build an example data set that obeys the specified explanatory variable distribution and has the specified outcome probabilities. This is just us building a data set matching an assumed known answer. Our data distribution is determined by P(X1=1), P(X2=1), and P(Y=TRUE) ~ sigmoid(c0 + b1 * x1 + b2 * x2). Our inference task is to recover the parameters P(X1=1), P(X2=1), c0, b1, and b2 from data, even in the situation where observers have omitted variable issues.
The complete detailed data is generated as follows. The P(X1=x1, X2=x2, Y=y) column is the proportion of a data set drawn from this specified distribution matching the row keys x1, x2, y; that is, the joint probability of a given row type. We can derive all the detailed probabilities as follows.
# get joint distribution of explanatory variables
detailed_frame["P(X1=x1, X2=x2)"] <- (
  ifelse(detailed_frame$x1 == 1, `P(X1=1)`, `P(X1=0)`)
  * ifelse(detailed_frame$x2 == 1, `P(X2=1)`, `P(X2=0)`)
)
# converting "links" to probabilities
sigmoid <- function(x) {1 / (1 + exp(-x))}

# get conditional probability of observed outcome
y_probability <- sigmoid(
  c0 + b1 * detailed_frame$x1 + b2 * detailed_frame$x2)

# record probability of observation
detailed_frame[["P(Y=y | X1=x1, X2=x2)"]] <- ifelse(
  detailed_frame$y,
  y_probability, 1 - y_probability)

# compute joint explanatory plus outcome probability of each row
detailed_frame[["P(X1=x1, X2=x2, Y=y)"]] <- (
  detailed_frame[["P(X1=x1, X2=x2)"]]
  * detailed_frame[["P(Y=y | X1=x1, X2=x2)"]])
The following table relates x1, x2, y value combinations to the P(X1=x1, X2=x2, Y=y) column (which shows how common each such row is).
x1 | x2 | y | P(X1=x1, X2=x2) | P(Y=y | X1=x1, X2=x2) | P(X1=x1, X2=x2, Y=y) |
---|---|---|---|---|---|
0 | 0 | FALSE | 0.14 | 0.3595735 | 0.0503403 |
1 | 0 | FALSE | 0.06 | 0.0236881 | 0.0014213 |
0 | 1 | FALSE | 0.56 | 0.9994885 | 0.5597136 |
1 | 1 | FALSE | 0.24 | 0.9882958 | 0.2371910 |
0 | 0 | TRUE | 0.14 | 0.6404265 | 0.0896597 |
1 | 0 | TRUE | 0.06 | 0.9763119 | 0.0585787 |
0 | 1 | TRUE | 0.56 | 0.0005115 | 0.0002864 |
1 | 1 | TRUE | 0.24 | 0.0117042 | 0.0028090 |
For a logistic regression problem, the relation between X1, X2 and Y is encoded in the P(X1=x1, X2=x2, Y=y) distribution, which gives the joint expected frequency of each possible data row in a drawn sample.
We can confirm this data set encodes the expected logistic relationship by recovering the coefficients through fitting.
# suppressWarnings() to avoid fractional data weight complaint
correct_coef <- suppressWarnings(
  glm(
    y ~ x1 + x2,
    data = detailed_frame,
    weights = detailed_frame[["P(X1=x1, X2=x2, Y=y)"]],
    family = binomial()
  )$coef
)

correct_coef
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
Notice we recover the c0 + b1 * x1 + b2 * x2 form.
There is an interesting non-linear invariant the P(X1=x1, X2=x2, Y=y) column obeys. We will use this invariant later, so it is worth establishing now. The principle is: the logarithm of our solution vanishes against a certain test vector, which will help us re-identify the solution later.
Consider the following test vector.
test_vec <- (
  (-1)^detailed_frame$x1
  * (-1)^detailed_frame$x2
  * (-1)^detailed_frame$y)

test_vec
## [1] 1 -1 -1 1 -1 1 1 -1
sum(test_vec * log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])) is always zero when detailed_frame[["P(X1=x1, X2=x2, Y=y)"]] is the row probabilities from a logistic model of the form we have been working with. In other words, log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]) is orthogonal to test_vec. We can confirm this in our case, and derive it in the appendix.
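A quick sketch of why this holds (the appendix has the full derivation). Write $L(x_1, x_2) = c_0 + b_1 x_1 + b_2 x_2$ and $\sigma(z) = 1/(1+e^{-z})$, so each row probability factors as

```latex
\log P(X_1{=}x_1, X_2{=}x_2, Y{=}y)
  = \log P(X_1{=}x_1) + \log P(X_2{=}x_2) + \log P(Y{=}y \mid x_1, x_2) .
```

Weighting by $(-1)^{x_1 + x_2 + y}$ and summing over all eight rows, the first two terms vanish because $\sum_y (-1)^y = 0$. For the conditional term, $P(Y{=}\mathrm{TRUE} \mid x_1, x_2) = \sigma(L)$ and $P(Y{=}\mathrm{FALSE} \mid x_1, x_2) = \sigma(-L)$, and $\log \sigma(-L) - \log \sigma(L) = -L$, so what remains is

```latex
-\sum_{x_1, x_2} (-1)^{x_1 + x_2} \, L(x_1, x_2) = 0 ,
```

since the alternating sum over the $2 \times 2$ grid annihilates any affine form $c_0 + b_1 x_1 + b_2 x_2$.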
p_vec <- test_vec * log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]])

stopifnot( # abort render if claim is not true
  abs(sum(p_vec)) < 1e-8)

sum(p_vec)
## [1] -2.553513e-15
Roughly: this is one check that the data is consistent with the distributions a logistic regression with independent explanatory variables can produce.
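For readers following along in Python, this check can be reproduced with numpy. The sketch below rebuilds the eight row probabilities from the parameters stated above (it re-derives the values rather than reusing the R objects):

```python
import itertools
import numpy as np

# parameters as specified in the note
p_x1, p_x2 = 0.3, 0.8                           # P(X1=1), P(X2=1)
c0, b1, b2 = np.euler_gamma, np.pi, -3 * np.e   # -digamma(1) = Euler's gamma

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rows = list(itertools.product([0, 1], [0, 1], [False, True]))
probs = []
test_vec = []
for x1, x2, y in rows:
    # independent explanatory variables
    p_joint_x = (p_x1 if x1 else 1 - p_x1) * (p_x2 if x2 else 1 - p_x2)
    # conditional outcome probability from the logistic form
    p_y = sigmoid(c0 + b1 * x1 + b2 * x2)
    probs.append(p_joint_x * (p_y if y else 1 - p_y))
    test_vec.append((-1) ** x1 * (-1) ** x2 * (-1) ** y)
probs = np.array(probs)
test_vec = np.array(test_vec)

# the claimed non-linear invariant: test_vec is orthogonal to log(probs)
print(float(np.dot(test_vec, np.log(probs))))  # approximately 0
```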
Now let’s get to our issue. Suppose we have two experimenters, each of which only observes one of the explanatory variables. As we saw in Omitted Variable Effects in Logistic Regression each of these experimenters will in fact estimate coefficients that are biased towards zero, due to the non-collapsibility of the modeling set up. This differs from linear regression, where for independent explanatory variables (as we have here) we would expect each experimenter to be able to get an unbiased estimate of the coefficient for the explanatory variable available to them!
Let’s build a linear operator that computes the margins the experimenters actually observe. We or the experimenters can specify this mapping and its output. We just don’t (yet) have complete information on the pre-image of this mapping.
knitr::kable(margin_transform, format = "html") |>
  kableExtra::kable_styling(font_size = 10)
P(X1=0, X2=0, Y=FALSE) | P(X1=1, X2=0, Y=FALSE) | P(X1=0, X2=1, Y=FALSE) | P(X1=1, X2=1, Y=FALSE) | P(X1=0, X2=0, Y=TRUE) | P(X1=1, X2=0, Y=TRUE) | P(X1=0, X2=1, Y=TRUE) | P(X1=1, X2=1, Y=TRUE) | |
---|---|---|---|---|---|---|---|---|
P(X1=0, X2=*, Y=FALSE) | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
P(X1=1, X2=*, Y=FALSE) | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
P(X1=0, X2=*, Y=TRUE) | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
P(X1=1, X2=*, Y=TRUE) | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
P(X1=*, X2=0, Y=FALSE) | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
P(X1=*, X2=1, Y=FALSE) | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
P(X1=*, X2=0, Y=TRUE) | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
P(X1=*, X2=1, Y=TRUE) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
P(X1=0, X2=0, Y=*) | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
P(X1=1, X2=0, Y=*) | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
P(X1=0, X2=1, Y=*) | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
P(X1=1, X2=1, Y=*) | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
The above matrix linearly maps our earlier P(X1=x1, X2=x2, Y=y) column to various interesting roll-ups or aggregations. Or: it is 12 linear checks we expect our 8 unobserved distribution parameters to obey. Unfortunately the rank of this linear transform is only 7, so there is redundancy among the checks and the linear relations do not fully specify the unobserved distribution parameters. This is why we need additional criteria to drive our solution.
# apply the linear operator to compute marginalized observations
actual_margins <- margin_transform %*% detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=0, X2=*, Y=FALSE) | 0 | * | FALSE | 0.6100538 |
P(X1=1, X2=*, Y=FALSE) | 1 | * | FALSE | 0.2386123 |
P(X1=0, X2=*, Y=TRUE) | 0 | * | TRUE | 0.0899462 |
P(X1=1, X2=*, Y=TRUE) | 1 | * | TRUE | 0.0613877 |
P(X1=*, X2=0, Y=FALSE) | * | 0 | FALSE | 0.0517616 |
P(X1=*, X2=1, Y=FALSE) | * | 1 | FALSE | 0.7969046 |
P(X1=*, X2=0, Y=TRUE) | * | 0 | TRUE | 0.1482384 |
P(X1=*, X2=1, Y=TRUE) | * | 1 | TRUE | 0.0030954 |
P(X1=0, X2=0, Y=*) | 0 | 0 | * | 0.1400000 |
P(X1=1, X2=0, Y=*) | 1 | 0 | * | 0.0600000 |
P(X1=0, X2=1, Y=*) | 0 | 1 | * | 0.5600000 |
P(X1=1, X2=1, Y=*) | 1 | 1 | * | 0.2400000 |
The above margin frame describes how the detailed experiment is marginalized or censored down to what different experimenters see. In our set-up experimenter 1 sees only the first four rows, and experimenter 2 sees only the next 4 rows. We consider the rest of the data “unobserved”.
We also note that margin_transform is blind to variation in the direction of our earlier test_vec. This can be confirmed as follows.
test_map <- margin_transform %*% test_vec

stopifnot(
  max(abs(test_map)) < 1e-8)
We know log(detailed_frame[["P(X1=x1, X2=x2, Y=y)"]]) is orthogonal to test_vec, but we don’t have an obvious linear relation between detailed_frame[["P(X1=x1, X2=x2, Y=y)"]] and test_vec.
Fortunately we can show (in an appendix) that the logistic regression is also blind in this direction, so all of the indistinguishable data pre-images give us the same logistic regression solution. Also, we can use a maximum entropy principle to correctly recover the single actual data distribution specified.
Let’s see what happens when an experimenter tries to perform inference on their fraction of the data.
# select data available to d1
d1 <- margin_frame[
  margin_frame$x2 == asterisk_symbol, , drop = FALSE]

knitr::kable(d1)
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=0, X2=*, Y=FALSE) | 0 | * | FALSE | 0.6100538 |
P(X1=1, X2=*, Y=FALSE) | 1 | * | FALSE | 0.2386123 |
P(X1=0, X2=*, Y=TRUE) | 0 | * | TRUE | 0.0899462 |
P(X1=1, X2=*, Y=TRUE) | 1 | * | TRUE | 0.0613877 |
# solve from d1's point of view
d1_est <- suppressWarnings(
  glm(
    y ~ x1,
    data = d1,
    weights = d1$actual_margins,
    family = binomial()
  )$coef
)

d1_est
## (Intercept) x1
## -1.9143360 0.5567057
Notice experimenter 1 got a much too small estimate of the X1 coefficient: 0.5567057, whereas the correct value is 3.1415927. From experimenter 1’s point of view, the effect of the omitted variable X2 is making X1 hard to correctly infer.
Experimenter 2 has the following portion of data, which also is not enough to get an unbiased coefficient estimate.
# select data available to d2
d2 <- margin_frame[
  margin_frame$x1 == asterisk_symbol, , drop = FALSE]

knitr::kable(d2)
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=*, X2=0, Y=FALSE) | * | 0 | FALSE | 0.0517616 |
P(X1=*, X2=1, Y=FALSE) | * | 1 | FALSE | 0.7969046 |
P(X1=*, X2=0, Y=TRUE) | * | 0 | TRUE | 0.1482384 |
P(X1=*, X2=1, Y=TRUE) | * | 1 | TRUE | 0.0030954 |
From the original data set’s point of view: both experimenters have wrong estimates of their respective coefficients. They do have correct estimates for their limited view of columns, but this is not what we are looking for when trying to infer causal effects. The question then is: if the experimenters pool their effort can they infer the correct coefficients?
Each experimenter knows a lot about the data. They know the distribution of their explanatory variable, and even the joint distribution of their explanatory variable and the dependent or outcome data. Assuming the two explanatory variables are independent, the experimenters can cooperate to estimate the joint distribution of the explanatory variables. We will show how to use their combined observations to estimate the hidden data elements. This data can then be used for standard detailed analysis, like we showed on the original full data set.
This isn’t the first time we have proposed a “guess at the original data, as it wasn’t shared” as we played with this in Checking claims in published statistics papers.
Our solution strategy is as follows:

1) Estimate the joint distribution of X1 and X2 from the observed marginal distributions of X1 and X2, plus an assumption of independence.
2) Invert margin_transform to get a family of estimates of the original hidden data.

Note this strategy biases the data recovery to data sets that match our modeling assumptions. If the original data met our modeling assumptions this is in fact a useful inductive bias. If the original data did not match the modeling assumptions, then this will (unfortunately) hide issues.
First we estimate the X1 and X2 joint distribution. Neither experimenter observed the following part of the marginal frame:
# show x1 x2 distribution portion of margin_frame
dx <- margin_frame[
  margin_frame$y == asterisk_symbol, , drop = FALSE]

knitr::kable(dx)
x1 | x2 | y | actual_margins | |
---|---|---|---|---|
P(X1=0, X2=0, Y=*) | 0 | 0 | * | 0.14 |
P(X1=1, X2=0, Y=*) | 1 | 0 | * | 0.06 |
P(X1=0, X2=1, Y=*) | 0 | 1 | * | 0.56 |
P(X1=1, X2=1, Y=*) | 1 | 1 | * | 0.24 |
However, under the independence assumption they can estimate it from their pooled observations as follows.
# estimate x1 x2 distribution from d1 and d2
d1a <- aggregate(actual_margins ~ x1, data = d1, sum)
d2a <- aggregate(actual_margins ~ x2, data = d2, sum)
dxe <- merge(d1a, d2a, by = c())
dxe["estimated_margins"] <- (
  dxe$actual_margins.x * dxe$actual_margins.y)
dxe$actual_margins.x <- NULL
dxe$actual_margins.y <- NULL
dxe <- dxe[order(dxe$x2, dxe$x1), , drop = FALSE]

knitr::kable(dxe)
x1 | x2 | estimated_margins |
---|---|---|
0 | 0 | 0.14 |
1 | 0 | 0.06 |
0 | 1 | 0.56 |
1 | 1 | 0.24 |
Notice dxe is built only from aggregates of d1 and d2 (plus the assumed independence of X1 and X2). At this point we have inferred the P(X1=x1, X2=x2) parameters from the observed data.
We now combine all of our known data to get an estimate of the (unobserved) summaries produced by margin_transform.
# put together experimenter 1 and 2's joint estimate of marginal proportions
# from data they have in their sub-experiments.
estimated_margins <- c(
  d1$actual_margins,
  d2$actual_margins,
  dxe$estimated_margins
)

estimated_margins
## [1] 0.610053847 0.238612287 0.089946153 0.061387713 0.051761580 0.796904554
## [7] 0.148238420 0.003095446 0.140000000 0.060000000 0.560000000 0.240000000
We see that the two experimenters have estimated the output of the margin_transform operator. As they know both this output and the margin_transform operator itself, they can try to estimate the pre-image or input. This pre-image is the detailed distribution of data they are actually interested in.
We use linear algebra to pull estimated_margins back through the margin_transform inverse to get a linear estimate of the unobserved original data.
# typical solution (in the linear sense, signs not enforced)
# remember: estimated_margins = margin_transform %*% v
v <- solve(
  qr(margin_transform, LAPACK = TRUE),
  estimated_margins)

v
## P(X1=0, X2=0, Y=FALSE) P(X1=1, X2=0, Y=FALSE) P(X1=0, X2=1, Y=FALSE)
## 0.047964126 0.003797454 0.562089720
## P(X1=1, X2=1, Y=FALSE) P(X1=0, X2=0, Y=TRUE) P(X1=1, X2=0, Y=TRUE)
## 0.234814833 0.092035874 0.056202546
## P(X1=0, X2=1, Y=TRUE) P(X1=1, X2=1, Y=TRUE)
## -0.002089720 0.005185167
Note this estimate has negative entries, so it is not yet a sequence of valid frequencies or probabilities. We will correct this by adding elements that don’t change the forward mapping under margin_transform. This means we need a linear algebra basis for margin_transform’s “null space”, which is computed as follows. The null space calculation is the systematic way of finding blind spots in the linear transform, without requiring prior domain knowledge.
# our degree of freedom between solutions
ns <- MASS::Null(t(margin_transform))  # also uses QR decomposition, could combine

stopifnot( # abort render if this claim is not true
  ncol(ns) == 1
)

# ns is invariant under scaling, pick first coordinate to be 1 for presentation
ns <- ns / ns[[1]]

ns
## [1] 1 -1 -1 1 -1 1 1 -1
In our case the null space was one dimensional, or spanned by a single vector. This means all valid solutions are of the form v + z * ns for scalars z. In fact all solutions lie in an interval of z values. We can solve for this interval.
Note, we have seen the direction we are varying (ns) before: it is test_vec!
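The null space computation can also be reproduced in Python. MASS::Null uses a QR decomposition; the numpy sketch below uses the SVD, which recovers the same one-dimensional space (rebuilding the marginalization matrix so the sketch is self-contained):

```python
import itertools
import numpy as np

# rebuild the 12 x 8 marginalization matrix over (x1, x2, y) combinations
cols = list(itertools.product([0, 1], [0, 1], [0, 1]))
rows = []
for keep in [(0, 2), (1, 2), (0, 1)]:  # coordinates preserved by each margin block
    for key in itertools.product([0, 1], repeat=2):
        rows.append([int(all(c[i] == k for i, k in zip(keep, key))) for c in cols])
margin_transform = np.array(rows, dtype=float)

# right-singular vectors with (near) zero singular value span the null space
_, s, vt = np.linalg.svd(margin_transform)
ns = vt[s < 1e-10]
print(ns.shape[0])      # 1: the null space is one dimensional

# scale first coordinate to 1: recovers the +/-1 test_vec pattern
ns = ns[0] / ns[0, 0]
print(np.round(ns).astype(int))
```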
The range of recovered solutions to the (unknown to either experimenter!) original data distribution details can be seen below as the recovered_distribution_* columns.
x1 | x2 | y | P(X1=x1, X2=x2, Y=y) | recovered_distribution_1 | recovered_distribution_2 |
---|---|---|---|---|---|
0 | 0 | FALSE | 0.0503403 | 0.0500538 | 0.0517616 |
1 | 0 | FALSE | 0.0014213 | 0.0017077 | 0.0000000 |
0 | 1 | FALSE | 0.5597136 | 0.5600000 | 0.5582923 |
1 | 1 | FALSE | 0.2371910 | 0.2369046 | 0.2386123 |
0 | 0 | TRUE | 0.0896597 | 0.0899462 | 0.0882384 |
1 | 0 | TRUE | 0.0585787 | 0.0582923 | 0.0600000 |
0 | 1 | TRUE | 0.0002864 | 0.0000000 | 0.0017077 |
1 | 1 | TRUE | 0.0028090 | 0.0030954 | 0.0013877 |
The actual solution is in the convex hull of the two extreme solutions. And the logistic regression is blind to changes in the test_vec direction (shown in the appendix). So we can recover the correct logistic regression coefficients from any of these solutions.
for (soln_name in soln_names) {
  print(soln_name)
  suppressWarnings(
    soln_i <- glm(
      y ~ x1 + x2,
      data = detailed_frame,
      weights = detailed_frame[[soln_name]],
      family = binomial()
    )$coef
  )
  print(soln_i)
  stopifnot( # abort render if this claim is not true
    max(abs(correct_coef - soln_i)) < 1e-6)
}
## [1] "recovered_distribution_1"
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
## [1] "recovered_distribution_2"
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
We see, all recovered data distributions give the same correct estimates of the logistic regression coefficients.
The standard trick with an under-specified system is to add an objective. A great choice is: maximize the entropy of (or flatness of) the distribution we are solving for.
This works as follows.
entropy <- function(v) {
  v <- v[v > 0]
  if (length(v) < 2) {
    return(0)
  }
  v <- v / sum(v)
  -sum(v * log2(v))
}

# brute force solve for maximum entropy mix
# obviously this can be done a bit slicker
opt_soln <- optimize(
  function(z) {
    entropy(
      z * detailed_frame$recovered_distribution_1 +
      (1 - z) * detailed_frame$recovered_distribution_2)},
  c(0, 1),
  maximum = TRUE)
z_opt <- opt_soln$maximum
detailed_frame["maxent_distribution"] <- (
  z_opt * detailed_frame$recovered_distribution_1 +
  (1 - z_opt) * detailed_frame$recovered_distribution_2)
The recovered maxent_distribution obeys the additional non-linear check to a high degree.
log(detailed_frame[["maxent_distribution"]]) %*% test_vec
## [,1]
## [1,] 3.395224e-05
In fact, the recovered maxent_distribution matches the original unobserved P(X1=x1, X2=x2, Y=y) to many digits.
x1 | x2 | y | P(X1=x1, X2=x2, Y=y) | maxent_distribution |
---|---|---|---|---|
0 | 0 | FALSE | 0.0503403 | 0.0503403 |
1 | 0 | FALSE | 0.0014213 | 0.0014213 |
0 | 1 | FALSE | 0.5597136 | 0.5597135 |
1 | 1 | FALSE | 0.2371910 | 0.2371910 |
0 | 0 | TRUE | 0.0896597 | 0.0896597 |
1 | 0 | TRUE | 0.0585787 | 0.0585787 |
0 | 1 | TRUE | 0.0002864 | 0.0002865 |
1 | 1 | TRUE | 0.0028090 | 0.0028090 |
And these are our estimated coefficients.
recovered_coef <- suppressWarnings(
  glm(
    y ~ x1 + x2,
    data = detailed_frame,
    weights = detailed_frame[["maxent_distribution"]],
    family = binomial()
  )$coef
)

recovered_coef
## (Intercept) x1 x2
## 0.5772157 3.1415927 -8.1548455
This matches the correct (c0=0.5772, b1=3.1416, b2=-8.1548). We have correctly inferred the actual coefficient values from the observed data. We have removed the bias.
Some calculus (in appendix) shows that the entropy function for this problem is maximized where the logarithm of the joint distribution is orthogonal to ns
or test_vec
. So the maximum entropy condition will enforce the extra non-linear invariant we know from our assumed problem structure.
The funny thing is, we don’t have to know exactly what the maximum entropy objective was doing to actually benefit from it. It tends to be a helpful objective in modeling. In practice we don’t usually derive test_vec
but just impose the maximum entropy objective and trust that it will help.
By pooling observations we can recover a good estimate of a joint analysis on data that was not available to us. The strategy is: try to estimate plausible pre-images of the data that formed the observations, and then analyze that. This gives us a method to invert the bias introduced by the omitted variables in logistic regression.
In machine learning the maximum entropy principle plays the role that the stationary-action principle plays in classical mechanics. While nature isn’t forced to put equal probabilities on different states, deterministic models must put equal probabilities on model-indistinguishable states. Maximum entropy pushes solutions toward such symmetries, unless there are variables to support differences. And maximum entropy modeling is closely related to logistic regression modeling.
There is, however, a danger. A naive over-reliance on the principle of indifference can lead to incorrect modeling. Nature may be able to distinguish between states that a given set of experimental variables can not. Also, the general applicability of maximum entropy techniques isn’t an excuse to not look for problem-specific reasons why such an objective helps. This is what we did in this note when developing the non-linear orthogonality condition. This condition is a consequence of the logit-linear form of the logistic regression that we, as the experimenter, imposed on the data. At some point we are observing the regularity of our assumptions, not of the original unobserved data.
In the real world we would at best be looking at marginalizations of different draws of related data. So we would not have exact matches we can invert, but instead would have to estimate low-discrepancy pre-images of the data. And, as we are now introducing a lot of unobserved parameters, we could go to Bayesian graphical model methods to sum this all out (instead of proposing a specific point-wise method as we did here).
We have some notes on how this method applies in a more general case
here.
Thank you to Dr. Nina Zumel for help and comments.
The maximum likelihood solution to a logistic regression problem is equivalent to picking a parameterized distribution q
close to the target distribution p
by minimizing the cross entropy below.
- sum_{i} p_{i} log q_{i}
When q
gets close to p
this looks a lot like the standard entropy below.
- sum_{i} p_{i} log p_{i}
So we do expect entropy calculations to be relevant to logistic regression structure. We will back up this claim with detailed calculation.
test_vec is an Orthogonal Test

To show sum(test_vec * log(P(X1=x1, X2=x2, Y=y))) = 0
when P(X1=x1, X2=x2, Y=y)
is the row probabilities matching a logistic model, write sum(test_vec * log(P(X1=x1, X2=x2, Y=y)))
as:
sum_{x1=0,1} sum_{x2=0,1} sum_{y=F,T} ( (-1)^{x1} * (-1)^{x2} * (-1)^{y} * log(P(X1=x1, X2=x2) * P(Y=y | x1, x2)) )
  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^{x1} * (-1)^{x2} * ( log(P(X1=x1, X2=x2) * (1 - 1 / (1 + exp(c0 + b1 * x1 + b2 * x2)))) - log(P(X1=x1, X2=x2) * 1 / (1 + exp(c0 + b1 * x1 + b2 * x2))) ))
  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^{x1} * (-1)^{x2} * ( log(P(X1=x1, X2=x2) * (exp(c0 + b1 * x1 + b2 * x2) / (1 + exp(c0 + b1 * x1 + b2 * x2)))) - log(P(X1=x1, X2=x2) * 1 / (1 + exp(c0 + b1 * x1 + b2 * x2))) ))
  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^{x1} * (-1)^{x2} * log(exp(c0 + b1 * x1 + b2 * x2)) )
  = sum_{x1=0,1} sum_{x2=0,1} ( (-1)^{x1} * (-1)^{x2} * (c0 + b1 * x1 + b2 * x2) )
  = 0
This establishes that sum(test_vec * log(P(X1=x1, X2=x2, Y=y))) = 0
for any logistic regression solution, not just the optimal one. This condition is true for our data set, as we designed it to have the structure of a logistic regression. And this shows logistic regression can not tell P(X1=x1, X2=x2, Y=y) + z * test_vec
from P(X1=x1, X2=x2, Y=y)
, as it is blind to changes in that direction. This is why all our data pre-images yield the same logistic regression coefficients.
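As a quick numeric illustration of this invariant (a sketch of mine, not the article's code; the coefficients and the marginal weights below are arbitrary), any joint distribution of logistic form is orthogonal to test_vec in log space:

```python
# Check: test_vec . log(P) = 0 for an arbitrary logistic-form joint.
import numpy as np

rng = np.random.default_rng(2023)
c0, b1, b2 = 0.7, -1.3, 2.1          # arbitrary logistic coefficients
p_x = rng.random(4)                  # arbitrary P(X1=x1, X2=x2)
p_x = p_x / p_x.sum()

joint = []
test_vec = []
for i, (x1, x2) in enumerate([(0, 0), (1, 0), (0, 1), (1, 1)]):
    p_true = 1.0 / (1.0 + np.exp(-(c0 + b1 * x1 + b2 * x2)))
    for y, p_y in [(1, p_true), (0, 1.0 - p_true)]:
        joint.append(p_x[i] * p_y)               # P(X1=x1, X2=x2, Y=y)
        test_vec.append((-1.0) ** (x1 + x2 + y))  # one sign convention

check = float(np.dot(test_vec, np.log(joint)))
print(check)  # essentially zero
```

The cancellation does not depend on the particular coefficients or on the X marginal, only on the logit-linear form.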
We can show the entropy gradient is zero at our check-gradient position. So, maximizing entropy picks the position where we meet our non-linear orthogonal check condition.
To establish this, consider the entropy function we are maximizing f(z) = -sum_{i} (p_{i} + z * test_vec_{i}) log(p_{i} + z * test_vec_{i})
. We expect our maximum occurs where f(z)
has a zero derivative.
(d / d z) f(z) [evaluated at z = 0]
  = (d / d z) -sum_{i} (p_{i} + z * test_vec_{i}) * log(p_{i} + z * test_vec_{i}) [evaluated at z = 0]
  = -sum_{i} test_vec_{i} * (log(p_{i} + z * test_vec_{i}) + 1) [evaluated at z = 0]
  = -sum_{i} test_vec_{i} * (log(p_{i}) + 1)
  = -sum_{i} test_vec_{i} * log(p_{i})   [using sum_{i} test_vec_{i} = 0]
And this is zero exactly where the non-linear orthogonal check condition is zero.
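A finite-difference sketch (with an arbitrary distribution p and a test_vec-like direction t that sums to zero, both made up for the demonstration) agrees with this derivative formula:

```python
# Check: d/dz of -sum (p + z*t) log(p + z*t) at z = 0 equals -sum t*log(p)
# whenever sum(t) = 0.
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])     # an arbitrary distribution
t = np.array([1.0, -1.0, -1.0, 1.0])   # sums to zero, like test_vec

def f(z):
    v = p + z * t
    return -np.sum(v * np.log(v))

h = 1e-6
numeric = (f(h) - f(-h)) / (2 * h)     # central difference at z = 0
analytic = -np.sum(t * np.log(p))
print(numeric, analytic)  # the two values agree
```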
The source code of this article is available here (plus render here).
Given a set of data in R^{n}, the goal of PCA is to find the best projection of that data into R^{k} (where k < n): that is, the projection into R^{k} that preserves as much of the distance information between the original points as it can. The assumption is generally that even though the data is described in n dimensions, it “really lives” in a smaller k-dimensional hyperplane, and any variation in the other n-k dimensions is just noise.
Mathematically, you can think of a dataset X in R^{n} (where each row of X is an n-dimensional datum) as being roughly described by an ellipsoid in R^{n} that is formed by the matrix X^{T}X.^{1} PCA finds the axes of this ellipsoid, sorted by their radii (longest first); rotates the ellipsoid to be axis-aligned (so that the longest axis is now the x axis); then “flattens” (projects) the data down to the hyperplane described by the first k axes.
The trick, of course, is to find the right k: that is, the k dimensions that capture all the important information in the data.
A sphering transformation (to be precise, the sphering transformation called PCA whitening) also finds this hyperellipsoid of X and rotates it to be axis aligned. But instead of projecting the ellipsoid down to a lower dimensional space, sphering instead “reshapes” the ellipsoid into the unit sphere.^{2} This reshaping tends to shrink the directions the data already has a lot of variation in (the long axes of the ellipsoid), and stretch the directions where the data does not vary much (the short axes of the ellipsoid), with the result that the expected squared norm of a transformed datum is one. You can think of this stretching/shrinking of an axis x_{i} as being proportional to 1/sqrt(s_{i}), where s_{i} is the ith singular value of X^{T}X (the radius of the ith axis).^{3}
Rather than another drawing, let’s show an example. The full code for this example can be found here.
# build some example data
def generate_ellipse(n_rows: int, mix: float = 1e-2):
    # build some example data.
    # mostly varies on the line x=y, with a small perpendicular component
    v1 = rng.normal(size=n_rows)
    v2 = rng.normal(size=n_rows)
    d = pd.DataFrame({
        'x': v1,
        'y': v1 + mix * v2,
    })
    return d

n_rows = 200
d_train = generate_ellipse(n_rows)
The data lies mostly along the line x=y, with a tiny bit of variation in the perpendicular direction – a really skinny ellipsoid.

Our sphering transform code can be found here.
# Our function to fit a sphering transform.
st = SpheringTransform()
st.fit(d_train)

# transform the training data
xformed_train = st.transform(d_train)
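The SpheringTransform implementation itself is linked rather than reproduced in this article. For orientation, a minimal PCA-whitening sketch with a compatible fit/transform interface (my assumption about the interface, not the article's actual code) could look like:

```python
import numpy as np

class SpheringSketch:
    """Minimal PCA whitening: rotate to the data ellipsoid's axes,
    then rescale axis i by 1/sqrt(s_i) so the result is round."""

    def __init__(self, eps: float = 1e-8):
        self.eps = eps  # tiny ridge regularization for rank-deficient data

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)       # sphering requires centered data
        Xc = X - self.mean_
        cov = Xc.T @ Xc / X.shape[0]
        s, U = np.linalg.eigh(cov + self.eps * np.eye(cov.shape[1]))
        self.W_ = U @ np.diag(1.0 / np.sqrt(s))  # rotate, then rescale
        return self

    def transform(self, X):
        return (np.asarray(X, dtype=float) - self.mean_) @ self.W_

# demo on a skinny ellipse like the article's example
rng = np.random.default_rng(0)
v1 = rng.normal(size=2000)
v2 = rng.normal(size=2000)
X = np.column_stack([v1, v1 + 0.01 * v2])
Z = SpheringSketch().fit(X).transform(X)
print(Z.T @ Z / Z.shape[0])  # approximately the identity matrix
```

In this standard variant, the whitened training data has (approximately) identity covariance, which is one common convention for "reshaping the ellipsoid into the unit sphere."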
Why do we want to sphere-transform our data? One reason is that transforming the data can make it easier to detect whether a new set of data, W, has the same distribution as X. The sphering transform fixes issues of units and linearly correlated variables. It “sharpens” our statistical view on which directions of variation are common, and which are rare.
Let’s call X our reference dataset. We can learn a sphering transform from X, then apply that transform to X, as we did above. Let’s call the transformed data set X_{T}. We can then get the distribution of the norms of the datums x_{T} in X_{T}. Let’s call that distribution L_{X}.
xformed_train_norms = norm(xformed_train, axis=1)
If we transform a new data set W using the sphere-transform we learned from X, and W was drawn from the same distribution as X, then L_{W} should be the same as L_{X}.
# data generated from the same distribution as the training data
d_test = generate_ellipse(n_rows)
xformed_test = st.transform(d_test)
xformed_test_norms = norm(xformed_test, axis=1)
But if W was drawn from a different distribution, one that varies more in directions where X does not, then the norms of w_{T} will tend to be longer than those of x_{T}. And if W was drawn from a distribution that varies less in directions where X varied widely, then the norms of w_{T} will tend to be shorter than those of x_{T}. Either way, L_{W} will be different from L_{X}.
Let’s see an example of this. Here we’ll generate a data set that is still mostly aligned to the x=y
axis, but has a larger perpendicular component.
# the new data is still mostly aligned to x=y,
# but has a larger perpendicular component
d_test_different = generate_ellipse(n_rows, mix=0.1)

# transform the new data, and get the norm distribution
xformed_test_different = st.transform(d_test_different)
xformed_test_different_norms = norm(xformed_test_different, axis=1)
We can look at the distributions of the data norms in the original (not transformed) space. The distributions don’t seem that different.
However, the sphering transform highlights differences.
Even visually, the difference between the distributions L_{X} and L_{W} is more striking than the difference between the distributions (scatterplots) of the datasets X and W. In other words, we’ve turned the fairly hairy problem of detecting the differences of multivariate distributions (or multivariate distribution drift) into the much simpler problem of detecting univariate distribution drift.
Quantitatively, there are a variety of ways to measure the difference of two univariate distributions. Common measures include the Kolmogorov-Smirnov test, Kullback-Leibler divergence, Jensen-Shannon divergence, and the Population Stability Index. You can pick the measure that is most appropriate for your specific problem.
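As one concrete example, here is a hand-rolled numpy two-sample Kolmogorov-Smirnov statistic applied to stand-in half-normal samples (illustrative data of mine, not the article's norm distributions):

```python
import numpy as np

def ks_statistic(a, b):
    # two-sample Kolmogorov-Smirnov statistic:
    # the largest gap between the two empirical CDFs
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(42)
ref_norms = np.abs(rng.normal(size=500))             # stand-in for L_X
same_norms = np.abs(rng.normal(size=500))            # same distribution
new_norms = np.abs(rng.normal(scale=3.0, size=500))  # drifted distribution

print(ks_statistic(ref_norms, same_norms))  # small: same distribution
print(ks_statistic(ref_norms, new_norms))   # large: drifted
```

In practice one would use a packaged implementation (for example `scipy.stats.ks_2samp`), which also supplies a p-value.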
If you are interested in exploring sphering transformations for yourself, you can find the code that we used for this article at our GitHub repository. The repo includes:
sphering_transform.py: the module that implements the transform.

For fun, we’ve also attached some simple example applications of the sphering transform.
I also want to mention another PCA-based approach to detecting differences in multivariate distributions: reconstruction error. This is the approach taken by nannyML’s multivariate drift detector. The article I’ve linked to gives a more detailed explanation, but essentially the method uses PCA to project the data down to its k-dimensional “signal” hyperplane, then projects the transformed data back into the full n dimensions. The reconstruction error is the difference (or the norms of the difference vectors) between the original datums and their reconstructions.
By learning the PCA transform on a reference set, one can then compare the distribution of the reconstruction error on the reference data to the reconstruction error on new data.
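A hedged numpy sketch of that idea (a generic PCA round-trip of my own, not nannyML's actual implementation):

```python
import numpy as np

def fit_pca(X, k):
    mean = X.mean(axis=0)
    # principal axes from the SVD of the centered data
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def reconstruction_error(X, mean, components):
    Z = (X - mean) @ components.T   # project down to k dimensions
    X_hat = Z @ components + mean   # lift back to the full n dimensions
    return np.linalg.norm(X - X_hat, axis=1)

rng = np.random.default_rng(7)
v1 = rng.normal(size=300)
ref = np.column_stack([v1, v1 + 0.01 * rng.normal(size=300)])
mean, comps = fit_pca(ref, k=1)

v2 = rng.normal(size=300)
drifted = np.column_stack([v2, v2 + 0.5 * rng.normal(size=300)])

print(reconstruction_error(ref, mean, comps).mean())      # small
print(reconstruction_error(drifted, mean, comps).mean())  # much larger
```

Data that varies in directions the reference data did not is poorly captured by the k retained components, so its reconstruction error is visibly larger.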
The sphering transform is a useful tool for the data scientist, especially when working on drift detection.
The analysis requires X to be centered at the origin. Our code makes sure to do this.↩
Keeping all the singular values may seem dangerous. However, at worst we are just asking the software to build bases for the column space and complementary null space.↩
The above discussion assumes that X^{T}X is full rank. If it is not, we can make it full rank by adding a tiny copy of the identity matrix to it. This regularization will fuzz the transformation a little bit, but preserves its most important properties. Again, our code makes sure to take this step.↩
I would like to illustrate a way in which omitted variables interfere with logistic regression inference (or coefficient estimation). These effects are different than what is seen in linear regression, and possibly different than some expectations or intuitions.
Let’s start with a data example in R
.
x_frame
is a data.frame
with a single variable called x
, and an example weight or row weight called wt
.
omitted_frame
is a data.frame
with a single variable called omitted
, and an example weight called wt
.
For our first example we take the cross-product of these data frames to get every combination of variable values, and their relative proportions (or weights) in the joined data frame.
# combine frames by cross product, and get new relative data weights
d <- merge(
x_frame,
omitted_frame,
by = c())
d$wt = d$wt.x * d$wt.y
d$wt <- d$wt / sum(d$wt)
d$wt.x <- NULL
d$wt.y <- NULL
x | omitted | wt |
---|---|---|
-2 | -1 | 0.25 |
1 | -1 | 0.25 |
-2 | 1 | 0.25 |
1 | 1 | 0.25 |
The idea is: d
is specifying what proportion of an arbitrarily large data set (with repeated rows) has each possible combination of values. For us, d
is not a sample; it is an entire population. This is just a long-winded way of explaining why we have row weights and why we are not concerned with observation counts, uncertainty bars, or significances/p-values for this example.
Let’s define a few common constants: Euler's constant
, pi
, and e
.
## [1] 0.5772157
## [1] 3.141593
## [1] 2.718282
Please remember these constants in this order for later.
## [1] 0.5772157 3.1415927 2.7182818
For our example we call our outcome (or dependent variable) y_linear
. We say that it is exactly the following linear combination of a constant plus the variables x
and omitted
.
# assign an example outcome or dependent variable
d$y_linear <- Euler_constant + pi * d$x + e * d$omitted
x | omitted | wt | y_linear |
---|---|---|---|
-2 | -1 | 0.25 | -8.424252 |
1 | -1 | 0.25 | 1.000527 |
-2 | 1 | 0.25 | -2.987688 |
1 | 1 | 0.25 | 6.437090 |
As we expect, linear regression can recover the constants of the linear equation from data.
## (Intercept) x omitted
## 0.5772157 3.1415927 2.7182818
Notice the recovered coefficients are the three constants we specified.
This is nice, and as expected.
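For readers cross-checking in Python (a hedged translation of mine; the R fit above is the article's actual code), weighted least squares on the four-row population recovers the same three constants:

```python
import numpy as np

euler = 0.5772156649015329              # Euler's constant (gamma)
x = np.array([-2.0, 1.0, -2.0, 1.0])
omitted = np.array([-1.0, -1.0, 1.0, 1.0])
wt = np.full(4, 0.25)                   # uniform row weights
y = euler + np.pi * x + np.e * omitted  # the exact linear outcome

# weighted least squares: solve (A^T W A) beta = (A^T W y)
A = np.column_stack([np.ones(4), x, omitted])
W = np.diag(wt)
coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
print(coef)  # [0.5772..., 3.1415..., 2.7182...]
```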
Now we ask: what happens if we omit from the model the variable named “omitted
”? This is a central problem in modeling. We are unlikely to know, or be able to measure, all possible explanatory variables in many real world settings. We are often omitting variables, as we don’t know about them or have access to their values!
For this linear regression model, we do not expect omitted variable bias as the variables x
and omitted
, by design, are fully statistically independent.
We can confirm omitted
is nice, in that it is mean-0
and has zero correlation with x
under the specified data distribution.
## [1] 0
x | omitted | |
---|---|---|
x | 3 | 0.000000 |
omitted | 0 | 1.333333 |
All of this worrying pays off. If we fit a model with the omitted
variable left out, we still get the original estimates of the x
-coefficient and the intercept.
## (Intercept) x
## 0.5772157 3.1415927
Let’s convert this problem to modeling the probability distribution of a new outcome variable, called y_observed
that takes on the values TRUE
and FALSE
. We use the encoding strategy from “replicate linear models” (which can simplify steps in many data science projects). How this example arises isn’t critical, we want to investigate the properties of this resulting data. So let’s take a moment and derive our data.
# converting "links" to probabilities
sigmoid <- function(x) {1 / (1 + exp(-x))}
d$y_probability <- sigmoid(d$y_linear)
# encoding effect as a probability model over a binary outcome
# method used for model replication
# ref: https://win-vector.com/2019/07/03/replicating-a-linear-model/
d_plus <- d
d_plus$y_observed <- TRUE
d_plus$wt <- d_plus$wt * d_plus$y_probability
d_minus <- d
d_minus$y_observed <- FALSE
d_minus$wt <- d_minus$wt * (1 - d_minus$y_probability)
d_logistic <- rbind(d_plus, d_minus)
d_logistic$wt <- d_logistic$wt / sum(d_logistic$wt)
x | omitted | wt | y_linear | y_probability | y_observed |
---|---|---|---|---|---|
-2 | -1 | 0.0000549 | -8.424252 | 0.0002194 | TRUE |
1 | -1 | 0.1827905 | 1.000527 | 0.7311621 | TRUE |
-2 | 1 | 0.0119963 | -2.987688 | 0.0479852 | TRUE |
1 | 1 | 0.2496004 | 6.437090 | 0.9984015 | TRUE |
-2 | -1 | 0.2499451 | -8.424252 | 0.0002194 | FALSE |
1 | -1 | 0.0672095 | 1.000527 | 0.7311621 | FALSE |
-2 | 1 | 0.2380037 | -2.987688 | 0.0479852 | FALSE |
1 | 1 | 0.0003996 | 6.437090 | 0.9984015 | FALSE |
The point is: this data has our original coefficients encoded in it as the coefficients of the generative process for y_observed
. We confirm this by fitting a logistic regression.
# infer coefficients from binary outcome
# suppressWarnings() only to avoid "fractional weights message"
suppressWarnings(
glm(
y_observed ~ x + omitted,
data = d_logistic,
weights = d_logistic$wt,
family = binomial(link = "logit")
)$coef
)
## (Intercept) x omitted
## 0.5772151 3.1415914 2.7182800
Notice we recover the same coefficients as before. We could use these inferred coefficients to answer questions about how the probabilities of outcomes vary with changes in the variables in the data.
Now, let’s try to (and fail to) repeat our omitted variable experiment.
First we confirm omitted
is mean zero and uncorrelated with our variable x
, even in the new data set and new row weight distribution.
## [1] 1.50162e-17
# check uncorrelated
knitr::kable(
cov.wt(
d_logistic[, c('x', 'omitted')],
wt = d_logistic$wt
)$cov
)
x | omitted | |
---|---|---|
x | 2.882739 | 0.000000 |
omitted | 0.000000 | 1.281217 |
We pass the check. But, as we will see, this doesn’t guarantee non-entangled behavior for a logistic regression.
# infer coefficients from binary outcome, with omitted variable
# suppressWarnings() only to avoid "fractional weights message"
suppressWarnings(
glm(
y_observed ~ x,
data = d_logistic,
weights = d_logistic$wt,
family = binomial(link = "logit")
)$coef
)
## (Intercept) x
## 0.00337503 1.85221234
Notice the new x
coefficient is nowhere near the value we saw before.
A stern way of interpreting our logistic experiment is:
For a logistic regression model: an omitted explanatory variable can bias other coefficient estimates. This is true even when the omitted explanatory variable is mean zero, symmetric, and uncorrelated with the other model explanatory variables. This differs from the situation for linear models.
Another way of interpreting our logistic experiment is:
For a logistic regression model: the correct inference for a given explanatory variable coefficient often depends on what other explanatory variables are present in the model.
That is: we didn’t get a wrong inference. We just got a different one, as we are inferring in a different situation. The fallacy was thinking a change in variable value has the same effect no matter what the values of other explanatory variables are. This is not the case for logistic regression, due to the non-linear shape of the logistic curve.
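For readers following along in Python (a hedged numpy reproduction of mine, via Newton's method on the weighted log-likelihood rather than the article's glm call), the same attenuated coefficient appears:

```python
import numpy as np

euler = 0.5772156649015329
x = np.array([-2.0, 1.0, -2.0, 1.0])
omitted = np.array([-1.0, -1.0, 1.0, 1.0])
link = euler + np.pi * x + np.e * omitted
p = 1.0 / (1.0 + np.exp(-link))          # P(y_observed = TRUE)

# expand to 8 weighted rows: TRUE rows weighted p/4, FALSE rows (1-p)/4
X = np.column_stack([np.ones(8), np.concatenate([x, x])])  # omit "omitted"
y = np.concatenate([np.ones(4), np.zeros(4)])
wt = np.concatenate([0.25 * p, 0.25 * (1.0 - p)])
wt = wt / wt.sum()

beta = np.zeros(2)
for _ in range(50):  # Newton-Raphson for the weighted logistic MLE
    q = 1.0 / (1.0 + np.exp(-X @ beta))
    grad = X.T @ (wt * (y - q))
    hess = (X * (wt * q * (1.0 - q))[:, None]).T @ X
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # approximately [0.0034, 1.8522], matching the glm fit above
```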
Diagrammatically what happened is the following.
In the above diagram we portray the sigmoid()
, or logistic curve. The horizontal axis is the linear or “link space” for predictions and the vertical axis is the probability or response space for predictions. The curve is the transform the logistic regression’s linear or link prediction is run through to get probabilities or responses. On this curve we have added as dots the four different combinations of values for x
and omitted
in our data set. The dots attached by lines differ only by changes in omitted
, i.e. those that have given value for x
.
Without the extra variable omitted
we can’t tell the joined pairs apart, and we are forced to use compromise effect estimates. However, the amount of interference is different for each value of x
. For x = -2
, the probability is almost determined, and omitted
changes little. For x = 1
things are less determined, and omitted
can have a substantial effect. How much observed probability effect x has depends on how deep and how often the value of omitted pushes one into the flat regions of the sigmoid, which obscures results much like a statistical interaction would (though by a different mechanism).
This is a common observation in logistic regression: you can’t tell if a variable and coefficient have large or small effects without knowing the specific values of the complementary explanatory variables.
You get different estimates for variables depending on what other variables are present in a logistic regression model. This looks a lot like an interaction, and leads to effects similar to omitted variable bias. This happens more often than in linear regression models. This is also interpretable as: different column-views of the data having fundamentally different models.
A possible source of surprise is: appealing to assumed independence is a common way of assuring one is avoiding issues such as Simpson’s paradox in linear regression modeling. Thus it is possible an “independence implies non-interference” intuition is part of some modeler’s toolboxes.
In conclusion: care has to be taken in taking inferred logistic coefficients out of their surrounding context. The product of a logistic regression coefficient and matching value is not directly an effect size outside of context, this differs from the case for linear regression. In logistic regression, omitted variables tend to push coefficient estimates towards zero.
What are your opinions/experience? Some questions I feel are relevant include: what is the correct value for the x-coefficient in the logistic regressions? Is it 3.1415, 1.8522, both, or neither?

The R source for this article can be found here.
This package is useful in converting Jupyter notebooks to/from Python, and also in rendering many parameterized notebooks. The idea is to make Jupyter notebooks easier to use in production.
The latest feature is an extension of notebook parameterization. In addition to the init_code and output_suffix features, which allow adding arbitrary code to notebooks and saving multiple renders of the same notebook under different (non-colliding!) names, the new sheet_vars feature allows the insertion of arbitrary data into notebook renders (in addition to the earlier code insertion facility).
Let’s work through this with an example. We start with a notebook we wish to render with different parameters. For example, suppose each notebook is processing a few files; and we want to break the processing up into many renders to parallelize the task. Our example task notebook is here:
The notebook refers to an, at this point, undefined variable named sheet_vars. To debug this notebook we would define this variable and run the notebook in JupyterLab, VSCode, or other tools. When moving to production we would remove the debug setting and use wvpy to run the processes.
We would then use a process similar to the following notebook to run our jobs.
The user’s job is to define the “Jupyter tasks” and the rest is handled by wvpy. The first task renders as follows.
The data is moved from the driver to the task notebook through a temporary pickle file. The wvpy package inserts the pickle loading code at the top of the notebook. Notice this notebook processes "fname_1.txt"
and "fname_2.txt"
. In production we are likely running notebooks largely for their side effects (reading, processing, and writing data) not for the HTML results.
However, if we want cleaner HTML results, one can turn off input cell rendering and get a cleaner result, as we see in the second result here:
All of the above examples are available here. I have used this lightweight system successfully in a number of projects, and hope you find it useful in your work.
1994 had an exciting moment when Fred Galvin solved the 1979 Jeff Dinitz conjecture on list-coloring Latin squares.
Latin squares are a simple predecessor to puzzles such as Sudoku. A Latin square is an n by n grid of the integers 0 through n-1 (called “colors”) such that no row or column has any repeated integers. For example, here is a 2 by 2 Latin square.
1 | 0 |
0 | 1 |
Latin squares have their uses in experimental design, and power a number of interesting puzzles and questions.
One of the great properties of Latin squares is: we know how to fill them in row by row. If the top r rows of a partially filled-in table look like they come from a Latin square, we can fill in more rows to complete the square. We can’t get stuck!
In terms of our Latin square, if we started with the following partial fill-in.
1 | 0 |
? | ? |
We can fill in the missing entries to complete the square. This was proven by Marshall Hall in 1945. The proof technique uses important and beautiful ideas about distinct representatives (one of the core ideas of combinatorics). It tells us we can find rows to continue a partial fill in. Finding continuing rows involves computing matchings, but those algorithms are considered very well understood.
There are even many ways to fill in a Latin square all at once. For example assigning cell(i, j) = (i + j) % n
is a nice solution.
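This construction is easy to check in a few lines of Python (a quick illustration, not part of the article's linked code):

```python
def modular_latin_square(n):
    # the cell(i, j) = (i + j) % n construction
    return [[(i + j) % n for j in range(n)] for i in range(n)]

def is_latin_square(sq):
    # every row and every column must contain each of 0..n-1 exactly once
    n = len(sq)
    symbols = set(range(n))
    rows_ok = all(set(row) == symbols for row in sq)
    cols_ok = all({sq[i][j] for i in range(n)} == symbols for j in range(n))
    return rows_ok and cols_ok

print(modular_latin_square(3))  # [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
print(all(is_latin_square(modular_latin_square(n)) for n in range(1, 8)))  # True
```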
In 1979 Jeff Dinitz considered a variation of the Latin square problem called list coloring. In our earlier problem each cell we filled in for our n by n Latin square could pick from the same set of “colors”: 0 through n-1. In list coloring each cell has its own list of allowed colors.
Consider the following list coloring example specification of sets we are allowed to pick from for each cell (the “lists”):
{0, 2} | {0, 1} |
{0, 2} | {1, 2} |
The question in general is: if all the lists are of at least size n, then is there at least one valid list coloring?
This list coloring Latin square specification does have a valid list coloring solution:
2 | 0 |
0 | 2 |
However, unlike the non-list coloring, we can get stuck with a seemingly harmless partial fill-in. Even picking 0 for the upper left corner ruins the problem, as we can’t extend that partial solution into a full solution.
What happened is, the heterogeneous lists lost us a lot of the symmetries that had made the earlier problem “easy.”
Galvin’s breakthrough was to propose a viable filling-in technique. The write up is excellent (1), and there are already a number of amazing appreciations of Galvin’s proof technique (1, 2, 3). However, let’s ignore the proofs and consider only the algorithm.
Galvin’s algorithm is the following. Starting with an empty partial fill-in.
And that is it. The partial fill in is done by colors instead of rows. We do still need to know what a graph kernel or stable marriage is, and how to identify such.
For the stable marriage we take a standard Latin square and use it to annotate edges onto our proposed list colorable Latin square. We have some freedom here, and we have picked our edges so that each cell points to a lower value in its column and a higher value in its row. We illustrate this below (combining our original regular Latin square fill-in with our list coloring requirements):
The magic of this orientation is: each cell has exactly n-1 incoming and n-1 outgoing arrows. The genius of the orientation is: each cell is responsible only for picking a color that avoids conflicts with its outgoing arrows. Cells that point to a given cell take on the responsibility of avoiding conflicts with that cell. As each cell starts with n colors and n-1 outgoing arrow responsibilities, it looks like we have enough degrees of freedom to color. And this is in fact the case, as our coloring strategy maintains the invariant: each cell has at least one more color available than responsibilities.
The tool to maintain the above invariant is the stable marriage or graph kernel. In our case a stable marriage or a graph kernel is exactly a set of cells with a given color option such that:
In our example the upper right and lower left cells are a stable marriage for the color 0. Neither of these cells directly points to each other, and the only other cell that is considering color zero points to one of our cells.
This gives us why the iterative coloring scheme works: when we assign a color the only cells we miss are ones that can not use the color! They lose one of their outgoing options, but they also lose cells they have to avoid. Each cell starts with n-1 outgoing arrows, and enough colors to avoid color conflicts with along all of these arrows. The iterative coloring process preserves the “enough available colors to avoid conflicts with remaining active outgoing arrows” invariant. And thus the coloring process is sound (but here we are getting into the proof!).
The Gale–Shapley stable marriage algorithm is itself an explainable and interesting algorithm (part of Shapley and Roth’s 2012 Nobel Prize in Economics).
It is a bit of a surprise that the Galvin proof is so constructive with an explicit efficient algorithm. List coloring and kernel style problems (graph independent sets, graph dominating sets) are commonly thought to not have efficient algorithms in general, as they are NP-hard. However, the instances we encounter here are all provably “easy.”
For fun I am sharing a Python implementation of Galvin’s algorithm here. It can solve problems such as our example as follows.
echo '[[{0, 2}, {0, 1}], [{0, 2}, {1, 2}]]' | python Galvin.py
# [[2, 0], [0, 1]]
This is a second valid solution to the original list coloring problem. [[2, 1], [0, 2]]
is also a solution, so even this “restricted” 2 by 2 system has more solutions than the non-list coloring problem of the same size. This is the sense in which we think list coloring is “easier” than non-list coloring. Remember: Dinitz’s conjecture was that the list coloring was always non-empty for color lists of size at least n. Showing list coloring has even more solutions than the standard Latin square coloring problem would be a much stronger result!
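For such a small instance we can brute-force the full solution count (a quick illustration of mine, separate from the linked Galvin.py):

```python
from itertools import product

# the allowed color lists for each cell of the 2 by 2 example
lists = [[{0, 2}, {0, 1}],
         [{0, 2}, {1, 2}]]

def valid(a, b, c, d):
    # rows and columns of [[a, b], [c, d]] must have distinct entries
    return a != b and c != d and a != c and b != d

solutions = [
    [[a, b], [c, d]]
    for a, b, c, d in product(lists[0][0], lists[0][1],
                              lists[1][0], lists[1][1])
    if valid(a, b, c, d)
]
print(len(solutions))  # 3
```

This confirms the three solutions discussed here, versus the two Latin squares of order 2 in the non-list problem.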
We share some counts and more examples here.
Hopefully this shares some of the joy of constructive (or algorithmic) combinatorics.
For an n-dimensional vector with unit L2 norm we can see L1 norms as small as 1 (for the (1, 0, ..., 0) vector), or as large as sqrt(n) (for the (1/sqrt(n), ..., 1/sqrt(n)) vector). Some of the situations are indicated in the following diagram.
What we were able to prove here is that for large n
the expected L1 norm approaches sqrt(2 n / π)
(within a constant multiple of the maximum possible) and the variance of this is approaching 1 - 3 / π
.
The constant variance means this distribution is tightly concentrated around its mean.
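A Monte Carlo spot-check (an illustration of mine, not the proof) agrees with the asymptotic mean:

```python
import numpy as np

rng = np.random.default_rng(123)
n, trials = 1000, 2000
G = rng.normal(size=(trials, n))
# normalizing Gaussian draws gives uniform points on the unit sphere
U = G / np.linalg.norm(G, axis=1, keepdims=True)
l1_norms = np.abs(U).sum(axis=1)

print(l1_norms.mean(), np.sqrt(2 * n / np.pi))  # the two values are close
```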
Kind of a cool fact to know.
A recent calculation arriving at the value 1/2 + arctan(1/sqrt(π - 3)) / π (≅ 0.8854404657887897) used a few nifty lemmas, one of which I am calling “the sign tilting lemma.”
The sign tilting lemma is:
For X, Y independent mean zero normal random variables with known variances
s_{x}^{2}
ands_{y}^{2}
, what isP[(X + Y ≥ 0) = (X ≥ 0)]
?
It turns out it is: 1/2 + arctan(s_{x} / s_{y}) / π
.
This is solving: how much does knowing a fraction of the variance tell you about the sign of a sum?
Let’s check if this is right in a few cases.
- For s_{x} / s_{y} large we expect P[(X + Y ≥ 0) = (X ≥ 0)] near 1 (X dominates X + Y, making X and X + Y highly correlated). Our formula agrees with this.
- For s_{x} / s_{y} near zero we expect P[(X + Y ≥ 0) = (X ≥ 0)] near 1/2 (Y dominates X + Y, making X and X + Y nearly independent). Our formula agrees with this.
- For s_{x} / s_{y} = 1 we expect P[(X + Y ≥ 0) = (X ≥ 0)] = 3/4. Our formula agrees with this.

The reason we have that s_{x} / s_{y} = 1
implies P[(X + Y ≥ 0) = (X ≥ 0)] = 3/4
is as follows. We only get inequality when X
and Y
have different signs (chance 1/2 of that by symmetry of X
to Y
) and |Y| > |X|
(an independent chance 1/2 of that, again by a symmetry argument). So the mismatch chance is 1/4, meaning the match chance is 3/4. In this case the following diagram shows that 3/4 of the angles between (0, 0)
and (x, y)
are in the shaded regions where sign(x) = sign(x + y)
And this completes the s_{x} / s_{y} = 1
case, as in this case (x, y)
is spherically symmetric (generates all angles uniformly at random).
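This equal-variance case is easy to confirm by simulation. A quick sketch (the sample size is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2024)
m = 1_000_000

# equal variances: s_x = s_y = 1
X = rng.normal(size=m)
Y = rng.normal(size=m)

# a sign mismatch between X and X + Y requires X, Y of opposite sign AND |Y| > |X|
mismatch = np.mean((np.sign(X) != np.sign(Y)) & (np.abs(Y) > np.abs(X)))
match = np.mean((X + Y >= 0) == (X >= 0))
print(mismatch, match)   # near 1/4 and 3/4
```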
For the general case, write X = s_{x} u and Y = s_{y} v for u, v independent normal mean zero variance 1 random variables. Checking X + Y ≥ 0 is checking if the point (s_{x} u, s_{y} v) is above or below the line x + y = 0. This in turn is equivalent to checking if the point (u, v) is above or below the line s_{x} x + s_{y} y = 0 (work through the algebra for checking whether a point is above or below a line). This is a bit tricky to visualize; it is a bit of how geometry and algebra differ in viewpoint.
The advantage of the last check being: we are again in a spherically symmetric situation with all angles generated uniformly. We can illustrate the geometry as follows. We put on the x-axis u = X / s_{x}, and on the y-axis v = Y / s_{y}, to get the uniform distribution on angles. In this notation we use a = 1 / s_{x} and b = 1 / s_{y}. This can be very confusing, as we get one reversal from how we check above/below lines and another (cancelling reversal) by inverting (u, v) → (X, Y) to (X, Y) → (u, v). In honest practice, we use the diagram to work out that the answer has to have arctan(s_{x} / s_{y}) or arctan(s_{y} / s_{x}) in it, and pick the one that works (though in math one has to pretend never to have such difficulty, or to use a dirty move to get out of it!).
Adding up the indicated areas of the shaded regions completes our argument for the lemma.
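The general formula can also be sanity-checked by simulation. A sketch (the standard deviations are arbitrary example values):

```python
import numpy as np

rng = np.random.default_rng(2024)
m = 1_000_000
s_x, s_y = 2.0, 0.5   # arbitrary example standard deviations

X = rng.normal(scale=s_x, size=m)
Y = rng.normal(scale=s_y, size=m)

p_hat = np.mean((X + Y >= 0) == (X >= 0))
p_theory = 0.5 + np.arctan(s_x / s_y) / np.pi
print(p_hat, p_theory)   # the two agree to within Monte Carlo error
```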
To conclude, we can rephrase our lemma as follows.
If a mean zero normal random variable determines an f fraction of the variance of a mean zero sum of independent normal random variables, then its sign matches the sign of the sum a 1/2 + arctan(sqrt(f / (1-f))) / π fraction of the time. That is: it gives a arctan(sqrt(f / (1-f))) / π advantage in guessing the sign of the sum. For small f this advantage is approximately sqrt(f) / π.
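The restated form and its small-f approximation can be checked the same way (f = 0.1 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2024)
m = 1_000_000
f = 0.1   # fraction of the sum's variance determined by X

X = rng.normal(scale=np.sqrt(f), size=m)
Y = rng.normal(scale=np.sqrt(1 - f), size=m)   # so X + Y has variance 1

advantage = np.mean((X + Y >= 0) == (X >= 0)) - 0.5
adv_theory = np.arctan(np.sqrt(f / (1 - f))) / np.pi
print(advantage, adv_theory, np.sqrt(f) / np.pi)
```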
Barry Rowlingson and John Mount asked the following question.
Generate vectors v_{1} and v_{2} in R^{n} with each coordinate generated IID normal mean zero, standard deviation 1. This is a common way to generate vectors with a uniform spherical distribution. Let p_{n} denote the probability that (||v_{1}||_{1} ≥ ||v_{2}||_{1}) = (||v_{1}||_{2} ≥ ||v_{2}||_{2}). What is lim_{n → ∞} p_{n}?
It turns out the answer is: 1/2 + arctan(1/sqrt(π - 3)) / π ≅ 0.8854404657887897
. I’ve taken to calling this the “L1L2 AUC” or concordance. This is not the first value I guessed.
The rather long (and brutal) argument chain to establish this can be found here. Along the way we had to solve for the expected L1 norm of a vector with unit L2 norm, and also work out P[(X + Y ≥ 0) = (X ≥ 0)] for X, Y independent mean zero normal random variables with known variances (we call this the sign tilting lemma).
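The limit itself can be estimated by direct simulation. A sketch (n and the number of pairs are arbitrary choices, and finite n leaves some bias, so agreement is only approximate):

```python
import numpy as np

rng = np.random.default_rng(2024)
n = 2_000      # dimension
m = 20_000     # number of vector pairs, drawn in batches to limit memory
batch = 1_000

agree = 0
for _ in range(m // batch):
    v1 = rng.normal(size=(batch, n))
    v2 = rng.normal(size=(batch, n))
    l1_ge = np.abs(v1).sum(axis=1) >= np.abs(v2).sum(axis=1)
    l2_ge = np.square(v1).sum(axis=1) >= np.square(v2).sum(axis=1)
    agree += np.sum(l1_ge == l2_ge)

p_hat = agree / m
p_theory = 0.5 + np.arctan(1 / np.sqrt(np.pi - 3)) / np.pi
print(p_hat, p_theory)   # p_theory ≅ 0.8854
```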
It was great to get the old “conjecture and prove/disprove” engine spinning again.