The Pandas data frame is probably the most popular tool for modeling tabular data in Python. For in-memory data, Pandas serves a role that might otherwise fall to a relational database, though Pandas data frames are typically manipulated through methods rather than with a relational query language. One can […]
Estimated reading time: 14 minutes
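As a minimal sketch of the method-versus-query-language point above (my own illustration, not the article's code; the table and column names are invented):

```python
import pandas as pd

# A small example table; names and values are invented for illustration.
d = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1, 2, 3, 4],
})

# Roughly the pandas method-chain analogue of:
#   SELECT group, SUM(value) AS total FROM d WHERE value > 1 GROUP BY group
result = (
    d.loc[d["value"] > 1]
     .groupby("group", as_index=False)
     .agg(total=("value", "sum"))
)
print(result)
```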
Introduction Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression. The issue is as follows: in logistic regression, omitted variables bias the parameter estimates. This is true even when the omitted variables are independent of the remaining explanatory variables, which is not the case for the more familiar linear regression. This is a known problem […]
Estimated reading time: 34 minutes
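The bias claim is easy to check with a quick simulation (a sketch of my own, assuming numpy and statsmodels are available; the coefficient values are arbitrary):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2023)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                 # independent of x1
p = 1.0 / (1.0 + np.exp(-(x1 + x2)))    # true model: both coefficients are 1
y = rng.binomial(1, p)

# The full model recovers a coefficient near 1 for x1.
full = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
# Omitting x2 attenuates the x1 coefficient toward 0, even though x2
# is independent of x1 (unlike the linear regression case).
omitted = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)
print(full.params[1], omitted.params[1])  # ~1.0 vs noticeably below 1
```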
Many people who work with data are familiar with Principal Components Analysis (PCA): it’s a linear transformation technique that’s commonly used for dimension reduction, as well as for the orthogonalization of data prior to downstream modeling or analysis. In this article, we’ll talk about another PCA-style transformation: the sphering or […]
Estimated reading time: 11 minutes
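For concreteness, here is a compact numpy sketch of one common sphering construction, PCA-whitening via the eigendecomposition of the sample covariance (my illustration, not necessarily the article's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated example data: 1000 rows, 3 columns.
x = rng.normal(size=(1000, 3)) @ np.array([[1.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.5],
                                           [0.0, 0.0, 1.0]])

xc = x - x.mean(axis=0)               # center the data
cov = np.cov(xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)      # eigendecomposition of the covariance
sphered = xc @ vecs / np.sqrt(vals)   # rotate, then rescale each direction

# The sphered data has (approximately) the identity covariance.
print(np.round(np.cov(sphered, rowvar=False), 2))
```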
Introduction I would like to illustrate a way in which omitted variables interfere with logistic regression inference (or coefficient estimation). These effects are different from what is seen in linear regression, and possibly different from some expectations or intuitions. Our Example Data Let’s start with a data example in R. # […]
Estimated reading time: 14 minutes
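One way to see why an omitted variable matters here even when it is independent of the others: averaging the sigmoid over the omitted term is not the same as applying the sigmoid to the averaged linear predictor. A small numeric check (my sketch, in Python rather than the post's R):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
omitted = rng.normal(size=1_000_000)  # mean-zero omitted effect

linear_part = 1.0  # some fixed value of the kept linear predictor
# The marginal probability after integrating out the omitted variable ...
p_marginal = sigmoid(linear_part + omitted).mean()
# ... is not sigmoid(linear_part + E[omitted]):
print(p_marginal, sigmoid(linear_part))  # ~0.70 vs ~0.73
```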
I’d like to share a great new feature of the wvpy package (available on PyPI). This package is useful for converting Jupyter notebooks to and from Python files, and for rendering many parameterized notebooks. The idea is to make Jupyter notebooks easier to use in production. The latest feature is an extension […]
Estimated reading time: 2 minutes
(Still on my math streak.) 1994 had an exciting moment when Fred Galvin solved Jeff Dinitz’s 1979 conjecture on list-coloring Latin squares. Latin squares are a simple predecessor of puzzles such as Sudoku. A Latin square is an n by n grid of the integers 0 through n-1 (called […]
Estimated reading time: 9 minutes
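For concreteness, the cyclic construction (i + j) mod n always yields a Latin square; here is a quick Python sketch (mine, for illustration):

```python
def cyclic_latin_square(n):
    """Build the n-by-n Latin square with entry (i + j) mod n."""
    return [[(i + j) % n for j in range(n)] for i in range(n)]

sq = cyclic_latin_square(4)
for row in sq:
    print(row)

# Latin square property: every row and every column contains
# each of 0 .. n-1 exactly once.
n = len(sq)
assert all(sorted(row) == list(range(n)) for row in sq)
assert all(sorted(col) == list(range(n)) for col in zip(*sq))
```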
The proof technique for “How often do the L1 and L2 norms agree?” used, as a lemma, a characterization of the L1 norm of n-dimensional vectors chosen uniformly with L2 norm equal to 1. For an n-dimensional vector with unit L2 norm we can see L1 norms as small as […]
Estimated reading time: 1 minute
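To get a feel for the quantity involved, one can sample unit-L2 vectors uniformly (normalize IID Gaussians) and look at their L1 norms; a quick sketch of my own:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10  # dimension
samples = rng.normal(size=(100_000, n))
# Normalizing IID Gaussian vectors gives the uniform distribution
# on the unit L2 sphere.
unit = samples / np.linalg.norm(samples, axis=1, keepdims=True)
l1 = np.abs(unit).sum(axis=1)

# For unit-L2 vectors the L1 norm always lies between 1 and sqrt(n).
print(l1.min(), l1.max(), np.sqrt(n))
```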
Establishing that the “L1L2 AUC” equals 1/2 + arctan(1/sqrt(π - 3)) / π (≅ 0.8854404657887897) used a few nifty lemmas, one of which I am calling “the sign tilting lemma.” The sign tilting lemma is: for X, Y independent mean-zero normal random variables with known variances s_x^2 and s_y^2, what […]
Estimated reading time: 5 minutes
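The quoted constant is easy to check numerically (a one-liner sketch):

```python
import math

# The closed form quoted above: 1/2 + arctan(1/sqrt(pi - 3)) / pi
auc = 0.5 + math.atan(1.0 / math.sqrt(math.pi - 3.0)) / math.pi
print(auc)  # 0.8854404657887897
```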
Turns out that I am still on a recreational mathematics run. Here is one I have been working on, arising from trying to explain norms and data science. Barry Rowlingson and John Mount asked the following question. Generate vectors v1 and v2 in R^n with each coordinate generated IID normal […]
Estimated reading time: 1 minute
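Given the companion posts above, one natural reading of the question is: how often does ordering two such vectors by L1 norm agree with ordering them by L2 norm? A Monte Carlo sketch of that reading (my assumption; the stated problem is truncated above):

```python
import numpy as np

rng = np.random.default_rng(42)
n, trials = 100, 200_000

v1 = rng.normal(size=(trials, n))
v2 = rng.normal(size=(trials, n))

# Do the L1 and L2 orderings of the pair agree in each trial?
agree = (
    (np.abs(v1).sum(axis=1) > np.abs(v2).sum(axis=1))
    == (np.linalg.norm(v1, axis=1) > np.linalg.norm(v2, axis=1))
)
print(agree.mean())  # fraction of trials where the orderings agree
```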
Just coming back from a vacation where I got some side time to work on recreational math problems. One stood out: packing vector sums by re-ordering. I feel you don’t deeply understand a proof until you try to work examples and rewrite it, so here (for me) it is: Picking Vectors […]
Estimated reading time: 26 seconds