## The Biased Drunkard’s Walk

Our “Markov Chains leading up to A/B tests” series continues with The Biased Drunkard’s Walk. In this note we use the theory of Toeplitz matrices to analyze a variant of the drunkard’s walk that I am calling a “first to w tournament.” We can get not only the probability of […]
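As a point of reference, here is a minimal sketch of the classic biased walk with two absorbing barriers (the textbook "gambler's ruin"), not the "first to w tournament" variant itself. It compares the well-known closed-form absorption probability against a direct solve of the tridiagonal Toeplitz system that the walk defines. The function names and the choice of the Thomas algorithm are assumptions made for illustration.

```python
def ruin_win_probability(p, start, target):
    """Closed-form P(reach `target` before 0) for a +1/-1 walk with P(+1) = p."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if target <= 0 or not 0 <= start <= target:
        raise ValueError("need 0 <= start <= target and target > 0")
    if p == 0.5:
        return start / target
    r = (1.0 - p) / p
    return (1.0 - r ** start) / (1.0 - r ** target)


def win_probabilities_by_linear_solve(p, target):
    """Solve h[i] = p*h[i+1] + q*h[i-1] with h[0] = 0 and h[target] = 1.

    The interior equations form a tridiagonal Toeplitz system
    (-q, 1, -p on the three diagonals), solved here with the Thomas algorithm.
    Returns the list h[0..target].
    """
    q = 1.0 - p
    m = target - 1  # number of interior (non-absorbing) states
    if m == 0:
        return [0.0, 1.0]
    sub, diag, sup = -q, 1.0, -p
    rhs = [0.0] * m
    rhs[-1] = p  # contribution of the boundary value h[target] = 1
    # Forward sweep.
    c_prime = [sup / diag]
    d_prime = [rhs[0] / diag]
    for i in range(1, m):
        denom = diag - sub * c_prime[i - 1]
        c_prime.append(sup / denom)
        d_prime.append((rhs[i] - sub * d_prime[i - 1]) / denom)
    # Back substitution.
    x = [0.0] * m
    x[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        x[i] = d_prime[i] - c_prime[i] * x[i + 1]
    return [0.0] + x + [1.0]


if __name__ == "__main__":
    h = win_probabilities_by_linear_solve(0.6, 10)
    print(h[5], ruin_win_probability(0.6, 5, 10))
```

The two computations should agree to floating-point precision, which is a handy sanity check before moving on to less standard variants.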

## A Deep Dive on The Drunkard’s Walk

I continue my series on the mathematics of Markov chains with a deep dive on The Drunkard’s Walk. This is a set-up for more on Wald’s Sequential Analysis (a near relative of A/B Tests). A great thing is that it explains some of the perceptual effects in Nina Zumel’s animations in […]

## A Slightly Unfair Game

I’ve been trying to write directly about some of the aspects of controlled experiments and A/B testing recently. An example is: A/B Tests for Engineers. The goal is to undo some of the cruft and convention that has built up over the years by both proper (but telegraphic) and improper […]

## Unrolling the ROC Curve

Most readers of this blog are likely familiar with the use of the ROC (Receiver Operating Characteristic) curve (or, at least, the area under that curve) for evaluating the quality of binary decision processes. One example of such a process is a binary classification model; another example is an A/B […]
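For readers who want a concrete handle on "the area under that curve", here is a minimal sketch using the standard pairwise interpretation: the AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name `auc_by_pairs` is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc_by_pairs(scores, labels):
    """Area under the ROC curve via the pairwise (rank) interpretation.

    scores: numeric model scores, higher meaning "more likely positive".
    labels: truthy for positive examples, falsy for negative examples.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0   # positive correctly ranked above negative
            elif sp == sn:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    print(auc_by_pairs([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # prints 0.75
```

In the small example, three of the four positive/negative pairs are ordered correctly, giving 0.75.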

## A/B Tests for Engineers

Introduction: I’d like to discuss a simple variation of A/B testing in an engineering style. By “an engineering style” I mean: We will work a simulated example to see that the system works as claimed. We will exhibit examples of problems before trying to fix them. We will demonstrate all […]

## Record Shaping with CData

Introduction: In many data science projects we have the data, but it “is in the wrong format.” Fortunately, re-formatting or reshaping data is a solved problem, with many different available tools. For this note, I would like to show how to reshape data using the data algebra’s cdata data reshaping […]

## Including ggplot2 Plots in Python Notebooks

For an article on A/B testing that I am preparing, I asked my partner Dr. Nina Zumel if she could do me a favor and write some code to produce the diagrams. She prepared an excellent parameterized diagram generator. However, being the author of the book Practical Data Science with […]

## Schemas for Python Data Frames

The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database. Pandas data frames, though, are typically manipulated through methods rather than with a relational query language. One can […]

## Solving for Hidden Data

Introduction: Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression. The issue is as follows. For logistic regression, omitted variables cause parameter estimation bias. This is true even for independent variables, which is not the case for more familiar linear regression. This is a known problem […]
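The claim that an omitted independent variable still biases a logistic regression coefficient can be checked with a small exact calculation, sketched here under stated assumptions. The set-up is hypothetical: a binary `x`, an independent omitted `z` taking values +1 and -1 with equal probability, and true log-odds `b*x + c*z`. Because `x` is binary, a logistic model on `x` alone is saturated, so its population-level coefficient is simply the difference in log-odds of the marginal rates.

```python
import math


def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Log-odds, the inverse of sigmoid."""
    return math.log(p / (1.0 - p))


def implied_coefficient_when_z_omitted(b, c):
    """Population coefficient on x when the independent variable z is left out.

    Assumed (hypothetical) true model: logit P(y=1 | x, z) = b*x + c*z,
    with x in {0, 1} and z in {-1, +1} equally likely and independent of x.
    """
    def marginal_rate(x):
        # Average the true response probability over the unobserved z.
        return 0.5 * (sigmoid(b * x + c) + sigmoid(b * x - c))

    return logit(marginal_rate(1)) - logit(marginal_rate(0))


if __name__ == "__main__":
    print(implied_coefficient_when_z_omitted(b=1.0, c=2.0))
    print(implied_coefficient_when_z_omitted(b=1.0, c=0.0))
```

With `b = 1` and `c = 2` the implied coefficient comes out at roughly 0.45, well below the true value of 1, even though `z` is independent of `x`. Setting `c = 0` recovers the true coefficient.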

## Detecting Data Differences Using the Sphering Transform

Many people who work with data are familiar with Principal Components Analysis (PCA): it’s a linear transformation technique that’s commonly used for dimension reduction, as well as for the orthogonalization of data prior to downstream modeling or analysis. In this article, we’ll talk about another PCA-style transformation: the sphering or […]