Our “Markov Chains leading up to A/B tests” series continues with The Biased Drunkard’s Walk. In this note we use the theory of Toeplitz matrices to analyze a variant of the drunkard’s walk that I am calling a “first to w tournament.” We can get not only the probability of […]
Estimated reading time: 58 seconds
I continue my series on the mathematics of Markov chains with a deep dive on The Drunkard’s Walk. This is a set-up for more on Wald’s Sequential Analysis (a near relative of A/B Tests). A great thing is: it explains some of the perceptual effects in Nina Zumel’s animations in […]
Estimated reading time: 29 seconds
I’ve been trying to write directly about some of the aspects of controlled experiments and A/B testing recently. An example is: A/B testing for engineers. The goal is to undo some of the cruft and convention that has built up over the years by both proper (but telegraphic) and improper […]
Estimated reading time: 4 minutes
Most readers of this blog are likely familiar with the use of the ROC (Receiver Operating Characteristic) curve (or, at least, the area under that curve) for evaluating the quality of binary decision processes. One example of such a process is a binary classification model; another example is an A/B […]
Estimated reading time: 7 minutes
I’d like to discuss a simple variation of A/B testing in an engineering style. By “an engineering style” I mean: We will work a simulated example to see that the system works as claimed. We will exhibit examples of problems before trying to fix them. We will demonstrate all […]
Estimated reading time: 46 minutes
In many data science projects we have the data, but it “is in the wrong format.” Fortunately, re-formatting or reshaping data is a solved problem, with many different available tools. For this note, I would like to show how to reshape data using the data algebra’s cdata data reshaping […]
Estimated reading time: 20 minutes
For an article on A/B testing that I am preparing, I asked my partner Dr. Nina Zumel if she could do me a favor and write some code to produce the diagrams. She prepared an excellent parameterized diagram generator. However, being the author of the book Practical Data Science with […]
Estimated reading time: 5 minutes
The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database. However, Pandas data frames are typically manipulated through methods rather than with a relational query language. One can […]
Estimated reading time: 14 minutes
Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression. The issue is as follows. For logistic regression, omitted variables cause parameter estimation bias. This is true even for independent variables, which is not the case for more familiar linear regression. This is a known problem […]
Estimated reading time: 34 minutes
Many people who work with data are familiar with Principal Components Analysis (PCA): it’s a linear transformation technique that’s commonly used for dimension reduction, as well as for the orthogonalization of data prior to downstream modeling or analysis. In this article, we’ll talk about another PCA-style transformation: the sphering or […]
Estimated reading time: 11 minutes