I continue my series on the mathematics of Markov chains with a deep dive on The Drunkard’s Walk. This is a set-up for more on Wald’s Sequential Analysis (a near relative of A/B Tests). A great thing is: it explains some of the perceptual effects in Nina Zumel’s animations in […]

Estimated reading time: 29 seconds

A lot is going on with current large language models (LLMs). Some of it is quite impactful, and some of it is the usual venture capital game (“we don’t stretch the truth, but we benefit from funding those who stretch the truth”). One way to cut through this is: consider […]

Estimated reading time: 4 minutes

I recently gave a talk for a general audience on the current state of large language models such as ChatGPT. Such models are substantial projects that are already greatly disrupting many jobs and industries. I was honestly impressed by the answers to the programming and language questions I asked it. […]

Estimated reading time: 6 minutes

I’ve been trying to write directly about some of the aspects of controlled experiments and A/B testing recently. An example is: A/B testing for engineers. The goal is to undo some of the cruft and convention that has built up over the years by both proper (but telegraphic) and improper […]

Estimated reading time: 4 minutes

Most readers of this blog are likely familiar with the use of the ROC (Receiver Operating Characteristic) curve (or, at least, the area under that curve) for evaluating the quality of binary decision processes. One example of such a process is a binary classification model; another example is an A/B […]

Estimated reading time: 7 minutes

Introduction: I’d like to discuss a simple variation of A/B testing in an engineering style. By “an engineering style” I mean: We will work a simulated example to see that the system works as claimed. We will exhibit examples of problems before trying to fix them. We will demonstrate all […]

Estimated reading time: 46 minutes

Introduction: In many data science projects we have the data, but it “is in the wrong format.” Fortunately, re-formatting or reshaping data is a solved problem, with many different available tools. For this note, I would like to show how to reshape data using the data algebra’s cdata data reshaping […]

Estimated reading time: 20 minutes

For an article on A/B testing that I am preparing, I asked my partner Dr. Nina Zumel if she could do me a favor and write some code to produce the diagrams. She prepared an excellent parameterized diagram generator. However, being the author of the book Practical Data Science with […]

Estimated reading time: 5 minutes

The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database. However, Pandas data frames are typically manipulated through methods rather than with a relational query language. One can […]

Estimated reading time: 14 minutes

Introduction: Let’s continue along the lines discussed in Omitted Variable Effects in Logistic Regression. The issue is as follows. For logistic regression, omitted variables cause parameter estimation bias. This is true even for independent variables, which is not the case for the more familiar linear regression. This is a known problem […]

Estimated reading time: 34 minutes