Introduction: I want to spend some time thinking out loud about linear regression. As a data science consultant and teacher I spend a lot of time using linear regression and teaching linear regression. I have found each of these pursuits can degenerate into mere doctrine or instructions. “do this,” “expect […]

## Short Data Science Video: Parameterized Jupyter Notebooks

I am sharing a new short data science video: Parameterized Jupyter Notebooks. It is an example from the wvpy package showing how to programmatically re-run the same notebook with many different inputs. If you are doing data science in Python, this may help you with your projects.

## Yet Another Data Transform Tutorial

I am sharing yet another data transform tutorial here! It is about coordinatized data, the larger theory encompassing pivot and un-pivot. The example is in Python, but we also supply a similar package for R users.
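
For readers who have not met pivot and un-pivot before, here is a minimal sketch of the two operations in plain Python. It is not the package or the example from the tutorial; the table, the column names, and the helper functions `unpivot` and `pivot` are all hypothetical and chosen only for illustration.

```python
# Hypothetical example data: one row per subject, one column per measurement.
wide = [
    {"id": 1, "height": 1.5, "weight": 60.0},
    {"id": 2, "height": 1.8, "weight": 75.0},
]


def unpivot(rows, key, measure_columns):
    """Move each measurement column into its own (key, measure, value) row."""
    return [
        {key: row[key], "measure": col, "value": row[col]}
        for row in rows
        for col in measure_columns
    ]


def pivot(rows, key):
    """Collect (key, measure, value) rows back into one row per key."""
    out = {}
    for row in rows:
        # Start a new wide row the first time a key is seen, then fill it in.
        out.setdefault(row[key], {key: row[key]})[row["measure"]] = row["value"]
    return list(out.values())


long = unpivot(wide, "id", ["height", "weight"])  # four rows, one per measurement
round_trip = pivot(long, "id")                    # back to one row per id
```

Both forms carry the same facts. The long form names each cell by its coordinates (here `id` and `measure`), which is why the two transforms can undo each other.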

## Data Algebra over Polars Ready for Production Use

The data algebra is a system for composing data manipulation tasks in Python. In the data algebra, operator pipelines (or even directed acyclic graphs) are the primary objects. Applying operations composes small data pipelines into larger ones. This allows the fluid specification, inspection, and sharing of data processing and data […]

## Touching the 3rd Rail of Data Science: “R or Python?”

I’ve been seeing a lot of hot takes on whether one should do data science in R or in Python. I’ll comment generally on the topic, and then add my own myopic gear-head micro benchmark. I’ll jump in: If learning the language is the big step: then you are a […]

## Experimenting with Polars for Data in Python

I’ve just started experimenting with the Polars data frame library in Python. I really like the programmable API it exposes. In fact I am starting an experimental adapter from the data algebra to Polars. When this is complete one can use the data algebra to run the same data transform […]

## Data Science: Street Fighting Statistics

I am excited to share my guest lecture for the Department of Statistics at the University of Illinois STAT 447: Data Science Programming Methods. And thank you to Dirk Eddelbuettel for inviting me! The talk was titled “Data Science: Street Fighting Statistics” and demonstrates two simple supervised modeling tasks in R. […]

## What a Data Engineer Needs to Know About Bitemporal Modeling

A central data science engineering problem is how to organize general data into columns for analysis. I often refer to this as denormalization, or the deliberate arranging of data so all entries of a record are in a single row in a single table. In this note I will write […]
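
As a rough illustration of denormalization in the sense described above (all entries of a record in a single row of a single table), here is a small sketch in plain Python. The `customers` and `orders` tables and their field names are hypothetical, and this is not code from the note itself.

```python
# Hypothetical normalized data: customer details live in one table,
# orders live in another and refer to customers by key.
customers = {
    1: {"name": "Ann", "region": "West"},
    2: {"name": "Bob", "region": "East"},
}
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": 40.0},
    {"order_id": 12, "customer_id": 1, "amount": 15.0},
]

# Denormalize: copy the customer fields onto every order row, so each
# row is a complete record that can be analyzed without further lookups.
denormalized = [
    {
        **order,
        "customer_name": customers[order["customer_id"]]["name"],
        "customer_region": customers[order["customer_id"]]["region"],
    }
    for order in orders
]
```

The cost is repeated customer data on every order row; the benefit is that each row can be fed to a model or summary on its own.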

## Y-Aware PCA

We have had some trouble with some articles being damaged or hard to access in the Win Vector blog. I (John Mount) do want to apologize for that. In particular the graphs are missing for Dr. Nina Zumel’s wonderful y-aware Principal Components regression series. The complete R .md and .Rmd […]

## wvpy Clean Up

Just a quick administrative note. To lower the number of dependencies in our Jupyter to Python converter (text and video tutorial here) I have moved the other data science tools (and their dependencies) out of the wvpy package and into a new package named wvu (“Win Vector University”). This will, […]