## Schemas for Python Data Frames

The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database. Pandas data frames, however, are typically manipulated through methods rather than with a relational query language. One can […]

## Detecting Data Differences Using the Sphering Transform

Many people who work with data are familiar with Principal Components Analysis (PCA): it’s a linear transformation technique that’s commonly used for dimension reduction, as well as for the orthogonalization of data prior to downstream modeling or analysis. In this article, we’ll talk about another PCA-style transformation: the sphering or […]
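As a rough illustration of what a sphering (whitening) transform does, here is a minimal NumPy sketch. It is not the code from the article: the function name `sphere`, the `eps` guard, and the synthetic data are assumptions made for this example. The idea is to center the data, rotate onto the principal axes, and rescale each axis to unit variance.

```python
import numpy as np


def sphere(X, eps=1e-12):
    """Center X, rotate onto its principal axes, and rescale each axis
    to unit variance, so the sample covariance of the result is
    (approximately) the identity matrix.

    `sphere` and `eps` are hypothetical names chosen for this sketch.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)          # sample covariance of the columns
    evals, evecs = np.linalg.eigh(cov)      # eigendecomposition (symmetric matrix)
    scale = 1.0 / np.sqrt(np.maximum(evals, eps))  # guard against zero variance
    return (Xc @ evecs) * scale             # rotate, then rescale each axis


# Synthetic correlated data, made up for illustration only.
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5], [0.0, 0.3]])
X = rng.normal(size=(500, 2)) @ A

Z = sphere(X)
cov_Z = np.cov(Z, rowvar=False)  # close to the 2x2 identity
```

Plain PCA stops after the rotation step; the extra rescaling is what makes every direction in the transformed data have the same variance.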

## Some of the Perils of Time Series Forecasting

I’ve recently released a couple of articles on time series forecasting that I want to re-share: “A Time Series Apologia” and “Forecasting in Aggregate Versus in Detail”. Roughly, I am trying to point out alternatives to rushing to ARIMA without trying additional methods. ARIMA is great at handling the issues of […]

## A Time Series Apologia

I would like to share a new article on some of the methods and pitfalls of time series forecasting: “A Time Series Apologia”. In it I work the seemingly simple problem of forecasting a noisy copy of sin(t). The purpose of the article is to demonstrate using ARIMA methods, and […]

## A Pandas/Polars Rosetta Stone

Dr. Nina Zumel just shared a nice Pandas/Polars Rosetta Stone. She has a list of the commonly needed data wrangling operations, and how they are realized in Pandas and Polars. This can help with the data wrangling in your projects. Please check it out!

## Doing Better than the Average

The standard way to estimate the expected value of a population from a sample of values $$v_1, \ldots, v_n$$ is to compute the average $$\frac{1}{n} \sum_{i=1}^{n} v_i$$. It is well known in statistics that for grouped data, there are other estimators that can have smaller expected square error. […]

## Method Warnings

Introduction: The data algebra is a Python system for designing data transformations that can be used in Pandas or SQL. The new 1.3.0 version introduces a lot of early checking and warnings to make designing data transforms more convenient and safer. An Example: I’d like to demonstrate some of these […]

## How to Re-Map Many Columns in a Database

Introduction: A surprisingly tricky problem in doing data science or analytics in the database is the situation where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system. In the vtreat case, a large number of the variable encodings reduce to table-lookup […]

## Exploring the XI Correlation Coefficient

By Nina Zumel. Recently, we’ve been reading about a new correlation coefficient, $$\xi$$ (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The $$\xi$$ coefficient has the following properties: If $$y$$ is a function of $$x$$, then $$\xi$$ goes to 1 asymptotically as $$n$$ […]
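To make the coefficient concrete, here is a minimal sketch of the no-ties form of $$\xi$$ from Chatterjee's paper: sort the pairs by $$x$$, take the ranks $$r_i$$ of $$y$$ in that order, and compute $$1 - 3 \sum_i |r_{i+1} - r_i| / (n^2 - 1)$$. The function name `xi_correlation` and the example data are assumptions for this illustration, and the sketch does not handle tied values.

```python
import numpy as np


def xi_correlation(x, y):
    """Chatterjee's xi coefficient, assuming no ties in x or y.

    `xi_correlation` is a hypothetical name chosen for this sketch;
    tied values would need the more general formula from the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    order = np.argsort(x, kind="stable")    # sort the pairs by x
    y_sorted = y[order]
    # rank of each y value (1 = smallest), in x-sorted order
    ranks = np.argsort(np.argsort(y_sorted, kind="stable"), kind="stable") + 1
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)


x = np.arange(1, 101)

# y is a monotone function of x.
xi_monotone = xi_correlation(x, x**3)

# y is a non-monotone (V-shaped) function of x; the offset 30.3 avoids ties.
xi_vshape = xi_correlation(x, (x - 30.3) ** 2)
```

For the monotone case with $$n = 100$$ the formula reduces to $$1 - 3/(n + 1)$$, which is below 1 at finite $$n$$ and approaches 1 as $$n$$ grows. The V-shaped case also scores high, since $$y$$ is still a function of $$x$$ even though the relationship is not monotone.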

## Data Algebra Method Catalog

The data algebra is a system for specifying data transformations in Pandas or SQL databases. To use it, we advise checking out the README and introduction. These document which data operators form the basis of data algebra transformation construction and composition. I have now added a catalog of what expression […]