The Pandas data frame is probably the most popular tool used to model tabular data in Python. For in-memory data, Pandas serves a role that might normally fall to a relational database. Pandas data frames, though, are typically manipulated through methods rather than with a relational query language. One can […]

Estimated reading time: 14 minutes
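
As a small illustration of the method-based style mentioned in the Pandas excerpt above, here is a minimal sketch of a filter-and-aggregate step written as chained methods, with a roughly equivalent SQL query in a comment. The table `d` and the column names `group`, `value`, and `mean_value` are made up for this example and do not come from the article.

```python
import pandas as pd

# Hypothetical example data (not from the article).
d = pd.DataFrame({
    "group": ["a", "a", "b"],
    "value": [1.0, 2.0, 5.0],
})

# Roughly the relational query:
#   SELECT "group", AVG(value) AS mean_value
#   FROM d WHERE value > 0 GROUP BY "group"
result = (
    d.loc[d["value"] > 0]                  # WHERE value > 0
     .groupby("group", as_index=False)     # GROUP BY "group"
     .agg(mean_value=("value", "mean"))    # AVG(value) AS mean_value
)

print(result)
```

Each method returns a new data frame, so the steps compose in reading order much like clauses of a query.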

Many people who work with data are familiar with Principal Components Analysis (PCA): it’s a linear transformation technique that’s commonly used for dimension reduction, as well as for the orthogonalization of data prior to downstream modeling or analysis. In this article, we’ll talk about another PCA-style transformation: the sphering or […]

Estimated reading time: 11 minutes

I’ve recently released a couple of articles on time series forecasting that I want to re-share: “A Time Series Apologia” and “Forecasting in Aggregate Versus in Detail”. Roughly, I am trying to point out alternatives to rushing to ARIMA without trying additional methods. ARIMA is great at handling the issues of […]

Estimated reading time: 1 minute

I would like to share a new article on some of the methods and pitfalls of time series forecasting: “A Time Series Apologia”. In it I work the seemingly simple problem of forecasting a noisy copy of sin(t). The purpose of the article is to demonstrate using ARIMA methods, and […]

Estimated reading time: 42 seconds

Dr. Nina Zumel just shared a nice Pandas/Polars Rosetta Stone. She has a list of the commonly needed data wrangling operations, and how they are realized in Pandas and Polars. This can help with the data wrangling in your projects. Please check it out!

Estimated reading time: 22 seconds

The standard way to estimate the expected value of a population from a sample of values \(v_1, \ldots, v_n\) is to compute the average \(\frac{1}{n} \sum_{i=1}^{n} v_i\). It is well known in statistics that for grouped data, there are other estimators that can have smaller expected square error. […]

Estimated reading time: 41 seconds

Introduction: The data algebra is a Python system for designing data transformations that can be used in Pandas or SQL. The new 1.3.0 version introduces a lot of early checking and warnings to make designing data transforms more convenient and safer. An Example: I’d like to demonstrate some of these […]

Estimated reading time: 12 minutes

Introduction: A surprisingly tricky problem in doing data science or analytics in the database is the situation where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system. In the vtreat case, a large number of the variable encodings reduce to table-lookup […]

Estimated reading time: 15 minutes

By Nina Zumel. Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]

Estimated reading time: 11 minutes
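
To make the \(\xi\) excerpt above a bit more concrete, here is a minimal sketch of the coefficient for the simple case with no ties: sort the pairs by \(x\), take the ranks \(r_i\) of the corresponding \(y\) values, and compute \(\xi_n = 1 - 3 \sum_{i=1}^{n-1} |r_{i+1} - r_i| / (n^2 - 1)\). The function name `xi_coefficient` and the example data are assumptions for illustration, not code from the article, and the tie-handling variant of the formula is not covered.

```python
def xi_coefficient(x, y):
    """Sketch of Chatterjee's xi for paired samples, assuming no ties."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this sketch assumes no ties in x or y")

    # Rank of each y value (1 = smallest).
    rank = [0] * n
    for r, i in enumerate(sorted(range(n), key=lambda i: y[i]), start=1):
        rank[i] = r

    # Ranks of y, listed in order of increasing x.
    order = sorted(range(n), key=lambda i: x[i])
    r = [rank[i] for i in order]

    # Sum of absolute differences between consecutive ranks.
    total = sum(abs(r[k + 1] - r[k]) for k in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


# Hypothetical example: y is a noiseless, non-monotone function of x.
x = list(range(-50, 51))
y = [(v - 0.3) ** 2 for v in x]
xi_value = xi_coefficient(x, y)

# A monotone relationship for comparison.
xi_monotone = xi_coefficient(x, [2 * v + 1 for v in x])

print(xi_value, xi_monotone)
```

In both cases the value is close to, but below, 1 for this finite sample, which matches the "goes to 1 asymptotically" wording in the excerpt.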

The data algebra is a system for specifying data transformations in Pandas or SQL databases. To use it, we advise checking out the README and introduction. These document which data operators form the basis for constructing and composing data algebra transformations. I have now added a catalog of what expression […]

Estimated reading time: 54 seconds