Introduction A surprisingly tricky problem in doing data science or analytics in the database are situations where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system. In the vtreat case, a large number of the variable encodings reduce to table-lookup […]

Estimated reading time: 15 minutes

Nina Zumel Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]

Estimated reading time: 11 minutes

The data algebra is a system for specifying data transformations in Pandas or SQL databases. To use it, we advise checking out the README and introduction. These document what data operators are the basis of data algebra transformation construction and composition. I have now added a catalog of what expression […]

Estimated reading time: 54 seconds

Machine learning “in the database” (including systems such as Spark) is an increasingly popular topic. And where there is machine learning, there is a need for data preparation. Many machine learning algorithms expect all data to be numeric without missing values. vtreat is a package (available for Python or for […]

Estimated reading time: 8 minutes

When working with multiple data tables we often need to know how for a given set of keys, how many instances of rows each table has. I would like to use such an example in Python as yet another introduction to the data algebra (an alternative to direct Pandas or […]

Estimated reading time: 8 minutes

Nina Zumel and John Mount will be speaking at the online University of San Francisco Seminar Series in Data Science! How and why to use probability models to outperform decision rules Friday April 30, 2021 12:30pm – 2pm Pacific Time See here for full details and to RSVP In this […]

Estimated reading time: 58 seconds

I am trying a new idea: “data science bites.” Data science bites are small articles and videos explaining only one idea each. This first one explains what supervised machine learning is, without going into the details of how it is realized. (link)

Estimated reading time: 27 seconds

I have up what I think is a really neat tutorial on how to plot multiple curves on a graph in Python, using seaborn and data_algebra. It is great way to show some data shaping theory convenience functions we have developed. Please check it out.

Estimated reading time: 23 seconds

Introduction Teaching basic data science, machine learning, and statistics is great due to the questions. Students ask brilliant questions, as they see what holes are present in your presentation and scaffolding. The students are not yet conditioned to ask only what you feel is easy to answer or present. They […]

Estimated reading time: 23 minutes