One of the great conveniences of performing a data science style analysis using Jupyter is that Jupyter notebooks are literate containers that combine code, text, results, and graphs. This is also one of the pain points in working with Jupyter notebooks with partners or with source control. That is: Jupyter […]
Estimated reading time: 5 minutes
Taking a break from the weekend’s Elden Ring gaming to work out the probability of winning a tournament. The article can be found here: Some Math Inspired by Losing in Elden Ring. It is a variation on a “persuasion by calculation of examples” style I am working on.
Estimated reading time: 23 seconds
Introduction: The data algebra is a Python system for designing data transformations that can be used in Pandas or SQL. The new 1.3.0 version introduces a lot of early checking and warnings to make designing data transforms more convenient and safer. An Example: I’d like to demonstrate some of these […]
Estimated reading time: 12 minutes
Introduction: A surprisingly tricky problem in doing data science or analytics in the database is the situation where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system. In the vtreat case, a large number of the variable encodings reduce to table-lookup […]
Estimated reading time: 15 minutes
Nina Zumel Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]
Estimated reading time: 11 minutes
Introduction: Professor Sourav Chatterjee recently published a new coefficient of correlation called XICOR (refs: JASA, R package, Arxiv, Hacker News, and a Python package (different author)). The basic formula (in the tie-free case) is: Take X and Y as n-vectors of observations of random variables. Compute the ranks r(i) of […]
Estimated reading time: 6 minutes
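For readers who want to try the coefficient right away, here is a minimal sketch of the tie-free calculation as it is usually stated from Chatterjee's paper: order the observations by X, take the ranks of Y in that order, and compute 1 - 3 * sum(|r(i+1) - r(i)|) / (n^2 - 1). The function name `xicor_tie_free` is my own (hypothetical), and the sketch assumes no ties in X or Y and at least two observations. It is not the code from the article or from either package.

```python
import numpy as np


def xicor_tie_free(x, y):
    """Sketch of the xi coefficient, assuming no ties in x or y and n >= 2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Re-order the y observations so the matching x values are increasing.
    y_by_x = y[np.argsort(x, kind="stable")]
    # Ranks (1..n) of the re-ordered y values; valid only when there are no ties.
    ranks = np.argsort(np.argsort(y_by_x)) + 1
    # Sum of absolute differences between consecutive ranks.
    total_jump = np.sum(np.abs(np.diff(ranks)))
    return 1.0 - 3.0 * total_jump / (n**2 - 1)


# y is a monotone function of x here, so consecutive ranks differ by 1
# and the value is 1 - 3 / (n + 1), which approaches 1 as n grows.
xs = np.arange(10)
print(xicor_tie_free(xs, xs**2))
```

Note that the result depends on which variable is treated as X: the coefficient is deliberately asymmetric, so `xicor_tie_free(x, y)` and `xicor_tie_free(y, x)` can differ.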
The data algebra is a system for specifying data transformations in Pandas or SQL databases. To use it, we advise checking out the README and introduction. These document the data operators that form the basis for constructing and composing data algebra transformations. I have now added a catalog of what expression […]
Estimated reading time: 54 seconds
Machine learning “in the database” (including systems such as Spark) is an increasingly popular topic. And where there is machine learning, there is a need for data preparation. Many machine learning algorithms expect all data to be numeric without missing values. vtreat is a package (available for Python or for […]
Estimated reading time: 8 minutes
When working with multiple data tables we often need to know, for a given set of keys, how many rows each table has. I would like to use such an example in Python as yet another introduction to the data algebra (an alternative to direct Pandas or […]
Estimated reading time: 8 minutes
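As a point of comparison, the key-counting task described above can be sketched in direct Pandas. This is not the data algebra solution from the article; the helper name `count_rows_per_key`, the `n_` column prefix, and the example tables are all hypothetical choices made for illustration.

```python
import pandas as pd


def count_rows_per_key(tables, keys):
    """For each key combination, count how many rows each table has.

    tables: dict mapping a table name to a DataFrame containing the key columns.
    keys: list of key column names.
    """
    counts = [
        # One Series per table: number of rows for each key combination.
        frame.groupby(keys).size().rename(f"n_{name}")
        for name, frame in tables.items()
    ]
    # Outer-align on the keys; a key missing from a table gets a count of 0.
    result = pd.concat(counts, axis=1).fillna(0).astype(int).sort_index()
    result.index.names = keys
    return result.reset_index()


d1 = pd.DataFrame({"id": [1, 1, 2]})
d2 = pd.DataFrame({"id": [2, 3, 3, 3]})
print(count_rows_per_key({"d1": d1, "d2": d2}, keys=["id"]))
```

Keys that appear in only one table are kept with a zero count for the other, which is usually what one wants to see before deciding what kind of join is safe.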