Introduction: A surprisingly tricky problem in doing data science or analytics in the database is the situation where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system. In the vtreat case, a large number of the variable encodings reduce to table-lookup […]

Estimated reading time: 15 minutes

We have found that for 2 by 2 confusion matrices (a common summary of the relation between two categorical variables) the expected value of the xicor coefficient of correlation specializes into the re-normalized square of the determinant! One can summarize how a 0/1 variable x relates to a 0/1 variable y […]

Estimated reading time: 1 minute

By Nina Zumel. Recently, we’ve been reading about a new correlation coefficient, \(\xi\) (“xi”), which was introduced by Professor Sourav Chatterjee in his paper, “A New Coefficient of Correlation”. The \(\xi\) coefficient has the following properties: If \(y\) is a function of \(x\), then \(\xi\) goes to 1 asymptotically as \(n\) […]

Estimated reading time: 11 minutes
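
As a rough illustration of the property quoted in the excerpt above, here is a minimal sketch of the \(\xi\) statistic for the simplest case, assuming there are no ties in either `x` or `y`. The function name `xi_no_ties` is hypothetical and this is not the code from the post; it uses the no-ties form \(\xi_n = 1 - 3 \sum_{i=1}^{n-1} |r_{i+1} - r_i| / (n^2 - 1)\), where \(r_i\) is the rank of the \(y\) value paired with the \(i\)-th smallest \(x\).

```python
def xi_no_ties(x, y):
    """Sketch of the xi coefficient, assuming no ties in x or y."""
    n = len(x)
    # Indices that visit the pairs in order of increasing x.
    order = sorted(range(n), key=lambda i: x[i])
    # Rank of each y value (1 = smallest); valid only without ties.
    rank = {v: r for r, v in enumerate(sorted(y), start=1)}
    r = [rank[y[i]] for i in order]
    # Sum of absolute rank jumps between x-neighbors.
    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1.0 - 3.0 * total / (n * n - 1)


# y is a function of x: the statistic is close to, but below, 1.
x = list(range(99))
y = [v * v for v in x]
xi = xi_no_ties(x, y)
```

For a strictly increasing relation the rank jumps are all 1, so the value works out to \(1 - 3/(n+1)\), which approaches 1 as \(n\) grows.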

The data algebra is a system for specifying data transformations in Pandas or SQL databases. To use it, we advise checking out the README and introduction. These document which data operators form the basis for constructing and composing data algebra transformations. I have now added a catalog of what expression […]

Estimated reading time: 54 seconds

Machine learning “in the database” (including systems such as Spark) is an increasingly popular topic. And where there is machine learning, there is a need for data preparation. Many machine learning algorithms expect all data to be numeric without missing values. vtreat is a package (available for Python or for […]

Estimated reading time: 8 minutes

When working with multiple data tables we often need to know, for a given set of keys, how many rows each table has. I would like to use such an example in Python as yet another introduction to the data algebra (an alternative to direct Pandas or […]

Estimated reading time: 8 minutes
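
To make the question in the excerpt above concrete, here is a small sketch that counts rows per key in each of several tables. It deliberately uses only the Python standard library rather than the data algebra or Pandas, and every name in it (`key_counts`, the `d1`/`d2` tables, the `id` column) is a hypothetical chosen for illustration, not taken from the post.

```python
from collections import Counter


def key_counts(tables, key_columns):
    """For each key seen in any table, count its rows in every table.

    tables: dict mapping a table name to a list of row dicts.
    key_columns: column names that together form the key.
    """
    counts = {
        name: Counter(tuple(row[k] for k in key_columns) for row in rows)
        for name, rows in tables.items()
    }
    # Union of the keys seen across all tables.
    all_keys = sorted(set().union(*counts.values()))
    # A Counter returns 0 for keys it has never seen.
    return {
        key: {name: counts[name][key] for name in tables}
        for key in all_keys
    }


tables = {
    "d1": [{"id": 1}, {"id": 1}, {"id": 2}],
    "d2": [{"id": 2}, {"id": 3}],
}
result = key_counts(tables, ["id"])
```

Keys that are missing from a table show up with a count of 0, which is usually the case one wants to detect before joining.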

I’d like to write a bit about measuring effect sizes and Cohen’s d. Introduction: For our note, let’s settle on a single simple example problem. We have two samples of real numbers \(a_1, \dots, a_n\) and \(b_1, \dots, b_n\). All the \(a_i\) are mutually exchangeable or generated by an independent […]

Estimated reading time: 10 minutes
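
For readers who want the quantity in hand, here is a minimal sketch of Cohen's d using the common textbook definition: the difference in sample means divided by the pooled sample standard deviation. This is an assumption about the form meant, since the excerpt above is truncated before any formula appears; the function name `cohens_d` is hypothetical.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation (textbook form)."""
    n_a, n_b = len(a), len(b)
    # Sample variances (n - 1 denominator).
    var_a = statistics.variance(a)
    var_b = statistics.variance(b)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd


a = [2.0, 4.0, 6.0]
b = [1.0, 3.0, 5.0]
d = cohens_d(a, b)
```

In this toy example the means differ by 1 and both samples have standard deviation 2, so the effect size is one half of a standard deviation.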

I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is a realization of the Codd relational algebra for data, written in terms of Python method chaining. It allows the concise, clear specification of useful data transforms. Some examples can be found here. Benefits include […]

Estimated reading time: 1 minute

I would like to share another quick tutorial on some aspects of the data algebra, this time using the example of comparing two tables. Please check it out here.

Estimated reading time: 14 seconds

I have a new intermediate introduction on the data algebra up here: Using the data algebra for Statistics and Data Science. The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows […]

Estimated reading time: 37 seconds