I’ve been writing a lot about a category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable. I think I’ve found an even better category theory re-formulation of the package, which I will describe here.

Estimated reading time: 12 minutes

In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra.

Estimated reading time: 2 minutes

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”). The advantages of data_algebra and cdata are: The […]

Estimated reading time: 17 minutes

To make getting started with rquery (an advanced query generator for R) easier we have re-worked the package README for various data-sources (including SparkR!).

Estimated reading time: 39 seconds

My BARUG rquery talk went very well, thank you very much to the attendees for being an attentive and generous audience. (John teaching rquery at BARUG, photo credit: Timothy Liu) I am now looking for invitations to give a streamlined version of this talk privately to groups using R who […]

Estimated reading time: 1 minute

I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts. John Mount discussing rquery SQL generation at LinkedIn. If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have […]

Estimated reading time: 49 seconds

Win-Vector LLC recently announced the rquery R package, an operator based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings.

Estimated reading time: 5 minutes

This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).

Estimated reading time: 23 minutes

I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join: Please read on for my discussion of this diagram and teaching […]

Estimated reading time: 3 minutes

We have updated the errata for Practical Data Science with R to reflect that it is no longer worth the effort to use the Java version of SQLScrewdriver as described. We are very sorry for any confusion, trouble, or wasted effort bringing in Java software (something we are very familiar […]

Estimated reading time: 1 minute