I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is realization of the Codd relational algebra for data in written in terms of Python method chaining. It allows the concise clear specification of useful data transforms. Some examples can be found here. Benefits include […]
I’ve been writing a lot about a category theory interpretations of data-processing pipelines and some of the improvements we feel it is driving in both the data_algebra and in rquery/rqdatatable. I think I’ve found an even better category theory re-formulation of the package, which I will describe here.
In our recent note What is new for rquery December 2019 we mentioned an ugly processing pipeline that translates into SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the data_algebra.
This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through he use of coordinatized data concepts (relying heavily on Codd’s “rule of access”). The advantages of data_algebra and cdata are: The […]
To make getting started with rquery (an advanced query generator for R) easier we have re-worked the package README for various data-sources (including SparkR!).
My BARUG rquery talk went very well, thank you very much to the attendees for being an attentive and generous audience. (John teaching rquery at BARUG, photo credit: Timothy Liu) I am now looking for invitations to give a streamlined version of this talk privately to groups using R who […]
I would like to thank LinkedIn for letting me speak with some of their data scientists and analysts. John Mount discussing rquery SQL generation at LinkedIn. If you have a group using R at database or Spark scale, please reach out ( jmount at win-vector.com ). We (Win-Vector LLC) have […]
Win-Vector LLC recently announced the rquery R package, an operator based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings.
This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).
I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join: Please read on for my discussion of this diagram and teaching […]