In our recent note What is new for
rquery December 2019 we mentioned an ugly processing pipeline that translates into
SQL of varying size/quality depending on the query generator we use. In this note we try a near-relative of that query in the
dplyr translates the query to
SELECT 5.0 AS `x`, `sum23` FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 4.0 AS `x` FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 3.0 AS `x` FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 2.0 AS `x` FROM (SELECT `col1`, `col2`, `col3`, `sum23`, 1.0 AS `x` FROM (SELECT `col1`, `col2`, `col3`, `col2` + `col3` AS `sum23` FROM `d`)))))
rquery translates the query to
SELECT "x", "sum23" FROM ( SELECT "col2" + "col3" AS "sum23", 5 AS "x" FROM ( SELECT "col2", "col3" FROM "example_table" ) tsql_28722584463189084716_0000000000 ) tsql_28722584463189084716_0000000001
SQL doesn’t copy the column
col1 around, and also skips the dead-values assigned into
x. The query still has some waste: the inner and outer guard queries that are used to make
SQL look a bit more regular.
What I would like to add is our new note, showing what the
data_algebra translates a similar query into the following
SELECT 5 AS "x", "col2" + "col3" AS "sum23", "col3" FROM "d"
col3 as we asked for that column to be part of the result in the newer demonstration.) This new query has fewer unnecessary steps. The idea is one can code intent step-wise in a pipeline and still end up with a fairly compact and performant
SQL query in the end.
I think both
data_algebra can save quite a lot of development resources and machine time in data wrangling tasks.
Categories: Administrativia Computer Science Pragmatic Data Science
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.