A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.
For how and why, please check out our new introductory article.
Note: now that wrapr version 1.0.2 is up on CRAN, all of the examples can be re-written without quotes using the qae() operator (“quote assignment expression”). For example:
library("seplyr")
#> Loading required package: wrapr
packageVersion("wrapr")
#> [1] '1.0.2'
plan <- partition_mutate_se(
  qae(name := tolower(name),
      height := height + 0.5,
      height := floor(height),
      mass := mass + 0.5,
      mass := floor(mass)))
print(plan)
#> $group00001
#>            name          height            mass
#> "tolower(name)"  "height + 0.5"    "mass + 0.5"
#>
#> $group00002
#>          height            mass
#> "floor(height)"   "floor(mass)"
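To make the "without quotes" point concrete, here is a minimal sketch, assuming only that wrapr (1.0.2 or later) is installed. It compares what qae() captures with the hand-quoted named character vector one would otherwise type. The variable names exprs_qae and exprs_quoted are illustrative and not taken from the article.

```r
library("wrapr")

# Quote-free form: qae() captures each `name := expression` pair as text,
# without evaluating the expressions (height and mass need not exist here).
exprs_qae <- qae(height := height + 0.5,
                 mass := floor(mass))

# The hand-quoted form this replaces: a named character vector.
exprs_quoted <- c(height = "height + 0.5",
                  mass = "floor(mass)")

print(exprs_qae)

# Compare the two forms name by name and expression by expression.
print(names(exprs_qae) == names(exprs_quoted))
print(as.character(exprs_qae) == as.character(exprs_quoted))
```

Either form should be usable wherever seplyr expects a named vector of expression strings, which is why the quoting style can be a matter of taste rather than capability.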
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Do you consider it a sustainable approach to use strings to represent expressions? I think seplyr is OK as a quick-and-dirty approach for part-time users (whom you’ve rightly advocated for in the past), but ultimately a system like tidyeval, with specialized expression objects and operations, is needed to build sophisticated and reliable functionality for manipulating data.
First, thanks for your comment. The nature of your concerns very much indicates you know what you are talking about.
You have a correct insight: the strings themselves are an unpleasant hack (with some unfortunate user-facing cost). However, I think the value-based interfaces are much more sustainable than what is currently tidyeval. For example, tidyeval appears to be preparing to completely re-factor its parsing layer (to, I imagine, no longer match R’s abstract syntax tree, something I think will have unexpected user-facing maintenance costs).
I think value-oriented referentially transparent interfaces are much more reliable for writing functions. I also feel a much sharper distinction has to be made as to what one is parsing in service of: manipulating R code, or translating to an external system (such as SQL).
My advice remains: if you have a choice, avoid rlang/tidyeval. The way rlang/tidyeval was put into dplyr before having been publicly used feels like a strong effort to avoid leaving users a choice. I have no knowledge of actual intent, but that is my outsider’s view.
I recommend whatever combination of wrapr::let() or seplyr that works best for you. In particular, now that wrapr 1.0.2 is on CRAN one can use qc(), qe(), and qae() to avoid needing quotes even with seplyr.
Obviously many people do not agree with my personal opinion. And that is quite all right; it is easy for them to avoid the seplyr and wrapr packages if they wish to do so. Also, we are careful to always only use rlang/tidyeval on client engagements unless they specifically ask otherwise (due to our own conflict of interest, so we have gotten a lot of experience with pure rlang/tidyeval projects).
I have some designs and prototypes for non stop-gap “R and Big Data” solutions. But they are going to take longer to get fully into production.
Sorry that got long (and perhaps a bit more direct than is allowed), but I think there are a lot of important issues being swept over currently in the R community.
These are all fair points. I’d summarize my current perspective as follows:
1. The philosophy of tidyeval feels right to me. I’m a big fan of quosures and quasiquotes because they can keep track of environments and lexical scoping issues, which I think will inevitably crop up in a large, complex codebase. You can tell R exactly where to look for the objects in a quosure using the quasiquoting system.
1a. That said, it’s possible the need for quosures and quasiquotes has been overblown. I have found myself using them less over the past year, because the scoped variants of the dplyr verbs cover most use cases that would otherwise need quasiquoting.
2. I can’t speak to the tidyeval implementation issues you mention, such as re-factoring the parsing layer. It does sound concerning. Enterprise data scientists are already writing code with these packages.
3. The tidyverse has always been less than mature when it comes to Big Data and database backends. I rarely use it except to toy around and see if things have improved since last attempt. It sounds like this has created problems for you, and you’re trying to develop a different system that serves your needs. That makes sense and I’m sure many others (including myself) will have a use for that system.
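As a small illustration of the environment tracking mentioned in point 1, here is a minimal sketch, assuming the rlang package is installed. make_quo() is a hypothetical helper written for this example. It shows a quosure remembering where it was created, while a plain string is evaluated wherever it happens to be parsed.

```r
library("rlang")

x <- 1  # an x in the calling scope, which the quosure should not pick up

# Hypothetical helper: builds a quosure inside its own local scope.
make_quo <- function() {
  x <- 10
  quo(x + 1)  # quo() records this function's environment with the expression
}

q <- make_quo()
result <- eval_tidy(q)  # x is looked up in the recorded environment
print(result)
#> [1] 11

# For contrast, a plain string carries no environment of its own,
# so it sees the x of the scope where it is evaluated.
plain <- eval(parse(text = "x + 1"))
print(plain)
#> [1] 2
```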
Thanks Paul!
The system I am working on is going to be a combination of cdata (which clients have already been using successfully in production) and something that will incorporate learnings from my rquery experiment (which is progressing very fast while teaching us what works well and what does not).