Getting started with seplyr

A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.

For how and why, please check out our new introductory article.

Note: now that wrapr version 1.0.2 is up on CRAN all of the examples can be re-written without quotes using the qae() operator (“quote assignment expression”). For example:

library("seplyr")
#> Loading required package: wrapr
packageVersion("wrapr")
#> [1] '1.0.2'
plan <- partition_mutate_se(
  qae(name := tolower(name),
      height := height + 0.5,
      height := floor(height),
      mass := mass + 0.5,
      mass := floor(mass)))
print(plan)
#> $group00001
#>            name          height            mass 
#> "tolower(name)"  "height + 0.5"    "mass + 0.5" 
#> 
#> $group00002
#>          height            mass 
#> "floor(height)"   "floor(mass)"

Categories: Administrativia Opinion Pragmatic Data Science

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. Do you consider it a sustainable approach to use strings to represent expressions? I think seplyr is OK as a quick-and-dirty approach for part-time users (whom you’ve rightly advocated for in the past), but ultimately a system like tidyeval, with specialized expression objects and operations, is needed to build sophisticated and reliable functionality for manipulating data.

    1. First, thanks for your comment. The nature of your concerns very much indicates you know what you are talking about.

      You have a correct insight: the strings themselves are an unpleasant hack (with some unfortunate user-facing cost). However, I think the value-based interfaces are much more sustainable than tidyeval as it currently stands. For example, tidyeval appears to be preparing to completely re-factor its parsing layer (I imagine so that it no longer matches R’s abstract syntax tree, something I think will have unexpected user-facing maintenance costs).

      I think value-oriented, referentially transparent interfaces are much more reliable for writing functions. I also feel a much sharper distinction has to be made as to what the parsing is in service of: manipulating R code, or translating to an external system (such as SQL).

      My advice remains: if you have a choice, avoid rlang/tidyeval. The way rlang/tidyeval was put into dplyr before having been publicly used feels like a strong effort to avoid leaving users a choice. I have no knowledge of actual intent, but that is my outsider’s view.

      I recommend whatever combination of wrapr::let() or seplyr that works best for you. In particular now that wrapr 1.0.2 is on CRAN one can use qc(), qe(), and qae() to avoid needing quotes even with seplyr.

      Obviously many people do not agree with my personal opinion. And that is quite all right; it is easy for them to avoid the seplyr and wrapr packages if they wish to do so. Also, because of our own conflict of interest, we are careful to use only rlang/tidyeval on client engagements unless the client specifically asks otherwise (so we have gotten a lot of experience with pure rlang/tidyeval projects).

      I have some designs and prototypes for non stop-gap “R and Big Data” solutions. But they are going to take longer to get fully into production.

      Sorry that got long (and perhaps a bit more direct than is allowed), but I think there are a lot of important issues being swept over currently in the R community.

      1. These are all fair points. I’d summarize my current perspective as follows:

        1. The philosophy of tidyeval feels right to me. I’m a big fan of quosures and quasiquotes because they can keep track of environments and lexical scoping issues, which I think will inevitably crop up in a large, complex codebase. You can tell R exactly where to look for the objects in a quosure using the quasiquoting system.

        1a. That said, it’s possible the need for quosures and quasiquotes has been overblown. I have found myself using them less over the past year, because the scoped variants of the dplyr verbs cover most use cases that would otherwise need quasiquoting.

        2. I can’t speak to the tidyeval implementation issues you mention, such as re-factoring the parsing layer. It does sound concerning. Enterprise data scientists are already writing code with these packages.

        3. The tidyverse has always been less than mature when it comes to Big Data and database backends. I rarely use it for that except to toy around and see if things have improved since my last attempt. It sounds like this has created problems for you, and you’re trying to develop a different system that serves your needs. That makes sense, and I’m sure many others (including myself) will have a use for that system.

      2. Thanks Paul!

        The system I am working on is going to be a combination of cdata (which clients have already been using successfully in production) and something that will incorporate learnings from my rquery experiment (which is progressing very fast while teaching us what works well and what does not).