
dplyr 0.7 Made Simpler

I have been writing a lot (too much) on the R topics dplyr/rlang/tidyeval lately. The reason is: major changes were recently announced. If you are going to use dplyr well and correctly going forward you may need to understand some of the new issues (if you don’t use dplyr you can safely skip all of this). I am trying to work out (publicly) how to best incorporate the new methods into:

  • real world analyses,
  • reusable packages,
  • and teaching materials.

I think some of the apparent discomfort on my part comes from my feeling that dplyr never really gave standard evaluation (SE) a fair chance. In my opinion: dplyr is based strongly on non-standard evaluation (NSE, originally through lazyeval and now through rlang/tidyeval) more by taste and choice than by actual analyst benefit or need. dplyr isn’t my package, so it isn’t my choice to make; but I can still have an informed opinion, which I will discuss below.

dplyr itself is a very powerful collection of useful data analysis methods or "verbs." In some sense it is a fairly pure expression of how you organize data transformations in functional programming terms. (By the way: data.table is probably an equally fundamental and powerful formulation in object oriented terms.)

In my opinion there are only two places where dplyr truly benefits from or actually needs the (often complicated and confusing) full power of non-standard evaluation: in the dplyr::mutate() and dplyr::summarize() verbs.

I admit: a system that can’t accept arbitrary functions or expressions from the user lacks expressive power. However, the only place you truly need this power is when creating a new derived column in a data.frame. If you can do this then you can drive all of the other important data wrangling functions (row selection, row ordering, grouping, joining, and so on).
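
For instance, here is a small sketch of that idea on the built-in mtcars data (the derived column names power_to_weight and want_row are purely illustrative): the single mutate() supplies all the expressions, and the later verbs only reference columns.

suppressPackageStartupMessages(library("dplyr"))

mtcars %>%
  mutate( power_to_weight = hp / wt,            # the only step that needs expressions
          want_row = power_to_weight > 40 ) %>%
  filter( want_row ) %>%                        # row selection: just a column reference
  arrange( power_to_weight ) %>%                # row ordering: just a column reference
  select( -want_row )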

When I teach R, I teach that you are going to have to copy your data at some point. You are fighting the R language if you try to completely avoid copying as you would in other more reference-oriented languages. This is likely one of the reasons Nathan Stephens and Garrett Grolemund define "Big Data" as:

Big Data ~ ≥ 1/3 RAM.

Once you accept you are going to make copies (which is not part of all systems, but in my opinion is a part of R) then you should take advantage of the fact you are going to make copies. In particular you should land, materialize, or reify the results of complicated user expressions as actual data columns (i.e., propagate data forward, not propagate code forward). Doing this wastes some space, but can actually be easier to parallelize, potentially faster, easier to document, and much easier to debug.

There is no reason to shun code of the form:

suppressPackageStartupMessages(library("dplyr"))

starwars %>% 
  mutate( want_row = height > mass ) %>%
  filter( want_row ) %>%
  select( -want_row )
## # A tibble: 58 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 48 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

And insist you must instead write the more succinct:

starwars %>% 
  filter( height > mass )
## # A tibble: 58 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 48 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

The first form doesn’t waste much space (it adds a single new column among many) and is much easier to characterize and debug. By landing our filter criterion in a column it becomes data. Data is something we can reason about and process:

starwars %>% 
  mutate( want_row = height > mass ) %>%
  group_by( want_row ) %>% 
  summarize( count = n() )
## # A tibble: 3 x 2
##   want_row count
##      <lgl> <int>
## 1    FALSE     1
## 2     TRUE    58
## 3       NA    28

To help demonstrate and explore the expressive power of standard evaluation interfaces I am distributing a new small R package called seplyr (standard evaluation dplyr). seplyr is based on dplyr/rlang/tidyeval and is a thin wrapper that exposes equivalent standard evaluation interfaces for some of the more fundamental dplyr verbs ( group_by(), arrange(), rename(), select(), and distinct() ) and adds some of its own advanced verbs. It is similar to dplyr's now-deprecated "SE verbs", but with a more array- and list-oriented interface (de-emphasizing use of "..." in function arguments).

For example, we can take some of the code from the dplyr 0.7.0 announcement:

my_var <- quo(homeworld)
# or my_var <- rlang::sym("homeworld")

starwars %>%
  group_by(!!my_var) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)
## # A tibble: 49 x 3
##         homeworld   height  mass
##             <chr>    <dbl> <dbl>
##  1       Alderaan 176.3333  64.0
##  2    Aleen Minor  79.0000  15.0
##  3         Bespin 175.0000  79.0
##  4     Bestine IV 180.0000 110.0
##  5 Cato Neimoidia 191.0000  90.0
##  6          Cerea 198.0000  82.0
##  7       Champala 196.0000   NaN
##  8      Chandrila 150.0000   NaN
##  9   Concord Dawn 183.0000  79.0
## 10       Corellia 175.0000  78.5
## # ... with 39 more rows

And translate it into standard evaluation verbs:

# install.packages("seplyr")
library("seplyr")

my_var <- "homeworld"
summary_vars <- c("height", "mass")

starwars %>%
  select_se( c(my_var, summary_vars) ) %>%
  group_by_se( my_var ) %>%
  summarise_all( mean, na.rm = TRUE )
## # A tibble: 49 x 3
##         homeworld   height  mass
##             <chr>    <dbl> <dbl>
##  1       Alderaan 176.3333  64.0
##  2    Aleen Minor  79.0000  15.0
##  3         Bespin 175.0000  79.0
##  4     Bestine IV 180.0000 110.0
##  5 Cato Neimoidia 191.0000  90.0
##  6          Cerea 198.0000  82.0
##  7       Champala 196.0000   NaN
##  8      Chandrila 150.0000   NaN
##  9   Concord Dawn 183.0000  79.0
## 10       Corellia 175.0000  78.5
## # ... with 39 more rows

This standard evaluation interface isn’t so much a "more limited" version of dplyr as a "more disciplined" approach to working with dplyr. We are using rlang/tidyeval, but that doesn’t mean the user has to see the rlang/tidyeval internals at all times.

For the most part we are passing work to dplyr using very small (and clear) functions. You can see how to use the new dplyr/rlang/tidyeval methods by printing the source code (for example: print(group_by_se)).
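
To give the flavor, here is a minimal sketch of that kind of wrapper (not necessarily the exact seplyr source; the name group_by_se_sketch is made up for illustration), built from rlang::syms() and the splice operator !!!:

group_by_se_sketch <- function(.data, groupingVars) {
  # convert the character vector of column names into symbols,
  # then splice them into dplyr::group_by()
  dplyr::group_by(.data, !!!rlang::syms(groupingVars))
}

starwars %>%
  group_by_se_sketch(c("homeworld", "species")) %>%
  summarise(count = n())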

Also, in the development version of seplyr we are building up some exciting "complex standard evaluation verbs" such as add_group_indices() and add_group_sub_indices() which are best explained through their own documentation or an example:

# devtools::install_github('WinVector/seplyr')
library("seplyr")
groupingVars = c("cyl", "gear")

datasets::mtcars %>%
  tibble::rownames_to_column('CarName') %>%
  select_se(c('CarName', 'cyl', 'gear', 'hp', 'wt')) %>%
  add_group_indices(groupingVars = groupingVars,
                    indexColumn = 'groupID') %>%
  add_group_sub_indices(groupingVars = groupingVars,
                       arrangeTerms = c('desc(hp)', 'wt'),
                       orderColumn = 'orderInGroup') %>%
  arrange_se(c('groupID', 'orderInGroup'))
## # A tibble: 32 x 7
##           CarName   cyl  gear    hp    wt groupID orderInGroup
##             <chr> <dbl> <dbl> <dbl> <dbl>   <dbl>        <dbl>
##  1  Toyota Corona     4     3    97 2.465       1            1
##  2     Volvo 142E     4     4   109 2.780       2            1
##  3       Merc 230     4     4    95 3.150       2            2
##  4     Datsun 710     4     4    93 2.320       2            3
##  5      Fiat X1-9     4     4    66 1.935       2            4
##  6       Fiat 128     4     4    66 2.200       2            5
##  7 Toyota Corolla     4     4    65 1.835       2            6
##  8      Merc 240D     4     4    62 3.190       2            7
##  9    Honda Civic     4     4    52 1.615       2            8
## 10   Lotus Europa     4     5   113 1.513       3            1
## # ... with 22 more rows

Categories: Opinion Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

12 replies

  1. > “By the way: data.table is probably an equally fundamental and powerful formulation in object oriented terms.”

    Can you elaborate on this side note? I am genuinely interested since I’ve never considered such a fundamental/paradigm difference between the two packages (dplyr and data.table).

    1. I may be over-stating it. But roughly functional programming is “the land of the verbs” (functions being the most important thing) and object oriented programming is “the land of the nouns” (objects being the most important thing).

      So in a functional programming world you eventually come to think of a multi-step data analysis process as composition of functions. Ideas like Currying, lazy evaluation, and pipelines let you write the function composition in a more pleasant manner. Because so much of the emphasis is on composition and notation one thinks in those terms in functional programming.

      In an object oriented world you eventually come to think of data as an abstract object that you send a series of commands or messages to in sequence. So processing sequences are sequences of method calls (sequenced, not composed or nested) and the important thing is the container that holds the data. data.table embodies this as it builds a new container for data that has very powerful query, join, and aggregation operations built into it. In its notation you send commands to the data.table object in a sequence of steps (an older meaning of the word “program” as a schedule) and this notation is in no way inferior to pipes or composition, as it is how one thinks in imperative object oriented terms.
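
      A rough sketch of that style for contrast (ordinary data.table syntax; the derived column name is just for illustration): build the container once, then issue commands to it in sequence.

      library("data.table")

      dt <- as.data.table(mtcars)             # build the container
      dt[, power_to_weight := hp / wt]        # command: add a column (modifies dt in place)
      dt <- dt[order(-power_to_weight)]       # command: re-order the rows
      dt[, .(mean_hp = mean(hp)), by = cyl]   # command: aggregate by group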

      Now the attempt at “oil on the water” is:

      To understand computations in R, two slogans are helpful:

      Everything that exists is an object.
      Everything that happens is a function call.

      — John Chambers

      That is, both functional programming and object oriented programming are expected to be first-class citizens in R. So it does not make sense to assume the methodology of only one of these and use it to criticize the other (each looks cumbersome in the other’s notation).

      I’ve been reading and thinking on this a lot lately (so that is why this is so wordy). But roughly functional programming and imperative object oriented programming eventually meet in the middle (start looking a bit more like each other once they each add meta-tools and adapters). They mostly differ in what concerns are paramount or come first.

      For some fun discussions on this sort of thing you can try searching on phrases such as “objects are poor-man’s closures” and “closures are poor-man’s objects.”
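
      A tiny R illustration of that equivalence, for the curious: a closure acting as a one-method object whose private state lives in the enclosed environment.

      make_counter <- function() {
        count <- 0
        function() {
          count <<- count + 1   # the enclosing environment holds the "object's" state
          count
        }
      }

      counter <- make_counter() # "construct" the object
      counter()                 # 1
      counter()                 # 2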

      1. Thank you, that’s an interesting point. Now I see what you mean and I agree.

        Data.table is a different beast where everything is mixed together in the name of performance. In fact, data.tables are even mutable objects after all.

        Disclaimer: I am a heavy data.table user who hasn’t seen enough reasons to embrace the tidyverse yet. To me, it all seems nice only for textbook examples. One has to get one’s hands dirty when working on real-world (biggish) data.

      2. I’m also a heavy data.table user, and I’ve had to use the tidyverse when working on a coworker’s code (sticking to his wheelhouse seems the polite thing to do). I’ve come to appreciate dplyr and tidyr for hammering awkward data into a useful shape, but after that everything’s done with data.table.

        P.S.: I totally agree with John that standard evaluation is undervalued. One reason people like the tidyverse is that it frees the mind from a lot of the technical burden. But requiring an understanding of environments, promises, and the new “quosure” class seems like an equally troublesome technical burden.

        I only use character vectors for the “on” and “by” arguments with data.tables for the same reason.
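
        For example (a minimal sketch on mtcars):

        library("data.table")

        dt <- as.data.table(mtcars)
        # grouping given as a plain character vector of column names; no code capture involved
        dt[, .(mean_hp = mean(hp), mean_wt = mean(wt)), by = c("cyl", "gear")]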

  2. I have fixed up the seplyr documentation a bit, in particular showing where it does and does not differ from dplyr::*_at() and dplyr::*_() methods. Frankly a good part of the difference is: I like standard evaluation (SE), so I will maintain, document, and extend the seplyr::*_se() methods to prove they are powerful enough for real work.

    Please check it out here.

    Also, in this article I ignorantly used exactly the two verbs that now have new equivalent dplyr::*_at() forms (they were not in `dplyr` `0.5.0`): dplyr::select_at() and dplyr::group_by_at(). seplyr::arrange_se() and seplyr::rename_se() are much better examples as they work very differently (being more expressive attempts at a standard evaluation interface) than the new dplyr::*_at() forms. These equivalencies are more the exception than the rule, so it is worth using seplyr.

  3. I wonder if there is really a niche for so many variants of dplyr, e.g., seplyr and so on. One only has room for so many frameworks for handling data, after all. We already have base-R (e.g., stats::reshape and so on), the tidyverse (completely different syntax) and data.table (useful for larger data sets). Keeping up with these is enough work, on top of all the stats, machine learning, visualization and subject domain-specific packages. I am a bioinformatician myself and there are lots of new methods and new technologies coming out all the time. My next avenue (after consolidating tidyverse) is probably to see if I need to learn Spark, and to see how to operate R in batch mode on big iron. But I doubt I will have time for alternate versions of well-established packages.

    1. Actually one of the things that sent me down this rabbit hole was working out how to get dplyr to work reliably with Apache Spark via sparklyr (which now pretty much forces you onto the development version of dplyr). I taught the course here, but it took a lot of prep work.

      Please try to think of seplyr not as “yet another dplyr-like package”, but as notes on working with dplyr itself that happen to be in executable package form.

  5. I’ve been following this and your other blog entries with interest, as I’m actually feeling a little demoralised by the latest dplyr and reluctantly starting to investigate python and Rpy. My question is: you mention that summarise and mutate will not get the _se treatment. Can you give an example of how to integrate _se with these verbs? A non-working example is below. I understand you have developed special verbs to simplify this specific example of group means, but I’m interested in the general case, if that makes sense.

    library(seplyr)
    
    breakdown = function(df, groupvar, measurevar) {
      res = df %>% select(groupvar, measurevar) %>%
        filter(complete.cases(.)) %>%
        group_by_se(groupvar) %>%
        summarise(groupmean = mean(MAGIC_HERE(measurevar)))
    }
    
    breakdown(starwars, 'eye_color', 'mass')
    
    1. I myself am pretty frustrated with where things have been going. I think I speak for a lot of us when I say: at some point one wants to stop messing with notation and do some data analysis. I have also reversed my opinion (and edited the comment) and added mutate_se() and summarize_se() (the README, package vignettes, and help have a lot of text on this).

      That being said, you phrased your question very well and I happen to know the answer (here is me in a very long issue chain begging for the answer). I’ll paraphrase the answer for you here so you don’t have to repeat my experience.

      It turns out what you want to do (which I consider a dead-center primary use case for dplyr) requires using a method that is not currently re-exported by dplyr (making one wonder how hard such use cases were thought about during design).

      The missing “MAGIC” step is “!!rlang::sym()” (and I am not a huge fan of magic when it comes to teaching and documentation).

      Below is your code re-worked in pure dplyr and then again with some seplyr variations (I am not trying to trick anyone into using seplyr).

      suppressPackageStartupMessages(library("dplyr"))
      
      breakdown = function(df, groupvar, measurevar) {
        df %>% 
          select(groupvar, measurevar) %>%
          filter(complete.cases(.)) %>%
          group_by_at(groupvar) %>%
          summarise(groupmean = mean(!!rlang::sym(measurevar)))
      }
      
      breakdown(starwars, 'eye_color', 'mass')
      
      library(seplyr)
      
      breakdown_se = function(df, groupvar, measurevar) {
        df %>% 
          select_se(c(groupvar, measurevar)) %>%
          filter(complete.cases(.)) %>%
          group_by_se(groupvar) %>%
          summarize_at(measurevar, funs(mean)) %>%
          rename_se(c(groupmean = measurevar))
      }
      
      breakdown_se(starwars, 'eye_color', 'mass')
      

      There are some other variations (as.name(), rlang::parse_expr(); my experience is: no matter which one you pick you will at some point be told one of the others is the preferred notation).

      Or if you want to avoid the notational issues entirely, and learn one technique that lets you write legible parameterized code in many situations, please try our wrapr package:

      suppressPackageStartupMessages(library("dplyr"))
      library("wrapr")
      
      breakdown = function(df, groupvar, measurevar) {
        let(
          c(GROUPVAR = groupvar,
            MEASUREVAR = measurevar,
            GROUPRESULT = paste0('mean_', measurevar)),
          df %>% 
            select(GROUPVAR, MEASUREVAR) %>%
            filter(complete.cases(.)) %>%
            group_by(GROUPVAR) %>%
            summarise(GROUPRESULT = mean(MEASUREVAR)))
      }
      
      breakdown(starwars, 'eye_color', 'mass')
      
  5. I have a dumb question – what is standard evaluation? I googled a bit and found it surprisingly difficult to find a definition.

    Does SE mean we only work with strings and functions, and every name that appears in the code represents a value that has been defined somewhere along the standard search path? For example, if I see “mass” somewhere in the code, I can be sure the name mass has been bound to a value either in the containing scope, or perhaps farther up in R’s standard search path?

    Or could SE be taken as allowing for things like quosures? In that case, things like quo(mass) violate the rule above, but you can at least see in the source that they are special objects that will be evaluated differently. This is almost completely referentially transparent – the only exception, the contents of a quo(…) expression, is at least clear and explicit.

    1. Good question. “Standard Evaluation” is not a standard term (sorry!). It is just a convenient opposite of “Non-Standard Evaluation” (a term used in R for a long time).

      Roughly the correct concept is “referential transparency in evaluation”. Most languages work very hard to work this way, so it is “standard.” Referential transparency roughly means a program works the same if a given variable is replaced with the value it happens to hold. This concept can also be applied to code as a critique of macro substitutions (which are referentially transparent if the code works the same before and after the substitution). I have been struggling with what to call this, dithering between “standard evaluation” and “parametric evaluation” (variables hold values or parameters).

      Non-standard evaluation in R means code that looks at code. Like all ideas it starts out harmless (plotting functions capturing the names of their arguments for display on the axes, as in plot(x = 1:10, y = sin(1:10))). But deep down non-standard evaluation is capturing the un-evaluated parse tree of code and allowing functions to make arbitrary decisions based on peeking at the source code. Very powerful, and therefore a bit hard to reason about and to bound.
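
      A tiny base-R illustration of code looking at code, and of how it breaks referential transparency (the function name show_code is made up for illustration):

      show_code <- function(x) deparse(substitute(x))

      show_code(1 + 2)   # returns "1 + 2": the un-evaluated source is captured
      y <- 1 + 2
      show_code(y)       # returns "y": substituting a variable for the value it holds changes the answer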

      So to me SE (standard evaluation) means avoiding things like hidden or implicit capturing of source code or quosures. Using strings (which is my hack to avoid this) is “standard” in that the user knows they are typing in a string (the quote marks) and then the program later converts this to code. Such re-writing is indeed powerful, but it is at least somewhat limited in that both the user and the interpreter know which subset of the work is being looked at (a quosure is also such a “something special is going on here” marker, but it differs in that you can routinely produce quosures by capturing user code).

      Or a little more back to your description: SE means computing only over values.
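
      For example, a one-line base-R sketch of that (assuming dplyr’s starwars data is attached): the column is named by an ordinary string value and nothing inspects source code.

      col_name <- "mass"                        # the user explicitly typed a string
      mean(starwars[[col_name]], na.rm = TRUE)  # plain value-based column lookup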

      Obviously some of this is a bit of “how many angels fit on the head of a pin” (the distinction is a bit fine). In fact it is a deep result in computer science and function theory that users can write in a style where functions do have access to their own source code in any non-trivial language, even if the language designers try to prevent it (please see Quine (computing)). This inability to assume you can hide source code is central to proofs of fundamental difficulties of programming (such as the undecidability of the halting problem in non-trivial programming environments).

      Or (sorry about the length!): code that is explicitly aware of its own source code brings front and center fundamental difficulties in programming that one would like to avoid. Using a slightly more restrictive style doesn’t avoid the problems, but lets us say “well that problem wasn’t there until the user went out of their way to make trouble.”
