
Please Consider Using wrapr::let() for Replacement Tasks

From dplyr issue 2916.

The following appears to work.

suppressPackageStartupMessages(library("dplyr"))

COL <- "homeworld"
starwars %>%
  group_by(.data[[COL]]) %>%
  head(n=1)
## # A tibble: 1 x 14
## # Groups:   COL [1]
##             name height  mass hair_color skin_color eye_color birth_year
##            <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>, COL <chr>

Notice, though, that it reports the grouping as being by "COL", not by "homeworld". The result also has 14 columns, not the 13 of the original starwars data set: an extra column named COL has been added.

And this seemingly similar variation (currently) throws an exception:

homeworld <- "homeworld"

starwars %>%
  group_by(.data[[homeworld]]) %>% 
  head(n=1) 
## Error in mutate_impl(.data, dots): Evaluation error: Must subset with a string.

I know this will cost me what little community good-will I might have left (after already having raised this, unsolicited, many times), but please consider using our package wrapr::let() for tasks such as the above.

library("wrapr")

let(
  c(COL = "homeworld"),
  
  starwars %>%
    group_by(COL) %>%
    head(n=1)
)
## # A tibble: 1 x 13
## # Groups:   homeworld [1]
##             name height  mass hair_color skin_color eye_color birth_year
##            <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

let(
  c(homeworld = "homeworld"),
  
  starwars %>%
    group_by(homeworld) %>% 
    head(n=1)
)
## # A tibble: 1 x 13
## # Groups:   homeworld [1]
##             name height  mass hair_color skin_color eye_color birth_year
##            <chr>  <int> <dbl>      <chr>      <chr>     <chr>      <dbl>
## 1 Luke Skywalker    172    77      blond       fair      blue         19
## # ... with 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
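
In practice the column name usually arrives as a function argument rather than a literal. Here is a minimal sketch of that pattern, assuming dplyr and wrapr are both installed; the helper name first_row_by and its arguments are hypothetical and used only for illustration.

```r
library("dplyr")
library("wrapr")

# Hypothetical helper (not part of wrapr or dplyr): group a data frame
# by the column named in the string `colname` and keep the first row.
first_row_by <- function(d, colname) {
  let(
    c(COL = colname),   # map the placeholder COL to the name held in colname
    d %>%
      group_by(COL) %>%
      head(n = 1)
  )
}

res <- first_row_by(starwars, "homeworld")
group_vars(res)   # names of the grouping columns in the result
```

Because let() substitutes the name before the pipeline is evaluated, dplyr only ever sees the real column name, so no extra column should be created.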

Some explanation can be found here.

Categories: Opinion

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

5 replies

  1. Some variations in notation:

    library("dplyr")
    
    # We want to group by eye_color, 
    # see if using a deliberately confusing 
    # holder name interferes.
    homeworld <- "eye_color"
    
    # correctly groups by eye_color
    starwars %>% group_by(.data[[!!homeworld]])
    
    # error 
    starwars %>% group_by(.data[!!homeworld]) 
    
    # error 
    starwars %>% group_by(.data[[homeworld]])
    
    # error
    starwars %>% group_by(.data[homeworld])
    
    # groups by a NEW column called '"eye_color"' 
    #  (incorrect)
    starwars %>% group_by(!!homeworld)
    
    # groups by homeworld (as you would expect, 
    #  but we want eye_color)
    starwars %>% group_by(homeworld)
    
    # groups by homeworld (but we want eye_color)
    starwars %>% group_by(.data$homeworld)
    
    # error
    starwars %>% group_by(.data$!!homeworld)
    
    # error
    starwars %>% group_by(.data$(!!homeworld))
    

    The notations being most commonly taught by the package authors are: verb(!!columnvar), verb(.data[[columnvar]]), verb(.data$columnvar) (also here) and verb(.data[columnvar]). None of these are the working notation verb(.data[[!!columnvar]]).

    Note it is possible to remember the form that works (though it isn’t the one being commonly taught). First, I already teach to always use [[]] where possible, as it is stricter than []. So let’s assume we always remember to do that. Then just remember that you should always have a !! in rlang/tidyeval situations.

    The example comes from here which also suggests using the quo() notation (in addition to using string notation). That does appear to work:

    homeworld <- quo(eye_color)
    
    # correctly groups by eye_color
    starwars %>% group_by(!!homeworld)
    

    However, that does not address the application I am actually interested in: wrapping a column name that comes from an external string (perhaps even coming from an external configuration file). In my applications I not only manipulate column names as strings, they are often first available in that form. I feel if you are close enough to write “`quo(eye_color)`” you are likely close enough to re-code the pipeline directly.

    To get the quo()-like notation to work in that case I assume you must do something like the following:

    # assume the value in varstr comes from very far away
    # such as a config file or database
    varstr <- 'eye_color'
    
    # here we try to wrap it into quote-type structures
    homeworld <- as.name(varstr)
    
    # correctly groups by eye_color
    starwars %>% group_by(!!homeworld)
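
    The same string-to-symbol step can also be written with rlang's own helpers. This is only a sketch, assuming the rlang package is installed (dplyr depends on it); rlang::sym() converts one string to a symbol and rlang::syms() converts several, which are then spliced in with !!!. The variable names grouped, varstrs and grouped2 are illustrative.

    ```r
    library("dplyr")

    # assume the value in varstr comes from very far away
    varstr <- "eye_color"

    # rlang::sym() plays the same role as as.name() for a single string
    grouped <- starwars %>% group_by(!!rlang::sym(varstr))
    group_vars(grouped)

    # several column names held as strings: convert with syms(), splice with !!!
    varstrs <- c("eye_color", "homeworld")
    grouped2 <- starwars %>% group_by(!!!rlang::syms(varstrs))
    group_vars(grouped2)
    ```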
    
  2. I do think wrapr::let is clearer and easier to understand than the dplyr approach. I have already used it in production code.

    Any performance test?

    1. Thanks! I really appreciate it. Our group will make sure wrapr::let() remains stable and production worthy.

      As for timings: keep in mind that both substitution systems should take very little time compared to any substantial calculation task, so the difference should not matter much. But it is a fun question, so I worked up a quick report here.

      The report compares 3 substitution methods: wrapr::let() (labeled as `fWrapr*`), `rlang::eval_tidy()` working from a name holding a string (closest to wrapr::let() in behavior, labeled as `fTidyN*`), and `rlang::eval_tidy()` working from a `quo()` symbol (the case the `rlang`/`tidyeval` package authors seem to discuss the most, labeled as `fTidyQ*`). I plotted the timing distributions a few ways and drew some conclusions while substituting 1 to 10 variables. Below is one of the plots (more context is given in the report):

  3. Thank you for the benchmark.

    Recently, I have been doing many data manipulations using a combination of dplyr + purrr + wrapr. When the project is finished, I think I will have some experience to discuss.

    Keep up the great work

  4. Hi John – I have also started using wrapr::let in analysis code. It really helps to lift the mental burden of constantly thinking about quoting, unquoting, strings vs. symbols, and other metaprogramming issues which are largely peripheral to getting things done.

    I’ve really tried to do it the idiomatic dplyr way out of respect for Mr. Wickham’s amazing work… but, as I’ve said before, I don’t like constantly thinking about metaprogramming. Things like quasiquotes may be “powerful” but I don’t want to be worrying about them with every line I write. If I did… I’d probably be a LISP programmer :)
