
wrapr: for sweet R code

This article is on writing sweet R code using the wrapr package.


The problem

Consider the following R puzzle. You are given: a data.frame, the name of a column that you wish to find missing values (NA) in, and the name of a column to land the result. For instance:

d <- data.frame(x = c(1, NA))
print(d)

 #    x
 # 1  1
 # 2 NA

cname <- 'x'
print(cname)

 # [1] "x"

rname <- paste(cname, 'isNA', sep = '_')
print(rname)

 # [1] "x_isNA"

How do you write generic code to populate the column x_isNA with an indication of which rows of x have missing values?

The “base R” solution

In “base R” (R without additional packages) this is easy.

When you know the column names while writing the code:

d2 <- d
d2$x_isNA <- is.na(d2$x)

print(d2)

 #    x x_isNA
 # 1  1  FALSE
 # 2 NA   TRUE

And when you don’t know the column names while writing the code (but know they will arrive in variables later):

d2 <- d
d2[[rname]] <- is.na(d2[[cname]])

The “base R” solution really is quite elegant.
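For example, the parametric form wraps directly into a re-usable function (a minimal sketch; the function name add_isNA_column is my own invention):

add_isNA_column <- function(d, cname, rname) {
  # land is.na() of the chosen input column in the chosen result column
  d[[rname]] <- is.na(d[[cname]])
  d
}

add_isNA_column(d, cname, rname)

 #    x x_isNA
 # 1  1  FALSE
 # 2 NA   TRUE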

The “all in” non-standard evaluation dplyr::mutate solution

As far as I can tell, the “all in” non-standard evaluation dplyr::mutate solution is something like the following.

When you know the column names while writing the code:

library("dplyr")
d %>% mutate(x_isNA = is.na(x))

And when you don’t know the column names while writing the code (but know they will arrive in variables later):

d %>%
  mutate_(.dots =
            stats::setNames(list(lazyeval::interp(
              ~ is.na(VAR),
              VAR = as.name(cname)
            )),
            rname))

The sweet wrapr::let dplyr::mutate solution

We will work only the harder “when you don’t yet know the column name” (or parametric) version:

library("wrapr")
let(list(COL = cname, RES = rname),
    d %>% mutate(RES = is.na(COL))
)

I think that this is pretty sweet, and can really level up your dplyr game.
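The let() pattern wraps into a re-usable function just as easily (again a sketch, with a function name of my own invention):

add_isNA_let <- function(d, cname, rname) {
  # substitute the supplied names for COL/RES before execution
  let(c(COL = cname, RES = rname),
      d %>% mutate(RES = is.na(COL))
  )
}

add_isNA_let(d, cname, rname)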

wrapr::let is available from CRAN and already has a number of satisfied users.

If function behavior depends on variable names, then convenient control of functions is eventually going to require convenient control of variable names; so needing to re-map variable names at some point is inevitable.

Categories: Coding, Opinion


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

7 replies

  1. Another cool solution would be what I am calling a “view frame.” That is: a reference style object that looks to R like a data.frame (or any class that claims to extend it such as tbl) but re-maps column names to another referred to data.frame.

    I am not a regular data.table user, but this seems like something that package may already (or could easily) supply.
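    As a rough illustration of the idea (an S3 sketch of my own; the view_frame class and its mapping format are hypothetical, and this is nowhere near a complete data.frame implementation):

    view_frame <- function(d, mapping) {
      # wrap a data.frame together with an alias -> actual column name mapping
      structure(list(data = d, mapping = mapping), class = 'view_frame')
    }

    `[[.view_frame` <- function(v, name) {
      # translate the alias to the underlying column name, then defer to the data
      actual <- if (name %in% names(v$mapping)) v$mapping[[name]] else name
      v$data[[actual]]
    }

    d <- data.frame(x = c(1, NA))
    v <- view_frame(d, list(COL = 'x'))
    v[['COL']]  # same as d$x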

  2. Any R function or package that relies heavily on non-standard evaluation can benefit from parametric notation (such as introduced by wrapr::let). It isn’t just coding around things, but creating new capabilities (that are ready to be wrapped as re-usable functions). The more the system relies on non-standard evaluation, the larger the potential benefit (which is how I have been picking examples).

    For example:

    library("wrapr")
    
    angle <- 1:10
    var <- 'angle'
    fn <- 'sin'
    
    let(c(X=var, F=fn),
      plot(X, F(X))
    )
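    (After substitution the block above executes plot(angle, sin(angle)).)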
    

    wrapr::let can also be used with knitr markdown, which looks like the following:

    ---
    params:
        FN: sin
    ---
    ```{r}
    library("wrapr")
    let(
      alias=restrictToNameAssignments(params),
      expr={
        # blocks can be arbitrarily long
        x <- 0.1*(1:20)
        plot(FN(x))
      })
    ```
    

    The connection is: parameterized knitr converts the yaml header into the data structure params, which is already in the correct format for wrapr::let (the restrictToNameAssignments() call is just demonstrating the additional capability of filtering out non-name assignments, and is not strictly necessary).
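    Outside of knitr you can simulate the same structure by hand, which makes the connection concrete:

    library("wrapr")
    params <- list(FN = 'sin')  # what the yaml header builds for you
    let(
      alias=restrictToNameAssignments(params),
      expr={
        x <- 0.1*(1:20)
        plot(FN(x))
      })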

    I also discuss parametric markdown in the following screencast:

    Screencast: https://www.youtube.com/watch?v=iKLGxzzm9Hk

    A lot of the power of R is being able to script and program over data and standard evaluation functions; being able to conveniently script and program over non-standard evaluation adds even more power.

  3. Nice! You can also use the standard evaluation version of mutate:

    mutate_(d, .dots = setNames(list(is.na(cname)), rname))

    1. That would be nice, but it does not work. I think what that is calculating is whether the variable cname is itself a missing value (and not calculating facts about the data.frame column):

      library("dplyr")
      d <- data.frame(x = c(1, NA))
      cname <- 'x'
      rname <- paste(cname, 'isNA', sep = '_')
      mutate_(d, .dots = setNames(list(is.na(cname)), 
                                  rname))
                                  
         x x_isNA
      1  1  FALSE
      2 NA  FALSE
      

      It is interesting that the list() delays execution (which was the latest improvement I learned about), but a few more tricks are needed to get the correct outcome (which is what was pointed out to me here).

      The article is already using the standard evaluation path; it is just so buried in the adaptations that it is hard to see the underbar.
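      For reference, the working standard evaluation form (the same recipe shown in the article body) is:

      d %>%
        mutate_(.dots =
                  stats::setNames(list(lazyeval::interp(
                    ~ is.na(VAR),
                    VAR = as.name(cname)
                  )),
                  rname))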

      I have heard a few times (1, 2, 3) that big changes are coming to lazyeval and/or the standard interface paths in dplyr, but frankly that is just another reason not to waste time mastering the minutiae of the current dplyr standard interface.

      Also, if WordPress mangled some important part of your solution, I do apologize (WordPress does not like code in comments very much).

  4. I like the idea of “let” bindings in R, but I will point out that there is a much easier way to apply functions across columns in dplyr: use “mutate_each”.

    library(dplyr)
    d %>%
        mutate_each(funs("isNA" = is.na(.)))
    
    1. Aaron,

      Thanks for the comment. And you are right, I should have mentioned mutate_each (it is a great tool).

      mutate_each and summarize_each are indeed powerful. Though remember you avoided part of the problem when you typed in the name of the result column (you did not take it from the rname variable).

      The main reason they work nicely is that we can (in this case) parameterize over the primary non-standard evaluation path by the use of funs(), ".", and one_of(). I.e., we were not forced to use mutate_each_() to parameterize.

      It would be a bit of a challenge (involving either mutate_each_() or funs_()) to reproduce the following exactly without typing in column names (the presence of the extra column plus the non-conventional naming of the result are a bit hard to push into the mutate_each form).

      library("dplyr")
      library("wrapr")
      d <- data.frame(x= c(1, NA), y= c(2, 3))
      cname <- 'x'
      rname <- 'xcalc'
      let(c(RES=rname, CNAME=cname),
          d %>% mutate(RES= is.na(CNAME))
      )
      
       #    x y xcalc
       # 1  1 2 FALSE
       # 2 NA 3  TRUE
      

      The wrapr::let solution is generic: small changes in the problem did not require significant changes in the code.

      For example (as I am sure you know) writing the following is not enough:

      
      d %>% mutate_each(funs(rname= is.na(.)), 
                        one_of(cname))
      

      And again, the above is only pleasant to parameterize over because dplyr::one_of() uses standard evaluation (it takes “variables in character vector”).

      The is.na() calculation is only meant as a simple notional example; I am not trying to say I don’t know how to find NA values easily or use complete.cases. We give computing over many columns as an example, as it is a case where people readily accept that you can’t hard-code the column name. But there are many other examples (just not always as succinctly accepted) where you are supplying a service or function and need to calculate over, and land results in, one or more columns whose exact names you do not know when writing the code.

      What I want to demonstrate is that wrapr::let is an easy way to program over non-standard interfaces. It is not so important what the source of the non-standard interface is; what matters is that they are all easy to re-wrap.

      I know these long responses make me look a bit like a bully. I do appreciate your input and apologize for writing so long. I guess an additional point I would like to make (I know, more length!) is: non-standard evaluation interfaces are more of a burden than R users seem to appreciate (and dplyr is probably better engineered to mitigate the negative consequences than most people realize).
