This article is on writing sweet R code using the wrapr package.
The problem
Consider the following R puzzle. You are given: a data.frame, the name of a column that you wish to find missing values (NA) in, and the name of a column to land the result. For instance:
    d <- data.frame(x = c(1, NA))
    print(d)
    #    x
    # 1  1
    # 2 NA
    cname <- 'x'
    print(cname)
    # [1] "x"
    rname <- paste(cname, 'isNA', sep = '_')
    print(rname)
    # [1] "x_isNA"
How do you write generic code to populate the column x_isNA with which rows of x are missing?
The “base R” solution
In “base R” (R without additional packages) this is easy.
When you know the column names while writing the code:
    d2 <- d
    d2$x_isNA <- is.na(d2$x)
    print(d2)
    #    x x_isNA
    # 1  1  FALSE
    # 2 NA   TRUE
And when you don’t know the column names while writing the code (but know they will arrive in variables later):
    d2 <- d
    d2[[rname]] <- is.na(d2[[cname]])
The “base R” solution really is quite elegant.
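The parametric base R form also drops straight into a reusable function. A minimal sketch follows; the helper name add_na_flag and its default naming rule are illustrative assumptions, not something from the original post:

```r
# Hypothetical helper: the name add_na_flag and the default for rname
# are assumptions made for illustration only.
add_na_flag <- function(d, cname, rname = paste(cname, 'isNA', sep = '_')) {
  # Fail early if the input is not a data.frame or the column is absent.
  stopifnot(is.data.frame(d), cname %in% names(d))
  # [[ ]] takes column names as ordinary character values.
  d[[rname]] <- is.na(d[[cname]])
  d
}

d <- data.frame(x = c(1, NA))
d2 <- add_na_flag(d, 'x')
print(d2)
```

Because `[[ ]]` uses standard evaluation, nothing special is needed to pass the column names in as strings.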
The “all in” non-standard evaluation dplyr::mutate solution
As far as I can tell the “all in” non-standard evaluation dplyr::mutate solution is something like the following.
When you know the column names while writing the code:
    library("dplyr")
    d %>% mutate(x_isNA = is.na(x))
And when you don’t know the column names while writing the code (but know they will arrive in variables later):
    d %>% mutate_(.dots =
      stats::setNames(list(lazyeval::interp(
        ~ is.na(VAR),
        VAR = as.name(cname)
      )), rname))
The sweet wrapr::let dplyr::mutate solution
We will only work the harder “when you don’t yet know the column name” (or parametric) version:
    library("wrapr")
    let(list(COL = cname, RES = rname),
        d %>% mutate(RES = is.na(COL))
    )
I think that this is pretty sweet, and can really level up your dplyr game.
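Since the let() block only needs the two names as strings, it can be wrapped as an ordinary function. The following is a sketch under assumptions: the wrapper name flag_na is hypothetical, and it presumes both dplyr and wrapr are installed:

```r
library("dplyr")
library("wrapr")

# Hypothetical wrapper: the name flag_na is an assumption for illustration.
flag_na <- function(d, cname, rname) {
  # COL and RES are placeholder names; let() substitutes the real
  # column names into the expression before it is evaluated.
  let(list(COL = cname, RES = rname),
      d %>% mutate(RES = is.na(COL))
  )
}

d <- data.frame(x = c(1, NA))
res <- flag_na(d, 'x', 'x_isNA')
print(res)
```

The body of the pipeline is written exactly as it would be with known column names, which is the point of the parametric notation.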
wrapr::let is available from CRAN and already has a number of satisfied users.
If function behavior depends on variable names, then convenient control of functions is eventually going to require convenient control of variable names; so needing to re-map variable names at some point is inevitable.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Another cool solution would be what I am calling a “view frame.” That is: a reference style object that looks to R like a data.frame (or any class that claims to extend it such as tbl) but re-maps column names to another referred-to data.frame.
I am not a regular data.table user, but this seems like something that package may already (or could easily) supply.
Any R function or package that relies heavily on non-standard evaluation can benefit from parametric notation (such as introduced by wrapr::let). It isn’t just coding around things, but creating new capabilities (that are ready to be wrapped as re-usable functions). The more the system relies on non-standard evaluation, the larger the potential benefit (which is how I have been picking examples).
For example: wrapr::let can also be used with knitr markdown which looks like the following:
The connection is: parameterized knitr converts the yaml header into the data structure params, which is already in the correct format for wrapr::let (the restrictToNameAssignments() call is just demonstrating the additional capability of filtering out non-name assignments, and is not strictly necessary).
I also discuss parametric markdown in the following screencast:
https://www.youtube.com/watch?v=iKLGxzzm9Hk
A lot of the power of R is being able to script and program over data and standard evaluation functions; being able to conveniently script and program over non-standard evaluation adds even more power.
Nice! You can also use the standard evaluation version of mutate:
    mutate_(d, .dots = setNames(list(is.na(cname)), rname))
That would be nice, but it does not work. I think what that is calculating is if the variable cname is a missing value or not (and not calculating facts about the data.frame column):
It is interesting the list() delays execution (which was the latest improvement I learned about), but a few more tricks are needed to get the correct outcome (which is what was pointed out to me here).
The article is already using the standard eval path, it is just so buried in the adaptations that it is hard to see the underbar.
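To illustrate the point above about is.na(cname) testing the variable rather than the column, here is a small base R sketch. It reuses d and cname as defined in the article and is only a demonstration of the distinction, not a reconstruction of any output from the original thread:

```r
d <- data.frame(x = c(1, NA))
cname <- 'x'

# is.na() applied to the variable that holds the column name:
# this is a fact about the string "x", not about the data.
print(is.na(cname))

# is.na() applied to the column that the name refers to:
# this is the per-row result the puzzle asks for.
print(is.na(d[[cname]]))
```

The first call yields a single value and the second yields one value per row, so the two cannot be interchanged when building the new column.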
I have heard a few times (1, 2, 3) that big changes are coming to lazyeval and/or the standard interface paths in dplyr, but frankly that is just another reason to not waste time mastering the minutiae of the current dplyr standard interface.
Also if WordPress mangled out some important part of your solution, I do apologize (WordPress does not like code in comments very much).
I like the idea of “let” bindings in R, but I will point out that there is a much easier way to apply functions across columns in dplyr: use “mutate_each”.
Aaron,
Thanks for the comment. And you are right I should have mentioned mutate_each (it is a great tool).
mutate_each and summarize_each are indeed powerful. Though remember you avoided part of the problem when you typed in the name of the result column (you did not take it from the rname variable).
The main reason they work nicely is we can (in this case) parametrize over the primary non-standard evaluation path by the use of funs(), “.”, and one_of(). I.e., we were not forced to use mutate_each_() to parameterize.
It would be a bit of a challenge (involving either mutate_each_() or funs_()) to reproduce the following exactly without typing in column names (the presence of the extra column plus the non-conventional naming of the result are a bit hard to push into the mutate_each form).
The wrapr::let solution is generic: small changes in the problem did not require significant changes in the code.
For example (as I am sure you know) writing the following is not enough:
And again the above is only pleasant to parameterize over as dplyr::one_of() uses standard evaluation (takes “variables in character vector”).
The is.na() calculation is only meant as a simple notional example. I am not trying to say I don’t know how to find NA values easily or use complete.cases. We give computing over many columns as an example, as it is a case where people are willing to accept you can’t hard-code the column name. But there are many other examples (just not always as succinctly accepted) where you are supplying a service or function and you need to calculate over and land one or more columns that you do not know the exact names of when you are writing the code.
What I want to demonstrate is that wrapr::let is an easy way to program over non-standard interfaces. It is not so important what the source of the non-standard interface example is, but more so that they are all easy to re-wrap.
I know these long responses make me look a bit like a bully. And I do appreciate your input and apologize for writing so long. I guess an additional point I would like to make (I know, more length!) is: non-standard evaluation interfaces are more of a burden than R users seem to appreciate (and dplyr is probably better engineered than most people appreciate to mitigate so many of the negative consequences).
I have a lot of “vtreat: prepare data” stickers to hand out. I am almost out of “wrapr: sweet R code” stickers, but if you really want one it is available from Sticker Mule (priced a bit high, but we are not taking a cut): https://www.stickermule.com/marketplace/17313-wrapr-for-sweet-r-code?utm_swu=6453