I have written about referential transparency before. In this article I would like to discuss “leaky abstractions” and why wrapr::let() supplies a useful (but leaky) abstraction for R programmers.
Abstractions
A common definition of an abstraction is (from the OS X dictionary):
the process of considering something independently of its associations, attributes, or concrete accompaniments.
In computer science this is commonly taken to mean “what something can be thought to do independent of caveats and implementation details.”
The magrittr abstraction
In R one traditionally thinks of the magrittr "%>%" pipe abstractly in the following way:
Once "library(magrittr)" is loaded we can treat the expression 7 %>% sqrt() as if the programmer had written sqrt(7).
That is the abstraction of magrittr into terms one can reason about and plan over. You think of x %>% f() as a synonym for f(x). This is an abstraction because magrittr is not in fact implemented as a macro source-code re-write, but in terms of function argument capture and delayed evaluation. And as Joel Spolsky famously wrote:
All non-trivial abstractions, to some degree, are leaky.
The magrittr pipe is non-trivial (in the sense of doing interesting work) because it works as if it were a syntax replacement even though you can use it in more places than you could ask for such a syntax replacement. The upside is: magrittr makes two statements behave nearly equivalently. The downside is: we expect this to fail in some corner cases. This is not a criticism; it is as Bjarne Stroustrup wrote:
There are only two kinds of languages: the ones people complain about and the ones nobody uses.
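One such corner case shows up with functions that inspect the expression they were called with. The following is a small sketch, assuming the magrittr package is installed; the helper name show_arg is made up for illustration. Both calling styles compute the same value, but a function built on base R's substitute() can tell them apart:

```r
library("magrittr")

# show_arg is a hypothetical helper: it returns the text of the
# expression its caller supplied, without evaluating it
show_arg <- function(x) deparse(substitute(x))

# for ordinary functions the two forms agree in value
direct <- sqrt(7)
piped  <- 7 %>% sqrt()
identical(direct, piped)

# for a function that looks at its argument expression they differ
direct_text <- show_arg(7)        # sees the literal the caller typed
piped_text  <- 7 %>% show_arg()   # sees the pipe's placeholder instead
print(c(direct_text, piped_text))
```

With the magrittr releases I am aware of, direct_text is "7" while piped_text is ".", which is the "f(x) versus x %>% f()" synonym leaking.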
The tidyeval/rlang abstraction
The package dplyr 0.5.0.9004 brings in a new package called rlang to supply a capability called tidyeval. Among the abstractions it supplies are operators for quoting and un-quoting variable names. This allows code like the following, where a dplyr::select() takes a variable name from a user-supplied variable (instead of the usual explicit take from the text of the dplyr::select() statement).
# devtools::install_github('tidyverse/dplyr')
library("dplyr")
packageVersion("dplyr")
# [1] ‘0.5.0.9004’
varName <- quo(disp)
mtcars %>% select(!!varName) %>% head()
#                   disp
# Mazda RX4          160
# Mazda RX4 Wag      160
# Datsun 710         108
# Hornet 4 Drive     258
# Hornet Sportabout  360
# Valiant            225
Notice in the above example we had to specify the abstract varName by calling quo() on a free variable name (disp) and did not take the value from a string. [updated 2017-05-03] To work with a string contained in another variable the syntax is:
varName <- as.name(colnames(mtcars)[[1]])
mtcars %>% select(!!varName) %>% head()
or:
varName <- rlang::sym(colnames(mtcars)[[1]])
mtcars %>% select(!!varName) %>% head()
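Both variants rest on the same step: turning a string into a symbol before it is unquoted. A minimal base-R sketch of just that step (no dplyr or rlang needed; the variable names colName and colSym are illustrative):

```r
# take a column name, as a string, from the data
colName <- colnames(mtcars)[[1]]

# convert the string into a symbol (a "name" object)
colSym <- as.name(colName)

class(colSym)                   # a "name", no longer a character string
identical(colSym, quote(mpg))   # same object as typing the bare name mpg
eval(colSym, mtcars)[1:3]       # the symbol is looked up as a column of mtcars
```

The symbol carries no environment of its own; it is only a name, and it is the surrounding call that decides where to look it up.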
The wrapr::let() abstraction
Our wrapr package can abstract the recent example (working over strings instead of “quosure” classes) as follows.
The (leaky) abstraction is:
“varName <- 'var'; wrapr::let(c(VAR=varName), expr(VAR))” is treated as if the user had written expr(var).
This can also be thought of as a form of unquoting, as you do see one set of quotes disappear.
Let’s try it:
library("wrapr")
x <- 5
varName <- 'x'
VAR <- NULL # make sure macro target does not look like an unbound reference
let(c(VAR=varName), VAR)
# [1] 5
The NULL assignment is not needed, but adding something like that prevents CRAN-style checks from thinking the macro replacement target VAR is an unbound variable in the let block. I’ll leave this out of the later examples for conciseness.
Or moving back to our dplyr::select() example:
varName <- 'disp'
let(
  c(VARNAME = varName),
  mtcars %>% select(VARNAME) %>% head()
)
#                   disp
# Mazda RX4          160
# Mazda RX4 Wag      160
# Datsun 710         108
# Hornet 4 Drive     258
# Hornet Sportabout  360
# Valiant            225
And wrapr::let() can also conveniently handle the “varName <- colnames(mtcars)[[1]]” case.
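To make the "code re-writing" reading of let() concrete, here is a minimal base-R sketch of name-for-name substitution. It is not wrapr's implementation: toy_let is a hypothetical helper written only for this illustration, and it skips the argument checking a real package would need.

```r
# toy_let: hypothetical helper, NOT part of wrapr.
# alias: named character vector, e.g. c(VAR = "x")
# expr:  code in which the alias names should be replaced
toy_let <- function(alias, expr) {
  # capture the code the caller wrote, without evaluating it
  captured <- substitute(expr)
  # turn each replacement string into a symbol, keeping the target names
  mapping <- lapply(alias, as.name)
  # re-write the captured code, swapping target names for the new symbols
  rewritten <- do.call(substitute, list(captured, mapping))
  # run the re-written code where the caller would have run it
  eval(rewritten, envir = parent.frame())
}

x <- 5
varName <- "x"
res <- toy_let(c(VAR = varName), VAR + 1)   # evaluated as: x + 1
res

cols <- toy_let(c(COL = "disp"), head(mtcars$COL))  # evaluated as: head(mtcars$disp)
cols
```

The sketch maps names only to other names, so ordinary R evaluation still decides what each name means once the code has been re-written.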
An issue
dplyr issue 2726 (reproduced below) discusses a very important and interesting issue.
At a cursory glance the two discussed expressions and the work-around may seem alien, artificial, or even silly:
(function(x) select(mtcars, !!enquo(x)))(disp)
(function(x) mtcars %>% select(!!enquo(x)))(disp)
(function(x) { x <- enquo(x); mtcars %>% select(!!x)})(disp)
However, this is actually a very crisp and incisive example. In fact, if rlang/tidyeval were a system up for public revision (such as an RFC or some such proposal) you would expect the equivalence of the above to be part of an acceptance suite.
The first expression looks very much like rlang/tidyeval package examples and is the “right way” in rlang/tidyeval to send in a column name parametrically. It is in the style preferred by the new package, so by the package standards it cannot be considered complicated, perverse, or verbose. The second expression differs from the first only by the application of the “magrittr invariant” of “x %>% f() is to be considered equivalent to f(x)”.
The outcome is that the first expression currently executes as expected, and the second expression errors out. This can be considered surprising as this is not something anticipated in the documentation or recipes for building up tidy expressions. This is a leak in the combined abstractions, something we are told to back away from as it doesn’t work.
The proposed work-around (expression 3) is helpful, but itself demonstrates another leak in the mutual abstractions. Think of it this way: suppose we had started with expression 3 as working code. We would by referential transparency expect to be able to refactor the code and replace x with its value and move from this third working example to the second expression (which happens to fail).
To summarize: expressions 1 and 3 are equivalent. They differ by two refactoring steps (introduction/removal of pipes, and introduction/removal of a temporary variable). But we cannot demonstrate the equivalence by chaining the two named transformations (going from 1 to 2 to 3, or from 3 to 2 to 1) as the intermediate expression 2 is apparently not valid.
The wrapr::let version of the issue author’s desired expression 2 is:
(function(x) let(c(X = x), mtcars %>% select(X)))('disp')
Conclusion
wrapr::let() is a useful abstraction:
- It directly takes strings as variable names (the most common source of parametric variable names).
- It is a macro-like replacement and easy to teach as a code re-writing abstraction.
- It has a small interaction surface, and plays well with delayed evaluation packages such as magrittr and dplyr 0.5.0.
- It is “future proof” in the sense it should work with both dplyr 0.5.0 and the coming dplyr 0.6.*.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
And the Bizarro Pipe version of expression 2 is:
(which works correctly).
What is going on is the packages magrittr and rlang both currently consume too much referential transparency to be currently considered fully compatible with each other.
This is related to situations such as dplyr issue 2080 where “dplyr and purrr (or magrittr?) are fighting over what . means”, and magrittr issue 141 which seems to be asking for more deliberate cooperation between the packages (implying coordination problems without such explicit accommodations).
can you provide let equivalent to the new dplyr SE documentation
I assume you mean equivalent to the new NSE documentation?
We do have articles and documentation on wrapr::let() including:
- A vignette: vignette('let', package='wrapr').
- The method help: help('let', package='wrapr').
- Examples: Using replyr::let to Parameterize dplyr Expressions.
- A video lecture: My recent BARUG talk: Parametric Programming in R with replyr.
- Some notes: Parametric Programming in R.
- The package introduction: The wrapr introduction.
- Comparison to the soon to be deprecated SE underbar/underscore methods: Comparative examples using replyr::let.
And many more examples on our blog.
Also I think we should assume the original issue reporter knew of one_of(). Issue reports have to be taken with some trust that they do in fact originate from a meaningful use case, and that prior to being simplified down to an issue report many of the “you could just do x instead” options may not have been available.
I know that I wasn’t the one asked the following question:
I would say my counter answer is as follows.
Designing top-down from a use case (instead of designing bottom-up from desired capabilities or systems) keeps the code simple and gives us answers to a lot of design decisions (such as: should column names carry environments? The answer being: “no”, as the user has no intent to use the environment that happens to be present when they do specify the column name.)
I actually have allowed the system to be a bit more generic than just working with column names, and that is largely to make everything more orthogonal or regular (and hence easier for the user to reason about). Hence you can re-map arbitrary variables to other variables as follows:
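A minimal sketch of such a re-mapping, assuming the wrapr package is installed. The variable names y, z, A, B, and X are illustrative and not taken from the original comment:

```r
library("wrapr")

y <- 7
z <- 3

# X is a stand-in name; the string "y" says which variable it refers to
result <- let(c(X = "y"), X + 1)   # treated as if we had written: y + 1
result

# several stand-in names can be re-mapped in one call
product <- let(c(A = "y", B = "z"), A * B)   # treated as: y * z
product
```

Nothing here is specific to data frames or column names; the stand-in names are simply replaced by other names before ordinary evaluation happens.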
wrapr::let also prohibits a large number of things to deliberately limit the scope of the system and help users find errors much closer to causes. For example wrapr::let only binds names to names, not names to values. For example the following is not allowed:
And this is because R already has much better ways to map names to values (i.e. its standard execution environments) and we don’t want to needlessly displace R’s built in execution semantics where we do not need to do so.
If you have the time I suggest watching my screencast on let-substitution. I spend a lot of time defining the use-case and what wrapr::let does, so you can quickly tell what wrapr::let does, and thus whether it solves a problem you have or not. Since wrapr::let doesn’t try to do everything, it is often clear what one is using it for (i.e., use can be somewhat self-documenting).
A nice article about dplyr 0.6 rlang/tidyeval can be found here. My quibble is: while one may rightly be uncomfortable with expression capture followed by substitution (as this is not how other languages implement macros), expression capture is already used a lot in R (and a very lot in rlang/tidyeval). Also wrapr has been very reliable and stable in production, whereas rlang/tidyeval has a lot of “expression A works, but A’ does not” issues which are collecting as what are commonly called “closed won’t fix” issues in the dplyr and rlang/tidyeval repositories (I know from earlier interactions that these teams don’t seem to use this terminology themselves).
The take-away: wrapr is reliable.