When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.
Rube Goldberg machine 1931 (public domain).
Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it were parametric.
The Problem
dplyr is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr has a complete parametric interface with the “underbar forms” (for example: using dplyr::filter_ instead of dplyr::filter). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.
As an attempted work-around replyr now supplies an adapter that applies a mapping from column names you have (which can be supplied parametrically) to concrete column names you wish you had (which would allow you to write dplyr code simply in terms of known or assumed column names).
It is easier to show than explain.
An Example
First we set up our libraries and type in some notional data as our example:
    library('dplyr')
    library('replyr')
    d <- data.frame(Sepal_Length=c(5.8,5.7),
                    Sepal_Width=c(4.0,4.4),
                    Species='setosa',
                    rank=c(1,2))
    print(d)
    #   Sepal_Length Sepal_Width Species rank
    # 1          5.8         4.0  setosa    1
    # 2          5.7         4.4  setosa    2
Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):
    nmap <- c(GroupColumn='Species',
              ValueColumn='Sepal_Length',
              RankColumn='rank')
    dtmp <- replyr_mapRestrictCols(d,nmap)
    print(dtmp)
    #   GroupColumn ValueColumn RankColumn
    # 1      setosa         5.8          1
    # 2      setosa         5.7          2
At this point you do know the column names (they are the ones you picked) and can write nice neat dplyr code. You can then do your work:
    # pretend this block is a huge sequence of complicated and expensive operations.
    dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
Notice we were able to use dplyr::mutate without needing to use dplyr::mutate_ (and without needing to go to Stack Overflow to look up the lazy-eval notation yet again; imagine the joy in never having to write “dplyr::mutate_(.dots=stats::setNames(ni,ti))” ever again).
Once you have your desired result you restore the original names of the restricted column set:
    invmap <- names(nmap)
    names(invmap) <- as.character(nmap)
    replyr_mapRestrictCols(dtmp,invmap)
    #   Species Sepal_Length rank
    # 1  setosa          5.8    0
    # 2  setosa          5.7    1
If you haven’t worked a lot with dplyr this won’t look that interesting. If you do work a lot with dplyr you may have been asking for something like this for quite a while. If you use dplyr::*_ you will love replyr::replyr_mapRestrictCols. Be aware: replyr::replyr_mapRestrictCols is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.
A Proposal
I feel the replyr::replyr_mapRestrictCols interface represents the correct design for a better dplyr-based adapter.
I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr:
view_as(df,columnNameMap): takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
unview(): removes the view_as annotation.
Obviously there is an issue of nested views. I would suggest maintaining the views as a stack while using the composite transformation implied by the stack of mapping specifications. I am assuming dplyr does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).
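The “composite transformation implied by the stack” can be sketched in a few lines of base R. This is only an illustration of the idea, not something dplyr or replyr provides: the helper names compose_two and composite_view are hypothetical, and the sketch assumes each view is a named character vector (names are the visible column names, values are the names one level down).

```r
# Hypothetical helpers, for illustration only.
# A view map is a named character vector: visible name -> name one level down.

# Resolve an outer view through the view beneath it.
compose_two <- function(outer, inner) {
  hit <- outer %in% names(inner)          # outer targets that the inner view renames
  resolved <- unname(outer)               # names not renamed by inner pass through as-is
  resolved[hit] <- unname(inner[outer[hit]])
  stats::setNames(resolved, names(outer))
}

# stack[[1]] is the view closest to the data; the last element is the top view.
composite_view <- function(stack) {
  Reduce(function(inner, outer) compose_two(outer, inner), stack)
}

stack <- list(
  c(ValueColumn = "Sepal_Length", RankColumn = "rank"),  # first view_as
  c(x = "ValueColumn")                                   # nested view_as
)
top <- composite_view(stack)                       # x resolves to Sepal_Length
below <- composite_view(stack[-length(stack)])     # what an unview() would expose
```

Dropping the last element of the stack models unview(): the composite map falls back to the view underneath.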
I would also suggest that dplyr::arrange be enhanced to have a visible annotation (just the column names it has arranged by) that allows the user to check if the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr data sources would support three primary annotations:
Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (just removes annotation, does not undo arrangement; annotation also removed by any operation that re-orders the data, such as join), and viewed by Xdplyr::arrangement. The annotation is “fragile” (many operations lose the annotation), but some operations (such as the joins) can add an optional argument such as “preserveArrangement” to let the user signal they are willing to expend some effort to have the arrangement preserved. The ability for a pipeline to trigger an exception if an operation is expecting arranged data and it is not arranged would also be handy.
Column Views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.
The “Xdplyr::” items are the extensions that are being proposed.
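To make the “placed by / removed by / viewed by” pattern concrete, here is a rough base R sketch of the proposed Orders annotation for local data frames. None of these functions exist in dplyr; arrange_by, arrangement, unarrange, and require_arranged are hypothetical stand-ins, and storing the annotation in an attribute is just one assumed way to do it.

```r
# Hypothetical stand-ins for the proposed Xdplyr:: operators (local data.frames only).

# Sort by the given columns and record which columns were used.
arrange_by <- function(df, cols) {
  ord <- do.call(order, unname(as.list(df[cols])))
  out <- df[ord, , drop = FALSE]
  attr(out, "arrangement") <- cols   # the visible annotation
  out
}

# View the annotation (NULL if the data is not known to be arranged).
arrangement <- function(df) attr(df, "arrangement", exact = TRUE)

# Remove the annotation only; the rows stay in their current order.
unarrange <- function(df) {
  attr(df, "arrangement") <- NULL
  df
}

# Raise an error when an operation expects arranged data and the annotation is absent.
require_arranged <- function(df, cols) {
  if (!identical(arrangement(df), cols)) {
    stop("data is not known to be arranged by: ", paste(cols, collapse = ", "))
  }
  invisible(df)
}

d <- data.frame(Species = "setosa", rank = c(2, 1))
a <- arrange_by(d, "rank")
arrangement(a)              # "rank"
require_arranged(a, "rank") # passes silently
```

A window-function step could call require_arranged() first, which is the kind of pipeline check the proposal asks for.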
Another variation is the “view object” view_of(df,columnNameMap) which builds a reference object that uses the new column names and effects translated changes on the original object. In this variation the user has more direct control of the view composition.
An Alternative Proposal
Another possibility would be some sort of “let” statement that controls name bindings for the duration of a block of code.
I’ll call this the “let block proposal.” The advantage of “let” is the block goes in and out of scope in an orderly manner; the disadvantage is the re-namings are not shared with called functions.
Using such a statement we would write our above example calculation as:
    let(
      list(RankColumn='rank',GroupColumn='Species'),
      {
        # pretend this block is a huge sequence of complicated and expensive operations.
        d %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
      }
    )
The idea is the items 'rank' and 'Species' could be passed in parametrically (notice the let specification is essentially nmap, so we could just pass that in). This isn’t quite R’s “with” statement as we are not binding names to values, but names to names. Essentially we are asking for a macro facility that is compatible with dplyr remote data sources (and the non-standard evaluation methods used to capture variable names).
It turns out gtools::strmacro is nearly what we need. For example, the following works:
    gtools::strmacro(
      RankColumn='rank',
      expr={
        d %>% mutate(RankColumn=RankColumn-1)
      }
    )()
But the above stops just short of taking in the original column names parametrically. The following does not work:
    RankColumnName <- 'rank'
    gtools::strmacro(
      RankColumn=RankColumnName,
      expr={
        d %>% mutate(RankColumn=RankColumn-1)
      }
    )()
I was able to adapt code from gtools::strmacro to create a working let-block implemented as replyr::let:
    replyr::let(
      alias=nmap,
      expr={
        d %>% mutate(RankColumn=RankColumn-1)
      })
    #   Sepal_Length Sepal_Width Species rank
    # 1          5.8         4.0  setosa    0
    # 2          5.7         4.4  setosa    1
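For readers curious how a names-to-names let-block can work at all, here is a minimal base R sketch of the string-substitution idea. It is not the replyr::let implementation (nor the gtools::strmacro one); let_sketch is a hypothetical name, and base R’s transform() stands in for dplyr::mutate so the example needs no packages.

```r
# Hypothetical sketch: rewrite names in the text of an expression, then evaluate it.
let_sketch <- function(alias, expr) {
  # alias is evaluated normally, so it can be built from function arguments;
  # expr is captured unevaluated and turned into source text.
  code <- deparse(substitute(expr))
  for (nm in names(alias)) {
    # whole-word replacement of the stand-in name by the real column name
    code <- gsub(paste0("\\b", nm, "\\b"), alias[[nm]], code, perl = TRUE)
  }
  # evaluate the rewritten code where let_sketch was called
  eval(parse(text = code), envir = parent.frame())
}

d <- data.frame(Sepal_Length = c(5.8, 5.7), rank = c(1, 2))
rankName <- "rank"   # the column name arrives as a value
res <- let_sketch(
  c(RankColumn = rankName),
  transform(d, RankColumn = RankColumn - 1)
)
```

Because the rewrite happens on text, both the argument name and the symbol inside the expression become rank, so the column is updated in place. The same property is also the weakness of this sketch: any whole-word match is replaced, including one inside a string literal.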
Conclusion
I feel the above methods will make working with parameterized variables in dplyr much easier.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Still confused. How does this help me write dplyr commands inside a function where the column names have to be passed to the function?
The idea is you write the interior of the function as if you enforced a given set of column names on your users. For our example we would write our function assuming the column names were exactly ‘GroupColumn’, ‘ValueColumn’, and ‘RankColumn’. Then you just add a bit of code at the top of your function re-mapping the user supplied column names to these fixed columns and another bit to map back from the fixed column names to the user supplied ones.
Think of it as having some code lying around that works for a specific data.frame (with known column names). You just copy-paste that code into a function body and add the pre-mapping and un-mapping (as described above) and the original code now masquerades as a general function. You would have to add arguments to the function that take the names (so you know what to map to), but that isn’t too hard.
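The recipe described in the reply above might look like the following sketch. It uses only base R so it runs without any packages: the function name rank_from_zero and its argument names are hypothetical, transform() stands in for the dplyr code you would paste in, and the base renaming shown here works on local data frames only.

```r
# Hypothetical wrapper: the body is written against fixed column names,
# and the caller supplies the real names as strings.
rank_from_zero <- function(d, groupColumn, valueColumn, rankColumn) {
  nmap <- c(GroupColumn = groupColumn,
            ValueColumn = valueColumn,
            RankColumn = rankColumn)

  # pre-mapping: restrict to the named columns and give them the fixed names
  dtmp <- d[, unname(nmap), drop = FALSE]
  colnames(dtmp) <- names(nmap)

  # the pasted-in code, written as if the columns were always called this
  dtmp <- transform(dtmp, RankColumn = RankColumn - 1)

  # un-mapping: restore the caller's column names
  colnames(dtmp) <- unname(nmap)
  dtmp
}

d <- data.frame(Sepal_Length = c(5.8, 5.7),
                Sepal_Width = c(4.0, 4.4),
                Species = 'setosa',
                rank = c(1, 2))
res <- rank_from_zero(d, 'Species', 'Sepal_Length', 'rank')
```

The caller never sees GroupColumn, ValueColumn, or RankColumn; they exist only inside the function body.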
Regarding your “let” possibility, you might want to check out substitute … ;)
It might just be me, but I could not get substitute to completely work for this task. Notice in the example below it adds a new column instead of mutating “rank” in place:
But if substitute can be made to work or improves on the implementation of let I’d be happy to take a pull request ( https://github.com/WinVector/replyr/blob/master/R/let.R ). There are some subtleties when you want to re-map (possibly unbound) names instead of values in the presence of non-standard evaluation. Looking at the differences in gtools::defmacro and gtools::strmacro shows some of the distinctions.
Note: we have replaced replyr_renameRestrictCols with replyr_mapRestrictCols which specifies the map in a different order.
Just fyi, there was a conversation on this topic recently on reddit and Hadley mentioned he has some features coming in this vein already:
https://www.reddit.com/r/rstats/comments/5gm658/how_to_insert_variable_contents_into_dplyr/
Thanks for the info.
I think the best solutions are going to be in dplyr itself (which is why I made the dplyr “column view stack proposal” public on December 3rd 2016). The physical re-mapping and let-blocks are both work-arounds until something better is available. Given Hadley Wickham’s comment of December 5th 2016 it looks like something is on the way.
The replyr package is itself a collection of work-arounds; it is just that replyr::let and replyr::gapply are currently the ones with broadest impact.
That being said I am actually really pleased with how well replyr::let is working on some example problems.
The sort of adaption we really want, building a light weight view that appears to re-map data columns, is something that is very easy in classic interface based object-oriented languages. We would just create a reference class that claims to implement a tbl interface, and it would re-route all operations to an encapsulated instance. This sort of adaption isn’t as easy in a functional setting as we really don’t know where there is an impenetrable abstraction layer and have no way to enforceably document such an interface or layer. This is one of the cases that the “functional programming is always better than object oriented programming” argument just assumes away. Or remember: even in the land of the verbs (functional programming) data structures are pretty much nouns (objects).
replyr is a good idea; however, it seems that I can do the same thing by just constructing a new function.
Definitely you can get related effects/transforms other ways. replyr is one way, and is in fact handy in building wrapping functions. Nina Zumel has a nice example here: https://cran.r-project.org/web/packages/replyr/vignettes/letExample.html
I think I can do the same with magrittr::set_colnames. What’s more I can do it (almost) inside one %>% chain:
chain:originalColNames% set_colnames(standardColNames) %>% ... (some dplyr stuff) ... %>% set_colnames(originalColNames)
Thanks Lukasz,
That sounds about right as it looks like magrittr::set_colnames is intended as an alias for `colnames<-`. So it works like replyr::replyr_mapRestrictCols.
However I would want to check if it is “dplyr pure” which I define as working with arbitrary dplyr data services (i.e. only using dplyr verbs). This matters for remote dplyr services (such as Spark and PostgreSQL). print(magrittr::set_colnames) shows a lot of data.frame and matrix specific manipulations (such as names(x) <- value which I think is not the dplyr verb for renaming a column).
(Edit: confirming that print(magrittr::set_colnames) does not work on remote dplyr data service handles.)
I never worked with remote dplyr services. Now I know I have to be careful with them :) Thank you!
I finally got substitute and other non-replyr examples to work and cleaned up: https://github.com/WinVector/replyr/blob/master/vignettes/ParametricExample.Rmd .