Menu Home

Parametric variable names and dplyr

When writing reusable code or packages you often do not know the names of the columns or variables you need to work over. This is what I call “parametric treatment of variables.” This can be a problem when using R libraries that assume you know the variable names. The R data manipulation library dplyr currently supports parametric treatment of variables through “underbar forms” (methods of the form dplyr::*_), but their use can get tricky.

NewImage
Rube Goldberg machine 1931 (public domain).

Better support for parametric treatment of variable names would be a boon to dplyr users. To this end the replyr package now has a method designed to re-map parametric variable names to known concrete variable names. This allows concrete dplyr code to be used as if it was parametric.

The Problem

dplyr is a library that prefers you know the name of the column you want to work with. This is great when performing a specific analysis, but somewhat painful when supplying re-usable functions or packages. dplyr has a complete parametric interface with the “underbar forms” (for example: using dplyr::filter_ instead of dplyr::filter). However, the underbar notation (and the related necessary details around specifying lazy evaluation of formulas) rapidly becomes difficult.

As an attempted work-around replyr now supplies an adapter that applies a mapping from column names you have (which can be supplied parametrically) to concrete column names you wish you had (which would allow you to write dplyr code simply in terms of known or assumed column names).

It is easier to show than explain.

An Example

First we set up our libraries and type in some notional data as our example:

library('dplyr')
library('replyr')

d <- data.frame(Sepal_Length=c(5.8,5.7),
                   Sepal_Width=c(4.0,4.4),
                   Species='setosa',
                   rank=c(1,2))
print(d)

 #   Sepal_Length Sepal_Width Species rank
 # 1          5.8         4.0  setosa    1
 # 2          5.7         4.4  setosa    2

Then we rename the columns to standard values while restricting to only the named columns (this is the magic step):

nmap <- c(GroupColumn='Species',
             ValueColumn='Sepal_Length',
             RankColumn='rank')
dtmp <- replyr_mapRestrictCols(d,nmap)
print(dtmp)

 #   GroupColumn ValueColumn RankColumn
 # 1      setosa         5.8          1
 # 2      setosa         5.7          2

At this point you do know the column names (they are the ones you picked) and can write nice neat dplyr. You can then do your work:

# pretend this block is a huge sequence of complicated and expensive operations.
dtmp %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero

Notice we were able to use dplyr::mutate without needing to use dplyr::mutate_ (and without needing to go to Stack Overflow to lookup the lazy-eval notation yet again; imagine the joy in never having to write “dplyr::mutate_(.dots=stats::setNames(ni,ti))” ever again).

Once you have your desired result you restore the original names of our restricted column set:

invmap <- names(nmap)
names(invmap) <- as.character(nmap)
replyr_mapRestrictCols(dtmp,invmap)

 #   Species Sepal_Length rank
 # 1  setosa          5.8    0
 # 2  setosa          5.7    1

If you haven’t worked a lot with dplyr this won’t look that interesting. If you do work a lot with dplyr you may have been asking for something like this for quite a while. If you use dplyr::*_ you will love replyr::replyr_mapRestrictCols. Be aware: replyr::replyr_mapRestrictCols is a bit of a hack; it mutates all of the columns it is working with, which is unlikely to be a cheap operation.

A Proposal

I feel the replyr::replyr_mapRestrictCols interface represents the correct design for a better dplyr based adapter.

I’ll call this the “column view stack proposal.” I would suggest the addition of two operators to dplyr:

  1. view_as(df,columnNameMap) takes a data item and returns a data item reference that behaves as if the column names have been re-mapped.
  2. unview() removes the view_as annotation.

Obviously there is an issue of nested views, I would suggest maintaining the views as a stack while using the composite transformation implied by the stack of mapping specifications. I am assuming dplyr does not currently have such a facility. Another possibility is a term-rewriting engine to re-map formulas from standard names to target names, but this is what the lazy-eval notations are already attempting (and frankly it isn’t convenient or pretty).

I would also suggest that dplyr::arrange be enhanced to have a visible annotation (just the column names it has arranged by) that allows the user to check if the data is believed to be ordered (crucial for window-function applications). With these two suggestions dplyr data sources would support three primary annotations:

  1. Groups: placed by dplyr::group_by, removed by dplyr::ungroup, and viewed by dplyr::groups.
  2. Orders: placed by dplyr::arrange, removed by Xdplyr::unarrange (just removes annotation, does not undo arrangement; annotation also removed by any operation that re-orders the data, such as join), and viewed by Xdplyr::arrangement. The annotation is “fragile” (many operations lose the annotation), but some operations (such as the joins) can add an optional argument such as “preserveArrangement” to let the user signal they are willing to expend some effort to have the arrangement preserved. The ability for pipeline to trigger an exception if an operation is expecting arranged data and it is not arranged would also be handy.
  3. Column Views: placed by Xdplyr::view_as, removed by Xdplyr::unview, and viewed by Xdplyr::views.

The “Xdplyr::” items are the extensions that are being proposed.

Another variation is the “view object” view_of(df,columnNameMap) which builds a reference object that uses the new column names and effects translated changes on the original object. In this variation the user has more direct control of the view composition.

An Alternative Proposal

Another possibility would be some sort of “let” statement that controls name bindings for the duration of a block of code.

I’ll call this the “let block proposal.” The advantage of “let” is the block goes in and out of scope in an orderly manner, the disadvantage is the re-namings are not shared with called functions.

Using such a statement we would write our above example calculation as:

let(
 list(RankColumn='rank',GroupColumn='Species'),
 {
  # pretend this block is a huge sequence of complicated and expensive operations.
  d %>% mutate(RankColumn=RankColumn-1) -> dtmp # start ranks at zero
 }
)

The idea is the items 'rank' and 'Species' could be passed in parametrically (notice the let specification is essentially nmap, so we could just pass that in). This isn’t quite R‘s “with” statement as we are not binding names to values, but names to names. Essentially we are asking for macro facility that is compatible with dplyr remote data sources (and the non-standard evaluation methods used to capture variables names).

It turns out gtools::strmacro is nearly what we need. For example following works:

gtools::strmacro(
  RankColumn='rank',
  expr={ d %>% mutate(RankColumn=RankColumn-1) }
)()

But the above stops just short of taking in the original column names parametrically. The following does not work:

RankColumnName % mutate(RankColumn=RankColumn-1) }
)()

I was was able to adapt code from gtools::strmacro to create a working let-block implemented as replyr::let:

replyr::let(
   alias=nmap,
   expr={
     d %>% mutate(RankColumn=RankColumn-1) 
   })

 #   Sepal_Length Sepal_Width Species rank
 # 1          5.8         4.0  setosa    0
 # 2          5.7         4.4  setosa    1

Conclusion

I feel the above methods will make working with parameterized variables in dplyr much easier.

Categories: Opinion

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

14 replies

  1. still confused. How does this help me write dplyr commands inside a function where the column names have to be passed to the function.

  2. The idea is you write the interior of the function as if you enforced a given set of column names on your users. For our example we would write our function assuming the column names were exactly ‘GroupColumn’, ‘ValueColumn’, and ‘RankColumn’. Then you just add a bit of code at the top of your function re-mapping the user supplied column names to these fixed columns and another bit to map back from the fixed column names to the user supplied ones.

    Think of it as having some code lying around that works for a specific data.frame (with known column names). You just copy-paste that code into a function body and add the pre-mapping and un-mapping (as described above) and the original code now masquerades as a general function. You would have to add arguments to the function that take the names (so you know what to map to), but that isn’t too hard.

    1. It might just be me, but I could not get substitute to completely work for this task. Notice in the example below it adds a new column instate of mutating “rank” in place:

      library('dplyr')
      d <- data.frame(Sepal_Length=c(5.8,5.7),
                      Sepal_Width=c(4.0,4.4),
                      Species='setosa',
                      rank=c(1,2))
      eval(substitute(d %>% mutate(RankColumn=RankColumn-1),
                      list(RankColumn=quote(rank))))
      
       #   Sepal_Length Sepal_Width Species rank RankColumn
       # 1          5.8         4.0  setosa    1          0
       # 2          5.7         4.4  setosa    2          1
      

      But if substitute can be made to work or improves on the implementation of let I’d be happy to take a pull request ( https://github.com/WinVector/replyr/blob/master/R/let.R ). There are some subtleties when you want to re-map (possibly unbound) names instead of values in the presence of non-standard evaluation. Looking at the differences in gtools::defmacro and gtools::strmacro shows some of the distinctions.

    1. Thanks for the info.

      I think the best solutions are going to be in dplyr itself (which is why I made the dplyr “column view stack proposal” public on December 3rd 2016). The physical re-mapping and let-blocks are both work-arounds until something better is available. Given Hadley Wickham’s comment of December 5th 2016 it looks like something is on the way.

      The replyr package is itself a collection of work-arounds, it is just replyr::let and replyr::gapply are currently the ones with broadest impact.

      That being said I am actually really pleased with how well replyr::let is working on some example problems.

  3. The sort of adaption we really want: building a light weight view that appears to re-map data columns is something that is very easy in classic interface based object-oriented languages. We would just create a reference class that claims to implement a tbl interface, and it would re-route all operations to an encapsulated instance. This sort of adaption isn’t as easy in a functional setting as we really don’t know where there is an impenetrable abstraction layer and have no way to enforceably document such an interface or layer. This is one of the cases that the “functional programming is always better than object oriented programming” argument just assumes away. Or remember: even in the land of the verbs (functional programming) data structures are pretty much nouns (objects).

  4. replyr is a good idea, however, it seems that I can do the same thing just construct a new function.

  5. I think I can do the same with magrittr::set_colnames. What’s more I can do it (almost) inside one %>% chain:

    originalColNames% set_colnames(standardColNames) %>% ... (some dplyr stuff) ... %>% set_colnames(originalColNames)

    1. Thanks Lukasz,

      That sounds about right as it looks like magrittr::set_colnames is intended as an alias for `colnames<-`. So it works like replyr::replyr_mapRestrictCols.

      However I would want to check if it is “dplyr pure” which I define as working with arbitrary dplyr data services (i.e. only using dplyr verbs). This matters for remote dplyr services (such as Spark and PostgreSQL) .

      print(magrittr::set_colnames) shows a lot of data.frame and matrix specific manipulations (such as names(x) <- value which I think is not the dplyr verb for renaming a column).

      (Edit: confirming that print(magrittr::set_colnames) does not work on remote dplyr data service handles.)

      library("replyr")
      library("magrittr")
      
      # works on local data frames
      d <- data.frame(x= 1:2)
      set_colnames(d,'x2')
      replyr_mapRestrictCols(d,list(x='x2'), reverse=TRUE)
      
      my_db <- dplyr::src_sqlite(":memory:", create = TRUE)
      d2 <- replyr_copy_to(my_db, d, 'd2')
      replyr_mapRestrictCols(d2,list(x='x2'), reverse=TRUE)
      
      # does not work on dplyr handles
      set_colnames(d2,'x2')
      
%d bloggers like this: