Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In either case, you may want the summary broken down with respect to groupings in the data. In other words, you want a table of values, something like this:

    dist_intervals(iris, "Sepal.Length", "Species")

    # A tibble: 3 × 7
         Species  sdlower  mean  sdupper iqrlower median iqrupper
    1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
    2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
    3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

For a specific data frame, with known column names, such a table is easy to construct using `dplyr::group_by` and `dplyr::summarize`. But what if you want a function to calculate this table on an arbitrary data frame, with arbitrary quantity and grouping columns? Writing such a function in `dplyr` can get quite hairy, quite quickly. Try it yourself, and see.
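For reference, the easy fixed-column case might look like the sketch below. It assumes `dplyr` is installed; the name `fixed_summary` is ours, chosen for illustration. Note that `Sepal.Length` and `Species` are hard-coded, which is exactly what makes it awkward to reuse.

```r
library(dplyr)

# Fixed-column version: the quantity (Sepal.Length) and the grouping
# column (Species) are written directly into the expression.
fixed_summary <- iris %>%
  group_by(Species) %>%
  summarize(sdlower  = mean(Sepal.Length) - sd(Sepal.Length),
            mean     = mean(Sepal.Length),
            sdupper  = mean(Sepal.Length) + sd(Sepal.Length),
            iqrlower = median(Sepal.Length) - 0.5 * IQR(Sepal.Length),
            median   = median(Sepal.Length),
            iqrupper = median(Sepal.Length) + 0.5 * IQR(Sepal.Length))

print(fixed_summary)
```

Turning the two hard-coded names into function arguments is the step that gets hairy.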

Enter `let`, from our new package `replyr`.

`replyr::let` implements a mapping from the “symbolic” names used in a `dplyr` expression to the names of the actual columns in a data frame. This allows you to encapsulate complex `dplyr` expressions without the use of the `lazyeval` package, which is the currently recommended way to manage `dplyr`'s use of non-standard evaluation. Thus, you could write the function to create the table above as:

    # to install replyr:
    # devtools::install_github('WinVector/replyr')

    library(dplyr)
    library(replyr)

    #
    # calculate mean +/- sd intervals and
    #           median +/- 1/2 IQR intervals
    # for arbitrary data frame column, with optional grouping
    #
    dist_intervals = function(dframe, colname, groupcolname=NULL) {
      mapping = list(col=colname)
      if(!is.null(groupcolname)) {
        dframe %>% group_by_(groupcolname) -> dframe
      }
      let(alias=mapping,
          expr={
            dframe %>% summarize(sdlower = mean(col)-sd(col),
                                 mean = mean(col),
                                 sdupper = mean(col) + sd(col),
                                 iqrlower = median(col)-0.5*IQR(col),
                                 median = median(col),
                                 iqrupper = median(col)+0.5*IQR(col))
          })
    }

The mapping is specified as a list of assignments *symname*=*colname*, where *symname* is the name used in the `dplyr` expression, and *colname* is the name (as a string) of the corresponding column in the data frame. We can now call our `dist_intervals` on the `iris` dataset:

    dist_intervals(iris, "Sepal.Length")

       sdlower     mean  sdupper iqrlower median iqrupper
    1 5.015267 5.843333 6.671399     5.15    5.8     6.45

    dist_intervals(iris, "Sepal.Length", "Species")

    # A tibble: 3 × 7
         Species  sdlower  mean  sdupper iqrlower median iqrupper
    1     setosa 4.653510 5.006 5.358490   4.8000    5.0   5.2000
    2 versicolor 5.419829 5.936 6.452171   5.5500    5.9   6.2500
    3  virginica 5.952120 6.588 7.223880   6.1625    6.5   6.8375

    dist_intervals(iris, "Petal.Length", "Species")

    # A tibble: 3 × 7
         Species  sdlower  mean  sdupper iqrlower median iqrupper
    1     setosa 1.288336 1.462 1.635664   1.4125   1.50   1.5875
    2 versicolor 3.790089 4.260 4.729911   4.0500   4.35   4.6500
    3  virginica 5.000105 5.552 6.103895   5.1625   5.55   5.9375
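If you want to convince yourself of what the columns mean, a single row can be reproduced with base R alone. The sketch below recomputes the setosa row for `Sepal.Length`; the variable names `x` and `setosa_row` are ours, for illustration only.

```r
# Reproduce the setosa row for Sepal.Length using base R only.
x <- iris$Sepal.Length[iris$Species == "setosa"]

setosa_row <- c(sdlower  = mean(x) - sd(x),           # mean minus one sd
                mean     = mean(x),
                sdupper  = mean(x) + sd(x),           # mean plus one sd
                iqrlower = median(x) - 0.5 * IQR(x),  # median minus half the IQR
                median   = median(x),
                iqrupper = median(x) + 0.5 * IQR(x))  # median plus half the IQR

print(round(setosa_row, 6))
```

The values should agree with the setosa row of the grouped `Sepal.Length` table above.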

The implementation of `let` is adapted from `gtools::strmacro` by Gregory R. Warnes. Its primary purpose is wrapping `dplyr`, but you can use it to parameterize other functions that take their arguments via non-standard evaluation, like `ggplot2` functions. In other words, you can use `replyr::let` instead of `ggplot2::aes_string`, if you are feeling perverse. Because `let` creates a macro, you have to avoid variable collisions (for example, remapping `x` in `ggplot2` will clobber both sides of `aes(x=x)`), and you should remember that any side effects of the expression will escape `let`'s execution environment.
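To make the macro idea concrete, here is a minimal base-R sketch of text-based name replacement in the spirit of `gtools::strmacro`. The function `toy_let` is a hypothetical stand-in written for this illustration. It is not the actual `replyr::let` code, and it skips the checks a real implementation would need.

```r
# toy_let: hypothetical, simplified stand-in for a name-replacement macro.
# It rewrites whole-word names in the deparsed expression text, then
# parses and evaluates the rewritten text in the caller's environment.
toy_let <- function(alias, expr) {
  body <- paste(deparse(substitute(expr)), collapse = "\n")
  for (symname in names(alias)) {
    pattern <- paste0("\\b", symname, "\\b")
    body <- gsub(pattern, alias[[symname]], body)
  }
  eval(parse(text = body), envir = parent.frame())
}

d <- data.frame(x = c(1, 2, 3), y = c(10, 20, 30))

# Names are replaced with names: 'col' becomes 'y' before evaluation.
total <- toy_let(list(col = "y"), with(d, sum(col)))
print(total)

# Collision: remapping 'x' rewrites the argument name and the value alike.
rewritten <- gsub("\\bx\\b", "Sepal.Length", "aes(x = x)")
print(rewritten)
```

Because the rewritten text is evaluated in the calling environment, any assignment made inside the expression would persist there, which is the side-effect caveat mentioned above.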

The `replyr` package is available on GitHub. Its goal is to supply uniform `dplyr`-based methods for manipulating data frames and `tbl`s both locally and on remote (`dplyr`-supported) back ends. This is a new package, and it is still going through growing pains as we figure out the best ways to implement desired functionality. We welcome suggestions for new functions, and more efficient or more general ways to implement the functionality that we supply.


### Nina Zumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.

As Nina stated: replyr::let is based (with attribution) on gtools::strmacro. It sets up a useful macro replacement of names with names (not values), so it has somewhat different semantics than base::substitute, such as being able to rewrite both the left-hand sides and the right-hand sides of dplyr::mutate assignments.
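The difference from base::substitute can be seen without loading any packages. The sketch below uses base R only; the column names `newcol`, `oldcol`, `x`, and `y` are made up for illustration, and the `mutate` call is built but never evaluated. The textual rewrite stands in for the macro idea and is not replyr's actual code.

```r
# base::substitute replaces symbols inside the expression, but argument
# names (the left-hand side here) are not symbols, so 'newcol' is kept.
via_substitute <- substitute(
  mutate(d, newcol = oldcol + 1),
  list(newcol = as.name("y"), oldcol = as.name("x"))
)
print(via_substitute)

# A textual name-to-name rewrite changes both sides of the assignment.
text <- "mutate(d, newcol = oldcol + 1)"
text <- gsub("\\bnewcol\\b", "y", text)
text <- gsub("\\boldcol\\b", "x", text)
via_text <- parse(text = text)[[1]]
print(via_text)
```

Only the second form renames the column being created, which is what you need when the output column name is itself a parameter.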

The code in the let block can be arbitrarily large (so eventually the relative cost of the boilerplate goes down). help(let) has some more detailed examples including piping through a let block.

This is based on my earlier proposal for extensions to dplyr, but is separate in that it is a stand-alone, functioning workaround.

Nina also demonstrates replyr::let with ggplot2 in vignette('letExample', 'replyr'). We also have overall documentation in vignette('replyr', 'replyr').

replyr tries to be “dplyr pure.” If you write “pure dplyr code” (that is, code that works on any dplyr back end, i.e. avoiding d$column or d[['column']] in favor of “d %>% select(column)”), then the combined code should also work with many dplyr data services (we test with tbl, SQLite, MySQL, PostgreSQL, sparklyr/Spark1.6.2, sparklyr/Spark2.0.2), assuming you only use functions available on the service in question (which varies a lot for window functions like cumsum and functions like median).

Nina has an additional “why you should care” article here: https://ninazumel.com/2016/12/06/new-win-vector-package-replyr-for-easier-dplyr/ . We are not against lazyeval, just against the end user having to directly call lazyeval without any helpers/wrappers.

replyr is now available on CRAN! https://CRAN.R-project.org/package=replyr