For R dplyr users, one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself: in particular, to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.
Let’s take a look at this capability.
First let’s start dplyr.
suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.1.9000'
A dplyr pattern that I have seen used often is the "group_by() %>% mutate()" pattern. This historically has been shorthand for a "group_by() %>% summarize()" followed by a join(). It is easiest to show by example.
The following code:
mtcars %>%
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg),
         group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp,
         group_mean_mpg, group_mean_disp) %>%
  head()
## # A tibble: 6 x 6
## # Groups: cyl, gear [4]
## cyl gear mpg disp group_mean_mpg group_mean_disp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 4 21.0 160 19.750 163.8000
## 2 6 4 21.0 160 19.750 163.8000
## 3 4 4 22.8 108 26.925 102.6250
## 4 6 3 21.4 258 19.750 241.5000
## 5 8 3 18.7 360 15.050 357.6167
## 6 6 3 18.1 225 19.750 241.5000
is taken to be shorthand for:
mtcars %>%
  group_by(cyl, gear) %>%
  summarize(group_mean_mpg = mean(mpg),
            group_mean_disp = mean(disp)) %>%
  left_join(mtcars, ., by = c('cyl', 'gear')) %>%
  select(cyl, gear, mpg, disp,
         group_mean_mpg, group_mean_disp) %>%
  head()
## cyl gear mpg disp group_mean_mpg group_mean_disp
## 1 6 4 21.0 160 19.750 163.8000
## 2 6 4 21.0 160 19.750 163.8000
## 3 4 4 22.8 108 26.925 102.6250
## 4 6 3 21.4 258 19.750 241.5000
## 5 8 3 18.7 360 15.050 357.6167
## 6 6 3 18.1 225 19.750 241.5000
The advantages of the shorthand are:
- The analyst only has to specify the grouping columns once.
- The data (mtcars) enters the pipeline only once.
- The analyst doesn't have to start thinking about joins immediately.
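As an aside, base R has its own analogue of the non-collapsing group calculation: ave() computes a per-group statistic and broadcasts it back onto every row. A minimal sketch (base R only, separate from the dplyr pipelines under discussion):

```r
# ave() applies FUN within each (cyl, gear) group and returns a vector
# the same length as its input, so each row gets its own group's mean.
group_mean_mpg  <- ave(mtcars$mpg,  mtcars$cyl, mtcars$gear, FUN = mean)
group_mean_disp <- ave(mtcars$disp, mtcars$cyl, mtcars$gear, FUN = mean)
head(group_mean_mpg)  # first rows match the group_mean_mpg column shown above
```

This avoids both the join and the grouped-mutate machinery, though it does not generalize to remote (SQL-backed) data the way a dplyr verb can.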
Frankly I’ve never liked the shorthand. I feel it is a "magic extra" that a new user would have no way of anticipating from common use of group_by() and summarize(). I very much like the idea of wrapping this important common use case into a single verb. Adjoining "windowed" or group-calculated columns is a common and important step in analysis, and well worth having its own verb.
Below is our attempt at elevating this pattern into a packaged verb.
#' Simulate the group_by/mutate pattern
#' with an explicit summarize and join.
#'
#' Group a data frame by the groupingVars
#' and compute user summaries on this data
#' frame (user summaries specified in ...),
#' then join these new columns back into
#' the original data and return to the
#' user. It is a demonstration of a
#' higher-order dplyr verb.
#'
#' Author: John Mount, Win-Vector LLC.
#'
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... dplyr::summarize commands.
#' @return d with grouped summaries added as extra columns
#'
#' @examples
#'
#' add_group_summaries(mtcars,
#'                     c("cyl", "gear"),
#'                     group_mean_mpg = mean(mpg),
#'                     group_mean_disp = mean(disp)) %>%
#'   head()
#'
#' @export
#'
add_group_summaries <- function(d,
                                groupingVars,
                                ...) {
  # convert char vector into spliceable vector
  groupingSyms <- rlang::syms(groupingVars)
  d <- ungroup(d) # just in case
  dg <- group_by(d, !!!groupingSyms)
  ds <- summarize(dg, ...)
  # work around https://github.com/tidyverse/dplyr/issues/2963
  ds <- ungroup(ds)
  left_join(d, ds, by = groupingVars)
}
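The only non-obvious step in the verb is the rlang::syms()/!!! pair, so it may be worth seeing that conversion on its own. A small sketch (assuming the rlang package is available):

```r
library(rlang)

# syms() turns a character vector into a list of bare symbols.
groupingSyms <- syms(c("cyl", "gear"))
print(groupingSyms)

# Each element is a symbol (a column name), not a string, so splicing
# with !!! makes group_by(d, !!!groupingSyms) behave as if the user
# had typed group_by(d, cyl, gear) directly.
vapply(groupingSyms, as_string, character(1))
```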
This works as follows:
mtcars %>%
  add_group_summaries(c("cyl", "gear"),
                      group_mean_mpg = mean(mpg),
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp,
         group_mean_mpg, group_mean_disp) %>%
  head()
## cyl gear mpg disp group_mean_mpg group_mean_disp
## 1 6 4 21.0 160 19.750 163.8000
## 2 6 4 21.0 160 19.750 163.8000
## 3 4 4 22.8 108 26.925 102.6250
## 4 6 3 21.4 258 19.750 241.5000
## 5 8 3 18.7 360 15.050 357.6167
## 6 6 3 18.1 225 19.750 241.5000
And this also works on SQLite-backed dplyr data (which the shorthand currently does not; please see dplyr issues 2887 and 2960).
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg),
         group_mean_disp = mean(disp))
## Error: Window function `avg()` is not supported by this database
mtcars2 %>%
  add_group_summaries(c("cyl", "gear"),
                      group_mean_mpg = mean(mpg),
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp,
         group_mean_mpg, group_mean_disp) %>%
  head()
## # Source: lazy query [?? x 6]
## # Database: sqlite 3.11.1 [:memory:]
## cyl gear mpg disp group_mean_mpg group_mean_disp
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6 4 21.0 160 19.750 163.8000
## 2 6 4 21.0 160 19.750 163.8000
## 3 4 4 22.8 108 26.925 102.6250
## 4 6 3 21.4 258 19.750 241.5000
## 5 8 3 18.7 360 15.050 357.6167
## 6 6 3 18.1 225 19.750 241.5000
The above, and many more useful dplyr standard evaluation adapters are now all part of the new package seplyr (on CRAN and GitHub).
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
The code for this, and for an additional group_summarize() verb, can be found here.

It took me a while to discover what group_by() with mutate() would do (no collapsing) and somehow found it unintuitive in an otherwise very intuitive package. By comparison, in Stata the equivalent:
by cyl gear: egen group_mean_mpg = mean(mpg)
does what you'd expect, without collapsing the way summarize() in dplyr does.
I come from a SQL background, so I find the non-collapsing case to be the odd one. My gut always says: build a table with the right keying and join it back in.

However, running dplyr::show_query() on the examples (this time using a PostgreSQL back end) is very illuminating. We see the summarize collapse triggered by the SQL group command, and the windowed calculation co-mingling with other results in the mutate example.

suppressPackageStartupMessages(library("dplyr"))
library('RPostgreSQL')
#> Loading required package: DBI

# Local Postgres.app database; no password by default.
# Of course, you fill in your own database information here.
con <- DBI::dbConnect(DBI::dbDriver("PostgreSQL"),
                      host = 'localhost', port = 5432,
                      user = 'postgres', password = 'pg')
copy_to(con, mtcars %>% mutate(id = row_number()), 'mtcars')
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>%
  select(id, mpg, cyl) %>%
  group_by(cyl) %>%
  mutate(mpg_min = min(mpg)) %>%
  show_query()
#> <SQL>
#> SELECT "id", "mpg", "cyl", min("mpg") OVER (PARTITION BY "cyl") AS "mpg_min"
#> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
#> FROM "mtcars") "kkztfjtpuw"

mtcars2 %>%
  select(id, mpg, cyl) %>%
  group_by(cyl) %>%
  summarize(mpg_min = min(mpg)) %>%
  show_query()
#> <SQL>
#> SELECT "cyl", MIN("mpg") AS "mpg_min"
#> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
#> FROM "mtcars") "pdtcvcoqju"
#> GROUP BY "cyl"

group_by() also works with mutate(), eliminating the need for summarise() and join():
mtcars %>%
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg),
         group_mean_disp = mean(disp)) %>%
  head()
Yes, that is what I hoped to indicate with “The analyst doesn’t have to start thinking about joins immediately”.
And the shorthand works, until it does not.
suppressPackageStartupMessages(library("dplyr"))
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg),
         group_mean_disp = mean(disp)) %>%
  head()
#> Error: Window function `avg()` is not supported by this database

A thank-you to rlang author Lionel Henry for taking the time to point out we can use rlang::syms for the symbol creation. I have updated the code here and in GitHub.

A similar pattern that I find myself often using to avoid a join is the group_by() %>% summarize() where the grouping is over-specified. An example of this is a time series with columns year, month, date, value (multiple values per day), where the analyst wants to summarize by date.
A simple group_by(date) %>% summarize(day_value = sum(value)) will drop columns that we want to retain (year, month) but could re-derive (or join back in). To avoid the hassle, there are multiple options, all sub-optimal:
# Over-specify grouping
dta %>%
  group_by(year, month, date) %>%
  summarize(value = sum(value))

# Summarize extra columns
dta %>%
  group_by(date) %>%
  summarize(value = sum(value),
            month = first(month),
            year = first(year))

# Explicit join
dta %>%
  group_by(date) %>%
  summarize(value = sum(value)) %>%
  left_join(dta[c("year", "month", "date")], by = "date")
Olivia B-R,
That is a great issue and a great example. Thanks for bringing it up (and your solutions).
To state it back: suppose you are grouping by US zip code and you also have a US state column. Under typical `group_by(zip_code) %>% summarize(age = mean(age))` type work you lose the (potentially useful) state column.
In `SQL` if you try to include such columns you usually get an error message of the nature “ERROR: column ‘d.state’ must appear in the GROUP BY clause or be used in an aggregate function.”
The solutions, as you state, are usually one of:
- Add the column to the "group by" to "overkey."
- Add the column to the query with a useless aggregator (like min()). We teach this one and suggest adding the comment "pseudo aggregator" to the code for readability.
- Join the columns back in from a fact table later.
As I said, our favorite solution is the pseudo-aggregator.
library("DBI")
library("RPostgreSQL")
suppressPackageStartupMessages(library("dplyr"))

conn <- dbConnect(dbDriver("PostgreSQL"),
                  host = 'localhost', port = 5432,
                  user = 'postgres', password = 'pg')
dbWriteTable(conn, 'd',
             data.frame(state = 'CA', zipcode = 94207, age = 40,
                        stringsAsFactors = FALSE),
             temporary = TRUE, overwrite = TRUE)
#> [1] TRUE

DBI::dbGetQuery(conn, "
  select zipcode,
         min(state) state, -- pseudo aggregator, column is a function of keys
         avg(age)
  from d
  group by zipcode")
#>   zipcode state avg
#> 1   94207    CA  40

d <- dplyr::tbl(conn, 'd')

# loses state
d %>%
  group_by(zipcode) %>%
  summarize(meanage = mean(age))
#> # Source:   lazy query [?? x 2]
#> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
#>   zipcode meanage
#>     <dbl>   <dbl>
#> 1   94207      40

# over keying
d %>%
  group_by(zipcode, state) %>%
  summarize(meanage = mean(age))
#> # Source:   lazy query [?? x 3]
#> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
#> # Groups:   zipcode
#>   zipcode state meanage
#>     <dbl> <chr>   <dbl>
#> 1   94207    CA      40

# pseudo-aggregator
d %>%
  group_by(zipcode) %>%
  summarize(meanage = mean(age),
            state = min(state) # pseudo aggregator, column is a function of keys
  )
#> # Source:   lazy query [?? x 3]
#> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
#>   zipcode meanage state
#>     <dbl>   <dbl> <chr>
#> 1   94207      40    CA

dbExecute(conn, 'drop table d')
#> [1] 0

Hi John,
Thank you for your response. The zip code example is a more natural one (well, other than for a handful of zips, see https://gis.stackexchange.com/a/167333).
My preferred method is the over-keying one. Could you explain why you prefer, and teach, the pseudo-aggregator instead?
I generally think of the group_by dimensions as the keys/coordinates of the data, and the summarize ones as my measurements. In that framework, I think it makes more sense to over-key, although I guess there is an argument that the first key is sufficient and that the state is a “measurement” done at the zipcode level.
My ideal solution would be to encode that key hierarchy in the data itself. I however cannot think of a nice design to allow the programmer to easily express this notion. I have been playing with the idea of using information in the join plan, we’ll see where that goes. I am very interested in other people’s thoughts on this.
Best,
Olivia
I should have figured there would be mixed-state zip codes (argh). It is just that non-time-series examples are simpler for those who have not worked with time series.
Over-keying is a great solution. I tend to use pseudo-aggregation because I come from a SQL world, so I am often in the middle of the join, and the aggregation notation is often a bit more convenient than naming table-qualified columns (or, even worse, grouping by derived expressions).
I agree: more specification, with the code dealing with it, would be better. In Practical Data Science with R we didn't talk about pseudo-aggregators, but we did talk about column roles (keys, values) and had a concept of a "derived column." A system like you are talking about could handle such work automatically if we had such annotations.