Menu Home

Better Grouped Summaries in dplyr

For R dplyr users one of the promises of the new rlang/tidyeval system is an improved ability to program over dplyr itself. In particular to add new verbs that encapsulate previously compound steps into better self-documenting atomic steps.

Let’s take a look at this capability.

First let’s start dplyr.

suppressPackageStartupMessages(library("dplyr"))
packageVersion("dplyr")
## [1] '0.7.1.9000'

A dplyr pattern that I have seen used often is the "group_by() %>% mutate()" pattern. This historically has been shorthand for a "group_by() %>% summarize()" followed by a join(). It is easiest to show by example.

The following code:

mtcars %>% 
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg), 
            group_mean_disp = mean(disp)) %>% 
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
## # A tibble: 6 x 6
## # Groups:   cyl, gear [4]
##     cyl  gear   mpg  disp group_mean_mpg group_mean_disp
##   <dbl> <dbl> <dbl> <dbl>          <dbl>           <dbl>
## 1     6     4  21.0   160         19.750        163.8000
## 2     6     4  21.0   160         19.750        163.8000
## 3     4     4  22.8   108         26.925        102.6250
## 4     6     3  21.4   258         19.750        241.5000
## 5     8     3  18.7   360         15.050        357.6167
## 6     6     3  18.1   225         19.750        241.5000

is taken to be shorthand for:

mtcars %>% 
  group_by(cyl, gear) %>%
  summarize(group_mean_mpg = mean(mpg), 
            group_mean_disp = mean(disp)) %>% 
  left_join(mtcars, ., by = c('cyl', 'gear')) %>%
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
##   cyl gear  mpg disp group_mean_mpg group_mean_disp
## 1   6    4 21.0  160         19.750        163.8000
## 2   6    4 21.0  160         19.750        163.8000
## 3   4    4 22.8  108         26.925        102.6250
## 4   6    3 21.4  258         19.750        241.5000
## 5   8    3 18.7  360         15.050        357.6167
## 6   6    3 18.1  225         19.750        241.5000

The advantages of the shorthand are:

  • The analyst only has to specify the grouping column once.
  • The data (mtcars) enters the pipeline only once.
  • The analyst doesn’t have to start thinking about joins immediately.

Frankly I’ve never liked the shorthand. I feel it is a "magic extra" that a new user would have no way of anticipating from common use of group_by() and summarize(). I very much like the idea of wrapping this important common use case into a single verb. Adjoining "windowed" or group-calculated columns is a common and important step in analysis, and well worth having its own verb.

Below is our attempt at elevating this pattern into a packaged verb.

#' Simulate the group_by/mutate pattern 
#' with an explicit summarize and join.
#' 
#' Group a data frame by the groupingVars
#' and compute user summaries on this data
#' frame (user summaries specified in ...),
#' then join these new columns back into
#' the original data and return to the
#' user.  It is a demonstration of a
#' higher-order dplyr verb.
#' 
#' Author: John Mount, Win-Vector LLC.
#' 
#' @param d data.frame
#' @param groupingVars character vector of column names to group by.
#' @param ... dplyr::summarize commands.
#' @return d with grouped summaries added as extra columns
#' 
#' @examples
#' 
#' add_group_summaries(mtcars, 
#'                     c("cyl", "gear"), 
#'                     group_mean_mpg = mean(mpg), 
#'                     group_mean_disp = mean(disp)) %>%
#'   head()
#' 
#' @export
#' 
add_group_summaries <- function(d, 
                                groupingVars, 
                                ...) {
  # convert char vector into spliceable vector
  groupingSyms <- rlang::syms(groupingVars)
  d <- ungroup(d) # just in case
  dg <- group_by(d, !!!groupingSyms)
  ds <- summarize(dg, ...)
  # work around https://github.com/tidyverse/dplyr/issues/2963
  ds <- ungroup(ds)
  left_join(d, ds, by= groupingVars)
}

This works as follows:

mtcars %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
##   cyl gear  mpg disp group_mean_mpg group_mean_disp
## 1   6    4 21.0  160         19.750        163.8000
## 2   6    4 21.0  160         19.750        163.8000
## 3   4    4 22.8  108         26.925        102.6250
## 4   6    3 21.4  258         19.750        241.5000
## 5   8    3 18.7  360         15.050        357.6167
## 6   6    3 18.1  225         19.750        241.5000

And this also works on SQLite-backed dplyr data (which the shorthand currently does not, please see dplyr 2887 issue and dplyr issue 2960).

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>% 
  group_by(cyl, gear) %>%
  mutate(group_mean_mpg = mean(mpg), 
         group_mean_disp = mean(disp))
## Error: Window function `avg()` is not supported by this database
mtcars2 %>% 
  add_group_summaries(c("cyl", "gear"), 
                      group_mean_mpg = mean(mpg), 
                      group_mean_disp = mean(disp)) %>%
  select(cyl, gear, mpg, disp, 
         group_mean_mpg, group_mean_disp) %>%
  head()
## # Source:   lazy query [?? x 6]
## # Database: sqlite 3.11.1 [:memory:]
##     cyl  gear   mpg  disp group_mean_mpg group_mean_disp
##   <dbl> <dbl> <dbl> <dbl>          <dbl>           <dbl>
## 1     6     4  21.0   160         19.750        163.8000
## 2     6     4  21.0   160         19.750        163.8000
## 3     4     4  22.8   108         26.925        102.6250
## 4     6     3  21.4   258         19.750        241.5000
## 5     8     3  18.7   360         15.050        357.6167
## 6     6     3  18.1   225         19.750        241.5000

The above, and many more useful dplyr standard evaluation adapters are now all part of the new package seplyr (on CRAN and GitHub).

Categories: Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

10 replies

  1. It took me a while to discover what group_by with mutate would do (no collapsing) and somehow found it unintuitive in an otherwise very intuitive package. By comparison, in Stata the equivalent:
    by cyl gear: egen group_mean_mpg = mean(mpg)
    does what you’d expect without collapsing as summarize in dplyr does.

    1. I come from a SQL background, so I find the non-collapsing case to be the odd one. My gut always says: build a table with the right keying and join it back in.

      However, running dplyr::show_query() on the examples (this time using a PostgreSQL back end) is very illuminating. We see the summarize collapse triggered by the SQL group command, and the windowed calculation co-mingling with other results in the mutate example.

      suppressPackageStartupMessages(library("dplyr"))
      library('RPostgreSQL')
      #> Loading required package: DBI
      
      # Local Postgres.app database; no password by default
      # Of course, you fill in your own database information here.
      con = DBI::dbConnect(DBI::dbDriver("PostgreSQL"),
                           host = 'localhost',
                           port = 5432,
                           user = 'postgres',
                           password = 'pg')
      
      copy_to(con, 
              mtcars %>% mutate(id = row_number()), 
              'mtcars')
      mtcars2 <- tbl(con, "mtcars")
      
      mtcars2 %>%
        select(id, mpg, cyl) %>%
        group_by(cyl) %>%
        mutate(mpg_min = min(mpg)) %>% 
        show_query()
      #> <SQL>
      #> SELECT "id", "mpg", "cyl", min("mpg") OVER (PARTITION BY "cyl") AS "mpg_min"
      #> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
      #> FROM "mtcars") "kkztfjtpuw"
      
      mtcars2 %>%
        select(id, mpg, cyl) %>%
        group_by(cyl) %>%
        summarize(mpg_min = min(mpg)) %>% 
        show_query()
      #> <SQL>
      #> SELECT "cyl", MIN("mpg") AS "mpg_min"
      #> FROM (SELECT "id" AS "id", "mpg" AS "mpg", "cyl" AS "cyl"
      #> FROM "mtcars") "pdtcvcoqju"
      #> GROUP BY "cyl"
      
  2. # group_by() also works with mutate() eliminating the need for summarise() and join()

    mtcars %>%
    group_by(cyl, gear) %>%
    mutate(group_mean_mpg = mean(mpg),
    group_mean_disp = mean(disp)) %>%
    head()

    1. Yes, that is what I hoped to indicate with “The analyst doesn’t have to start thinking about joins immediately”.

      And the shorthand works, until it does not.

      suppressPackageStartupMessages(library("dplyr"))
      con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
      copy_to(con, mtcars)
      mtcars2 <- tbl(con, "mtcars")
      
      mtcars2 %>%
        group_by(cyl, gear) %>%
        mutate(group_mean_mpg = mean(mpg),
               group_mean_disp = mean(disp)) %>%
        head()
      #> Error: Window function `avg()` is not supported by this database
      
  3. A similar pattern that I find myself often using to avoid a join is the group_by() %>% summarize(), where the grouping is over-specified. An example of this is a time series with columns year, month, date, value, (multiple values per day) and the analyst wants to summarize by date.

    A simple group_by(date) %>% summarize(day_value = sum(value)) will drop columns that we want to retain (year, month), but can re-derive (or join). To avoid the hassle, there are multiple options, all sub-optimal:

    “`
    #Over-specify grouping
    #Over-specify grouping
    dta %>%
    group_by(year, month, date) %>%
    summarize(value = sum(value))

    #Summarize extra columns
    dta %>%
    group_by(date) %>%
    summarize(value = sum(value),
    month = first(month),
    year = first(year))

    #Explicit join
    dta %>%
    group_by(date) %>%
    summarize(value = sum(value)) %>%
    left_join(dta[c(“year”, “month”, “date”)], by = “date”)
    “`

    1. Olivia B-R,

      That is a great issue and a great example. Thanks for bringing it up (and your solutions).

      To state it back: suppose are grouping by US zipcode and you have also a US State column. Under typical `group_by(zip_code) %>% summarize(age = mean(age))` type work you lose the (potentially useful) state column.

      In `SQL` if you try to include such columns you usually get an error message of the nature “ERROR: column ‘d.state’ must appear in the GROUP BY clause or be used in an aggregate function.”

      The solutions, as you state are usually one of:

      Add the column to the “group by” to “overkey.”
      Add the column to the query with a useless aggregator (like ‘min()’). We teach this one and suggest adding the comment “pseudo aggregator” to the code for readability.
      Joining the columns back in from a fact table later.

      As I said our favorite solution is the pseudo-aggegator.

      library("DBI")
      library("RPostgreSQL")
      suppressPackageStartupMessages(library("dplyr"))
      
      conn <- dbConnect(dbDriver("PostgreSQL"),
                        host = 'localhost',
                        port = 5432,
                        user = 'postgres',
                        password = 'pg')
      
      dbWriteTable(conn, 'd', 
                   data.frame(state='CA', zipcode=94207, age=40,
                              stringsAsFactors = FALSE),
                   temporary=TRUE, overwite=TRUE)
      #> [1] TRUE
      DBI::dbGetQuery(conn, "select 
                                zipcode, 
                                min(state) state, -- pseudo aggregator, column is a function of keys
                                avg(age) 
                             from 
                                d 
                             group by
                                zipcode")
      #>   zipcode state avg
      #> 1   94207    CA  40
      d <- dplyr::tbl(conn, 'd')
      
      # loses state
      d %>% 
        group_by(zipcode) %>% 
        summarize(meanage = mean(age))
      #> # Source:   lazy query [?? x 2]
      #> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
      #>   zipcode meanage
      #>     <dbl>   <dbl>
      #> 1   94207      40
      
      # over keying
      d %>% 
        group_by(zipcode, state) %>% 
        summarize(meanage = mean(age))
      #> # Source:   lazy query [?? x 3]
      #> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
      #> # Groups:   zipcode
      #>   zipcode state meanage
      #>     <dbl> <chr>   <dbl>
      #> 1   94207    CA      40
      
      # pseudo-aggregator
      d %>% 
        group_by(zipcode) %>% 
        summarize(meanage = mean(age), 
                  state= min(state) #  pseudo aggregator, column is a function of keys
        )
      #> # Source:   lazy query [?? x 3]
      #> # Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
      #>   zipcode meanage state
      #>     <dbl>   <dbl> <chr>
      #> 1   94207      40    CA
      
      dbExecute(conn, 'drop table d')
      #> [1] 0
      
      1. Hi John,

        Thank you for your response. The zip code example is a more natural one (well, other than for a handful of zips, see https://gis.stackexchange.com/a/167333).

        My prefered method is the over-keying one. Could you explain why you prefer, and teach, the pseudo-aggregator instead?

        I generally think of the group_by dimensions as the keys/coordinates of the data, and the summarize ones as my measurements. In that framework, I think it makes more sense to over-key, although I guess there is an argument that the first key is sufficient and that the state is a “measurement” done at the zipcode level.

        My ideal solution would be to encode that key hierarchy in the data itself. I however cannot think of a nice design to allow the programmer to easily express this notion. I have been playing with the idea of using information in the join plan, we’ll see where that goes. I am very interested in other people’s thoughts on this.

        Best,
        Olivia

      2. I should have figured there would be mixed-state zip codes (argh). It is just non-time series examples are simpler for those who have not worked with time series.

        Over-keying is a great solution. I tend to use pseudo-aggergation because I come from a SQL world, so I am often in the middle of the join and the aggregation notation is often a bit more convenient than naming table-qualified columns (or even worse grouping by derived expressions).

        I agree- more specification and having the code deal with it would be better. In Practical Data Science with R we didn’t talk about pseudo-aggregators, but did talk about column roles (keys, values) and had a concept of a “derived column.” A system like you are talking about could deal with work automatically if we had such annotations.

%d bloggers like this: