Menu Home

My Favorite data.table Feature

My favorite R data.table feature is the “by” grouping notation when combined with the := notation.

Let’s take a look at this powerful notation.

First, let’s build an example data.frame.

d <- wrapr::build_frame(
   "group"  , "value" |
     "a"    , 1L      |
     "a"    , 2L      |
     "b"    , 3L      |
     "b"    , 4L      )

knitr::kable(d)
group value
a 1
a 2
b 3
b 4

The data is some sort of value with a grouping column telling us which rows are related.

With the data.table:=,by” notation we can add the per-group totals into each row of the data as follows (the extra [] at the end is just the command to also print the results in addition to adding the column in-place).

library("data.table")
dt <- data.table::as.data.table(d)

dt[, group_sum := sum(value), by = "group"][]

#    group value group_sum
# 1:     a     1         3
# 2:     a     2         3
# 3:     b     3         7
# 4:     b     4         7

The “by” signals we are doing a per-group calculation, and the “:=” signals to land the results in the original data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example.

# build a fresh copy as last command altered dt in place
dt <- data.table::as.data.table(d) 

dt[, fraction := value/sum(value), by = "group"][]

#    group value  fraction
# 1:     a     1 0.3333333
# 2:     a     2 0.6666667
# 3:     b     3 0.4285714
# 4:     b     4 0.5714286

In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.

sums <- tapply(d$value, d$group, sum)
d$fraction <- d$value/sums[d$group]
print(d)

#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286

We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.

 SELECT
  group,
  value,
  value/sum(value) OVER (  PARTITION BY group ) AS fraction
 FROM
   d

And the popular package dplyr uses the following notation for the same problem.

d %>% 
    group_by(group) %>%
    mutate(fraction = value/sum(value)) %>%
    ungroup()

And, as always, let’s end with some timings. For a 1000000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100000 symbols: we see the various solutions take the following times to complete the task.

##            solution  milliseconds 
##      datatable_soln           384
##  base_R_lookup_soln          1476
##          dplyr_soln          3988

All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to number of row, columns, and groups. Note the data.table time includes the time to convert to the data.table class.Details on the timings can be found here.

Categories: Opinion Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. I would have thought to use ave() for a base R solution:

    base_R_ave_soln <- function(d) {
      d$fraction <- with(d, ave(value, group, FUN = function(x) x / sum(x)))
      d
    }

    Seems to be slightly slower than the base R lookup solution.

    Like

    1. Thanks for the note! ave() is a good “all in one” solution for merging per-group summaries in, and handles multiple grouping columns (or composite keys). But I wanted to show how one gets to a solution from basic ideas (compute summary, merge summary back in). I did not avoid it for speed reasons, and print(ave) is fairly interesting (though a bit hard to work out all the details, such as the assignment back into slice()).

      Actually I just re-ran the timings with ave(). ave() is competitive with the merge solution.

      Like

%d bloggers like this: