My favorite R data.table feature is the “
by” grouping notation when combined with the
Let’s take a look at this powerful notation.
First, let’s build an example
d <- wrapr::build_frame( "group" , "value" | "a" , 1L | "a" , 2L | "b" , 3L | "b" , 4L ) knitr::kable(d)
The data is some sort of value with a grouping column telling us which rows are related.
by” notation we can add the per-group totals into each row of the data as follows (the extra
 at the end is just the command to also print the results in addition to adding the column in-place).
library("data.table") dt <- data.table::as.data.table(d) dt[, group_sum := sum(value), by = "group"] # group value group_sum # 1: a 1 3 # 2: a 2 3 # 3: b 3 7 # 4: b 4 7
by” signals we are doing a per-group calculation, and the “
:=” signals to land the results in the original
data.table. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example.
# build a fresh copy as last command altered dt in place dt <- data.table::as.data.table(d) dt[, fraction := value/sum(value), by = "group"] # group value fraction # 1: a 1 0.3333333 # 2: a 2 0.6666667 # 3: b 3 0.4285714 # 4: b 4 0.5714286
In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.
sums <- tapply(d$value, d$group, sum) d$fraction <- d$value/sums[d$group] print(d) # group value fraction # 1 a 1 0.3333333 # 2 a 2 0.6666667 # 3 b 3 0.4285714 # 4 b 4 0.5714286
We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.
SELECT group, value, value/sum(value) OVER ( PARTITION BY group ) AS fraction FROM d
And the popular package
dplyr uses the following notation for the same problem.
d %>% group_by(group) %>% mutate(fraction = value/sum(value)) %>% ungroup()
And, as always, let’s end with some timings. For a 1000000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100000 symbols: we see the various solutions take the following times to complete the task.
## solution milliseconds ## datatable_soln 384 ## base_R_lookup_soln 1476 ## dplyr_soln 3988
All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to number of row, columns, and groups. Note the
data.table time includes the time to convert to the
data.table class.Details on the timings can be found here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
I would have thought to use ave() for a base R solution:
Seems to be slightly slower than the base R lookup solution.
Thanks for the note!
ave()is a good “all in one” solution for merging per-group summaries in, and handles multiple grouping columns (or composite keys). But I wanted to show how one gets to a solution from basic ideas (compute summary, merge summary back in). I did not avoid it for speed reasons, and
print(ave)is fairly interesting (though a bit hard to work out all the details, such as the assignment back into
Actually I just re-ran the timings with
ave()is competitive with the merge solution.