My favorite R `data.table` feature is the “by” grouping notation when combined with the `:=` notation. Let’s take a look at this powerful notation. First, let’s build an example `data.frame`.
```r
d <- wrapr::build_frame(
   "group", "value" |
   "a"    , 1L      |
   "a"    , 2L      |
   "b"    , 3L      |
   "b"    , 4L      )

knitr::kable(d)
```
| group | value |
|-------|-------|
| a     | 1     |
| a     | 2     |
| b     | 3     |
| b     | 4     |
The data is some sort of value with a grouping column telling us which rows are related.
With the `data.table` “`:=`, `by`” notation we can add the per-group totals into each row of the data as follows (the extra `[]` at the end is just the command to also print the results in addition to adding the column in place).
library("data.table") dt <- data.table::as.data.table(d) dt[, group_sum := sum(value), by = "group"][] # group value group_sum # 1: a 1 3 # 2: a 2 3 # 3: b 3 7 # 4: b 4 7
The `by` signals we are doing a per-group calculation, and the `:=` signals to land the results in the original `data.table`. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example:
```r
# build a fresh copy, as the last command altered dt in place
dt <- data.table::as.data.table(d)
dt[, fraction := value/sum(value), by = "group"][]
#    group value  fraction
# 1:     a     1 0.3333333
# 2:     a     2 0.6666667
# 3:     b     3 0.4285714
# 4:     b     4 0.5714286
```
In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.
```r
sums <- tapply(d$value, d$group, sum)
d$fraction <- d$value/sums[d$group]

print(d)
#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286
```
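The same two-step idea can also be written as an explicit summarize-then-join, matching the relational phrasing above. A minimal sketch using `aggregate()` and `merge()` (this variant is not one of the timed solutions later in the article):

```r
# summarize per group, then join the summary back onto the original rows
sums2 <- aggregate(value ~ group, data = d, FUN = sum)
names(sums2)[2] <- "group_sum"

d2 <- merge(d, sums2, by = "group")
d2$fraction <- d2$value / d2$group_sum
```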
We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.
```sql
SELECT
   "group",   -- quoted, as GROUP is a reserved word in SQL
   value,
   value/SUM(value) OVER ( PARTITION BY "group" ) AS fraction
FROM
   d
```
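If you want to try this SQL from R, one route is an in-memory SQLite database. A minimal sketch assuming the `DBI` and `RSQLite` packages (and an `RSQLite` build whose bundled SQLite is recent enough to support window functions, which current releases are):

```r
library("DBI")

# copy d into an in-memory SQLite database and run the window query
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "d", d)
dbGetQuery(con, '
  SELECT
     "group",
     value,
     value/SUM(value) OVER ( PARTITION BY "group" ) AS fraction
  FROM d
')
dbDisconnect(con)
```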
And the popular package `dplyr` uses the following notation for the same problem.
```r
library("dplyr")

d %>%
  group_by(group) %>%
  mutate(fraction = value/sum(value)) %>%
  ungroup()
```
And, as always, let’s end with some timings. For a 1,000,000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100,000 symbols, the various solutions take the following times to complete the task.
```
## solution            milliseconds
## datatable_soln               384
## base_R_lookup_soln          1476
## dplyr_soln                  3988
```
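The benchmark data itself isn’t shown above. A minimal sketch of how such a test table might be built (the seed, symbol format, and column names here are assumptions; the actual timing code is in the linked details):

```r
set.seed(2019)  # assumed seed, for reproducibility of the sketch only

n_rows   <- 1000000
n_groups <- 100000

d_large <- data.frame(
  group = sample(sprintf("sym_%06d", seq_len(n_groups)), n_rows, replace = TRUE),
  value = rnorm(n_rows)
)

# 10 additional irrelevant columns
for (i in seq_len(10)) {
  d_large[[sprintf("irrel_%02d", i)]] <- rnorm(n_rows)
}
```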
All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to the number of rows, columns, and groups. Note the `data.table` time includes the time to convert to the `data.table` class. Details on the timings can be found here.
I would have thought to use `ave()` for a base R solution:
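A minimal sketch of an `ave()` version of the calculation, assuming the same `d` as above:

```r
# ave() applies FUN within each group and returns a vector of the original length
d$fraction <- d$value / ave(d$value, d$group, FUN = sum)
```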
Seems to be slightly slower than the base R lookup solution.
Thanks for the note! `ave()` is a good “all in one” solution for merging per-group summaries in, and it handles multiple grouping columns (or composite keys). But I wanted to show how one gets to a solution from basic ideas (compute the summary, merge the summary back in). I did not avoid it for speed reasons, and `print(ave)` is fairly interesting (though a bit hard to work out all the details, such as the assignment back into `split()`). Actually, I just re-ran the timings with `ave()`: `ave()` is competitive with the merge solution.
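For reference, here is what `print(ave)` shows (from base R; the exact printed form may vary slightly by R version):

```r
print(ave)
# function (x, ..., FUN = mean)
# {
#     if (missing(...))
#         x[] <- FUN(x)
#     else {
#         g <- interaction(...)
#         split(x, g) <- lapply(split(x, g), FUN)
#     }
#     x
# }
```

The `split(x, g) <- ...` line is the assignment-back step mentioned above: it uses the replacement form of `split()` to write the per-group results into the original vector positions.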