# My Favorite data.table Feature

My favorite R data.table feature is the “`by`” grouping notation when combined with the `:=` notation.

Let’s take a look at this powerful notation.

First, let’s build an example `data.frame`.

```d <- wrapr::build_frame(
"group"  , "value" |
"a"    , 1L      |
"a"    , 2L      |
"b"    , 3L      |
"b"    , 4L      )

knitr::kable(d)
```
group value
a 1
a 2
b 3
b 4

The data is some sort of value with a grouping column telling us which rows are related.

With the `data.table``:=`,`by`” notation we can add the per-group totals into each row of the data as follows (the extra `[]` at the end is just the command to also print the results in addition to adding the column in-place).

```library("data.table")
dt <- data.table::as.data.table(d)

dt[, group_sum := sum(value), by = "group"][]

#    group value group_sum
# 1:     a     1         3
# 2:     a     2         3
# 3:     b     3         7
# 4:     b     4         7
```

The “`by`” signals we are doing a per-group calculation, and the “`:=`” signals to land the results in the original `data.table`. This sort of window function is incredibly useful in computing things such as what fraction of a group’s mass is in each row. For example.

```# build a fresh copy as last command altered dt in place
dt <- data.table::as.data.table(d)

dt[, fraction := value/sum(value), by = "group"][]

#    group value  fraction
# 1:     a     1 0.3333333
# 2:     a     2 0.6666667
# 3:     b     3 0.4285714
# 4:     b     4 0.5714286
```

In base R (or in a more purely relational data system) the obvious way to solve this requires two steps: computing the per-group summaries and then joining them back into the original table rows. This can be done as follows.

```sums <- tapply(d\$value, d\$group, sum)
d\$fraction <- d\$value/sums[d\$group]
print(d)

#   group value  fraction
# 1     a     1 0.3333333
# 2     a     2 0.6666667
# 3     b     3 0.4285714
# 4     b     4 0.5714286
```

We called the transform a “window function”, as that is the name that SQL uses for the concept. The SQL code to perform this calculation would look like the following.

``` SELECT
group,
value,
value/sum(value) OVER (  PARTITION BY group ) AS fraction
FROM
d
```

And the popular package `dplyr` uses the following notation for the same problem.

```d %>%
group_by(group) %>%
mutate(fraction = value/sum(value)) %>%
ungroup()
```

And, as always, let’s end with some timings. For a 1000000 row table with 10 additional irrelevant columns, and group ids picked uniformly from 100000 symbols: we see the various solutions take the following times to complete the task.

```##            solution  milliseconds
##      datatable_soln           384
##  base_R_lookup_soln          1476
##          dplyr_soln          3988
```

All packages are the current CRAN releases as of 2019-06-29. Timings are sensitive to number of row, columns, and groups. Note the `data.table` time includes the time to convert to the `data.table` class.Details on the timings can be found here.

Categories: Opinion Tutorials

Tagged as: ### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 2 replies ›

1. clelandcm says:

I would have thought to use ave() for a base R solution:

```base_R_ave_soln <- function(d) {
d\$fraction <- with(d, ave(value, group, FUN = function(x) x / sum(x)))
d
}```

Seems to be slightly slower than the base R lookup solution.

Like

1. John Mount says:

Thanks for the note! `ave()` is a good “all in one” solution for merging per-group summaries in, and handles multiple grouping columns (or composite keys). But I wanted to show how one gets to a solution from basic ideas (compute summary, merge summary back in). I did not avoid it for speed reasons, and `print(ave)` is fairly interesting (though a bit hard to work out all the details, such as the assignment back into `slice()`).

Actually I just re-ran the timings with `ave()`. `ave()` is competitive with the merge solution.

Like