I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.

Roughly, the task was to add in some derived per-group aggregation columns to a few million row data set. In the application the groups tend to be small session logs from many users. So the groups are numerous and small.

We can create an abstract version of such data in R as follows.

```
```set.seed(2020)
n <- 1000000
mk_data <- function(n) {
d <- data.frame(x = rnorm(n))
d$g <- sprintf("level_%09g",
sample.int(n, size = n, replace = TRUE))
return(d)
}
d <- mk_data(n)

The sampling with replacement has an expected number of unique IDs in the ballpark of `n/log(n)`

via the coupon collector’s problem. So we expect lots of small groups in such data.

Our task can be specified in rquery/rqdatatable notation as follows.

```
```library(rqdatatable)
ops_rqdatatable <- local_td(d, name = 'd') %.>%
extend(.,
rn %:=% row_number(),
cs %:=% cumsum(x),
partitionby = 'g',
orderby = 'x') %.>%
order_rows(.,
c('g', 'x'))

The key step is the `extend()`

, which adds the new columns `rn`

and `cs`

in a per-`g`

group manner in a by-`x`

order. We feel the notation is learnable and expressive. (Note: normally we would use `:=`

for assignment, but as we are also running direct data.table examples we didn’t load this operator and instead used `%:=%`

to stay out of data.table’s way.)

We translated the same task in to several different notations: rquery, rqdatatable, data.table, dplyr, dtplyr, and data_algebra. The observed task times are given below.

Method | Interface Language | Data Engine | Mean run time in seconds |
---|---|---|---|

data.table | R | data.table | 2.1 |

rqdatatable | R | data.table | 3.8 |

rquery | R | SQLite | 6.6 |

dplyr | R | dplyr | 35.1 |

dtplyr | R | data.table | 5.1 |

dbplyr | R | SQLite | 4.5 |

data_algebra | Python | Pandas | 14.7 |

data_algebra | Python | SQLite | 9.6 |

What is missing is a direct Pandas timing (to confirm if the length of the Python run-time is from data_algebra overhead or from the underlying Pandas engine).

What stands out is how fast data.table, and even the data.table based methods, are compared to all other methods. Edit 1-3-2019, we have added some SQLite based database timings, which are also generally fast.

Details of the benchmark runs (methods, code, data, versions, and so on) can be found here.

Categories: data science Opinion Pragmatic Data Science Tutorials

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

If anyone wants to submit what they consider good idiomatic base-R or direct-Pandas code for the task I can add these to my timings. My guess is base-R is going to be slow unless ideas like these are used. For Pandas if one goes through the

`.transform()`

interface I expect timings not much faster than we saw here (but I could be wrong, I haven’t tried direct Pandas on this example).