# Timing Grouped Mean Calculation in R

This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement.

The original published timings were as follows:

With performance metrics: measurements are marketing. So let’s dig into the above a bit.

These timings are of the small-task, large-number-of-repetitions breed that Matt Dowle writes against. So at first they wouldn’t seem that decisive. Except, look at the following:

• At the time of our reading the example and methods were not shared. To reproduce the work we will need to make our own example (which we share here).
• The timings are not relative to any other package or system (base-`R`, `data.table`, and `Pandas` being three obvious choices), so we may have trouble valuing the results.
• The time reported for `dplyr` on the `sum()/n()` examples is over a second to process 10,000 rows. This is unbelievably slow, and we fail to reproduce it in our run.
• The time reported for `dplyr` on the `mean()` examples is about 0.01 seconds. This is a plausible time for this task (about 3 times as long as `data.table` would take), but it is much faster than is typical for `dplyr`. We fail to reproduce it in our run: we see `dplyr` taking closer to 0.06 seconds on this task (or about five times slower).
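
Since the original example was not shared, any reproduction has to start from made-up data. The following is a minimal sketch, not the benchmark code itself: the column names `g` and `x`, the group count of 1,000, and the helper names `mean_direct()` and `mean_by_sum_n()` are all assumptions for illustration. It builds 10,000 rows and computes the grouped mean two ways, mirroring the `mean()` and `sum()/n()` formulations discussed above.

```r
# Hypothetical example data: 10,000 rows, 1,000 groups (the group count is an assumption).
set.seed(2018)
n_rows <- 10000
n_groups <- 1000
d <- data.frame(
  g = sample.int(n_groups, n_rows, replace = TRUE),
  x = runif(n_rows)
)

# Grouped mean via mean() directly.
mean_direct <- function(d) {
  tapply(d$x, d$g, mean)
}

# Grouped mean via sum divided by count, the base-R analogue of sum(x)/n().
mean_by_sum_n <- function(d) {
  tapply(d$x, d$g, sum) / tapply(d$x, d$g, length)
}

res_direct <- mean_direct(d)
res_sum_n <- mean_by_sum_n(d)

# The two formulations should agree up to floating point error.
stopifnot(isTRUE(all.equal(res_direct, res_sum_n)))

# The same two calculations in dplyr, run only if the package is installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  r1 <- d %>% group_by(g) %>% summarise(x = mean(x))
  r2 <- d %>% group_by(g) %>% summarise(x = sum(x) / n())
  stopifnot(isTRUE(all.equal(r1$x, r2$x)))
}
```

Both formulations return the same numbers, so any large timing gap between them reflects how the package evaluates the expression, not the work being asked for.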

Let’s try to reproduce these timings on a 2018 Dell XPS 13 (Intel Core i5, 16GB RAM) running Ubuntu 18.04, and also compare to some other packages: `data.table` and `rqdatatable`.
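
As a rough illustration of what such a comparison can look like, here is a small timing sketch. It is not the benchmark code used for the numbers in this note: the data sizes, the column names, and the helper `time_per_call()` are assumptions, and `rqdatatable` is left out to keep the sketch short. Each package is timed only if it is installed.

```r
# Hypothetical example data; sizes and column names are assumptions.
set.seed(2018)
d <- data.frame(
  g = sample.int(1000, 10000, replace = TRUE),
  x = runif(10000)
)

# Average elapsed seconds per call of f(), over `reps` repetitions.
time_per_call <- function(f, reps = 20) {
  elapsed <- system.time(for (i in seq_len(reps)) f())[["elapsed"]]
  elapsed / reps
}

# Base R is always available, so it serves as the reference point.
timings <- c(base_tapply = time_per_call(function() tapply(d$x, d$g, mean)))

if (requireNamespace("dplyr", quietly = TRUE)) {
  timings[["dplyr"]] <- time_per_call(function() {
    dplyr::summarise(dplyr::group_by(d, g), x = mean(x))
  })
}

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  dt <- as.data.table(d)
  timings[["data.table"]] <- time_per_call(function() dt[, .(x = mean(x)), by = g])
}

print(timings)
```

Averaging over repetitions smooths out noise on a task this small, but the absolute numbers will vary with hardware and package versions.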

In this reproduction attempt we see:

• The `dplyr` `mean()` time is around 0.05 seconds. This is about 5 times slower than claimed.
• The `dplyr` `sum()/n()` time is about 0.2 seconds, about 5 times faster than claimed.
• The `data.table` time is around 0.004 seconds. This is about three times as fast as the `dplyr` claims, and over ten times as fast as the actual observed `dplyr` behavior.
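
The ratios quoted in this list follow from simple division. As a quick check, the arithmetic can be written out; the variable names are ours, and the values are the rounded figures from the text (the claimed `sum()/n()` time is only given as "over a second", so 1.0 is used here as a lower bound).

```r
# Rounded times in seconds, taken from the text above.
dplyr_mean_claimed   <- 0.01
dplyr_mean_observed  <- 0.05
dplyr_sum_n_claimed  <- 1.0   # "over a second", so this is a lower bound
dplyr_sum_n_observed <- 0.2
data_table_observed  <- 0.004

# How much slower the observed dplyr mean() time is than the claim.
mean_slowdown <- dplyr_mean_observed / dplyr_mean_claimed

# How much faster the observed dplyr sum()/n() time is than the claim.
sum_n_speedup <- dplyr_sum_n_claimed / dplyr_sum_n_observed

# data.table relative to the claimed and the observed dplyr mean() times.
dt_vs_claimed  <- dplyr_mean_claimed / data_table_observed
dt_vs_observed <- dplyr_mean_observed / data_table_observed

print(c(mean_slowdown, sum_n_speedup, dt_vs_claimed, dt_vs_observed))
```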

However, Matt Dowle is also right: comparing at this scale doesn’t tell half the story. When we try to summarize 10,000,000 rows down to 1,000,000, `data.table` still takes under half a second (a time not much worth arguing over), yet `dplyr` takes 10 to 24 seconds! We also see that `dplyr` is no faster than `base::tapply()` (despite many claims to the contrary).
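
To get a feel for the larger task, the base-`R` side can be sketched as follows. This is an illustration under assumptions, not the benchmark itself: we read "down to 1,000,000" as roughly 1,000,000 groups, the helper `make_example()` and the `shrink` factor are hypothetical, and the default is scaled down by 100 so the sketch runs quickly.

```r
# Build n_rows rows spread over n_groups groups; names and sizes are assumptions.
make_example <- function(n_rows, n_groups) {
  data.frame(
    g = sample.int(n_groups, n_rows, replace = TRUE),
    x = runif(n_rows)
  )
}

set.seed(2018)
# Scaled down by 100x so the sketch runs quickly; set shrink <- 1 for the
# 10,000,000 row case (this needs noticeably more memory and time).
shrink <- 100
d_big <- make_example(10000000 / shrink, 1000000 / shrink)

# Time a single grouped mean with base::tapply().
t_tapply <- system.time(res <- tapply(d_big$x, d_big$g, mean))[["elapsed"]]
print(t_tapply)
```

At full size a single run is long enough that repetition is unnecessary, which is exactly why the large case is the more informative comparison.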

All code for this benchmark is available here and here.

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.