In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with
We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is critical in planning code, so here we examine some common simple cases. It is often impractical to port large applications between different work-paradigms, so we use porting small tasks as approximate stand-ins for measuring porting whole systems.
We tend to work with medium size data (hundreds of columns and millions of rows in memory), so that is the scale we simulate and study.
We will time the above tasks both using base R, and
dplyr 0.8.1. We will actually perform the timings using the
tibble variation of
data.frames, as this (hopefully) should not affect the base-
R timings, and may potentially help
dplyr (and we want to time each system “used well”).
Below are the results in graphical form (images link through to the full code of the experiments).
We won’t go deep into the details of the timings, as we are just interested in relative performance, except to say:
- We vary number of rows or number of columns as appropriate for each experiment, but not both at the same time.
- All data structures were
tibble, and we tended to force a
gc()before task timing. We will say
tibbles are a roughly a sub-class of
data.frames, and many
Rusers are interested in the performance of
- Each measurement was repeated to try and get stable estimates and see varation. However, what we are trying to estimate is the cost of performing one of these
tasks for the first time. So if we had noticed a case where the first run was systematically different than repetitions we would have re-structured the code to simulate a first-run multiple times by re-freshing the data multiple times. This didn’t occur in this set of measurements, but we have used the method in related studies, so it was part of our measurement intent.
- All runs were with a current version of R 3.6.0, all packages up to date, on an un-loaded late 2014 Mac Mini with 8 GB of RAM. More details are in the linked materials.
- As we are running a great range of examples, all graphs are presented in a log-log scale. This is a legibility trade-off; log-log scale compresses (or obscures) variation, but this crushing is just about the only thing that makes anything but the large examples visible when plotting a large range of examples jointly. In a log-log plot: small changes in height can in fact be large effects.
- As always, these are initial, not terminal studies. If you feel a variation is important, feel free to copy and alter the code to run your own study.
Selecting the first column from a
First we look at the time to select the first column from a data frame, with time to complete the task plotted as a function of the number of columns in the task.
For both the base-
R bracket notation and for
dplyr::select() the time is growing with the number of columns. In both cases this is undesirable, but can be excused as it may not be a good design trade-off to maintain fast column lookup structures: column names may change and column names can repeat. However, the ranges of behavior differ: base-
R is slowing down for larger tasks, but is always fast.
dplyr::select() is taking many seconds on the larger tasks.
The ratio of the time for
dplyr::select() to complete the task over the time it takes base-
R to complete the task is given here.
The take-away being:
dplyr::select() is routinely thousands of times slower than
R itself, and the times are substantial for large data.
Altering the first column in a
As one would expect the time to alter a the first column in a
data.frame shows similar behavior to the time to select the first column.
(We apologize for swapping colors in the above graph, it is due to us using the alpha order as the color order.)
Selecting the first row from a
Selecting the first row from a
data.frame can be fast due to the row indexing structures. It may or may not be constant time as the indexing structures may be non-trivial keys (not just integers). Similarly, picking a row from a more general database requires a full scan, unless there are index keys.
However, selecting the first row by index should in fact be easy. Let’s look at the timings.
R does achieve constant time access to the first row when the row indices are simple integers.
dplyr::slice() shows time-growth, even though
dplyr::slice() is defined to work with indices (so has a good chance of being constant time). These timings are faster than the column examples, as both systems are designed to work with many rows.
We did not time
data.table, as it is known to be very fast both on trivial and non-trivial workloads (at all scales). Here are some shared
data.table timings. If you are working with data at even medium scale, we strongly recommend
If you are working in a data set size range where the above effects are present, and you are unsatisfied with how long your analyses are taking to run: you may want to look into timing both
dplyr and base-
R variations of small extracts of your work pattern to see what the trade-offs of using either methodology are. The point is: even if your code is currently taking a long time to run there may be an easy variation of it that is in fact fast.
R is not intrinsically slow in working with data.
If you are not seeing slow results the above issues can be safely ignored.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.