The R package data.table is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.
The graph uses a log-log scale (so things are very compressed), but data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below.
Notice that on the above semi-log plot the run time ratio grows roughly linearly. This makes sense: data.table uses a radix sort, which has the potential to run in near linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems (also, we are only showing example sorting times, not worst-case sorting times).
In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.
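One way to see why this would happen, as a sketch under idealized cost models: suppose a comparison sort costs about a n log(n) and the radix sort about b n on n rows. The constants a and b are hypothetical and are not estimated from the timings.

```latex
% Assumed cost models; a and b are hypothetical constants, n is the number of rows.
\begin{align*}
T_{\text{dplyr}}(n) &\approx a \, n \log n, &
T_{\text{data.table}}(n) &\approx b \, n, \\
r(n) = \frac{T_{\text{dplyr}}(n)}{T_{\text{data.table}}(n)} &\approx \frac{a}{b} \log n, &
\frac{r(n)}{\log n} &\approx \frac{a}{b}.
\end{align*}
```

Under these assumptions the ratio r(n) is a straight line when plotted against log(n), and dividing it by log(n) leaves only the constant a/b.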
The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.
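To make the radix sort idea concrete, here is a toy least-significant-digit radix sort in base R. It is only an illustration of the technique, not data.table's implementation, and the function name radix_sort and its bits_per_pass parameter are made up for this sketch. It handles non-negative whole numbers only.

```r
# Toy LSD radix sort for non-negative whole numbers (illustration only).
# Each pass distributes the values into buckets by one base-2^bits digit,
# keeping the existing order inside each bucket (a stable pass).
radix_sort <- function(x, bits_per_pass = 8L) {
  stopifnot(is.numeric(x), !anyNA(x), all(x >= 0), all(x == floor(x)))
  if (length(x) == 0L) return(x)
  base <- 2^bits_per_pass
  max_x <- max(x)
  place <- 1  # value of the digit position handled in this pass
  while (max_x %/% place > 0) {
    digit <- (x %/% place) %% base
    # split() keeps elements in their current order within each bucket
    buckets <- split(x, factor(digit, levels = 0:(base - 1)))
    x <- unlist(buckets, use.names = FALSE)
    place <- place * base
  }
  x
}

set.seed(2018)
x <- sample.int(1000000L, 20L)
# Base R's sort() also offers a radix method; use it as a cross-check.
stopifnot(identical(radix_sort(x), sort(x, method = "radix")))
```

The number of passes depends on the size of the largest key, not on the number of rows, and no pass compares two elements to each other. That is how a radix sort can sidestep the n log(n) bound for keys of bounded size.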
Performance like the above is one of the reasons you should strongly consider data.table for your R projects.
All details of the timings can be found here.
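For readers who want to try something similar, here is a minimal sketch of how such a comparison could be set up. It is not the timing code linked above: the data, the column names col_a and col_b, the row count, and the helper functions are all assumptions made for illustration.

```r
library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(2018)
n <- 100000  # illustrative size only, not one of the article's problem sizes
d <- data.frame(
  col_a = sample.int(1000, n, replace = TRUE),
  col_b = runif(n)
)

# Sort with dplyr.
sort_dplyr <- function(df) {
  arrange(df, col_a, col_b)
}

# Sort with data.table, converting in and out so that both functions
# take and return a plain data.frame.
sort_data_table <- function(df) {
  dt <- as.data.table(df)     # copies df into data.table format
  setorder(dt, col_a, col_b)  # sorts dt in place
  as.data.frame(dt)
}

# Confirm both approaches agree before timing them.
stopifnot(isTRUE(all.equal(sort_dplyr(d), sort_data_table(d),
                           check.attributes = FALSE)))

timings <- microbenchmark(
  dplyr = sort_dplyr(d),
  data_table = sort_data_table(d),
  times = 10L
)
print(timings)
```

Repeating this over a range of values of n and plotting the ratio of median times against log(n) would give a plot of the same kind as the ones discussed above. The numbers will vary by machine and package version.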
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
People with a CS background who use R in production environments usually prefer data.table to dplyr. [edited]
Just a note: data.table's timings include converting into and out of data.table's format (this is so microbenchmark can repeat the operation, and it is part of checking for identical results). Obviously an actual data.table task would not do this (or at most would do it once, and not at each operation). So data.table is actually even faster in meaningful tasks than depicted by the single-stage test shown here.
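As a rough sketch of what "convert once" could look like in practice (the data and the column names col_a and col_b are invented for illustration):

```r
library(data.table)

set.seed(2018)
d <- data.frame(
  col_a = sample.int(1000, 100000, replace = TRUE),
  col_b = runif(100000)
)

# Convert once, up front. as.data.table() makes a copy and leaves d untouched.
dt <- as.data.table(d)

# Every later step stays in data.table format, so no conversion cost recurs.
setorder(dt, col_a, col_b)                      # sort in place
summary_dt <- dt[, .(mean_b = mean(col_b)), by = col_a]

# Convert back only at the very end, and only if a plain data.frame is needed.
summary_df <- as.data.frame(summary_dt)
```

Here the conversion cost is paid a single time, while the sort and the grouped summary both run on the data.table directly.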