The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.
The graph uses a log-log scale (so differences look very compressed), but data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below.
Notice that on the above semi-log plot the run time ratio grows roughly linearly, that is, roughly in proportion to the logarithm of the number of rows. This makes sense: data.table uses a radix sort, which has the potential to run in near linear time (faster than the n log(n) lower bound known for comparison sorting) for a range of problems. Also note we are only showing example sorting times, not worst-case sorting times.
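To make the radix sort idea concrete, here is a minimal base-R sketch of a least-significant-digit radix sort on non-negative integers. It is an illustration of the general technique only, not data.table's implementation; the function name radix_sort and the choice of base 256 are assumptions made for this example. Each pass is a stable counting sort on one digit, so the work is proportional to (number of digits) x (n + base), which is linear in n for keys of bounded width.

```r
# Illustrative LSD radix sort (hypothetical helper, not data.table's code).
# Sorts non-negative integers one base-256 "digit" at a time.
radix_sort <- function(v, base = 256L) {
  stopifnot(is.integer(v), !anyNA(v), all(v >= 0L))
  n <- length(v)
  if (n < 2L) return(v)
  max_v <- max(v)
  place <- 1  # kept as a double so place * base cannot overflow
  while (max_v %/% place > 0) {
    # Bucket index (1..base) of the current digit for every element.
    digit <- as.integer((v %/% place) %% base) + 1L
    counts <- tabulate(digit, nbins = base)
    # First output slot for each bucket.
    next_slot <- cumsum(counts) - counts + 1L
    out <- integer(n)
    # Scan in input order so that ties keep their relative order (stable).
    for (i in seq_len(n)) {
      b <- digit[i]
      out[next_slot[b]] <- v[i]
      next_slot[b] <- next_slot[b] + 1L
    }
    v <- out
    place <- place * base
  }
  v
}

print(radix_sort(c(170L, 45L, 75L, 90L, 802L, 24L, 2L, 66L)))
```

The explicit R loop is slow and is written for clarity. The point is that no element is ever compared with another element, which is why the n log(n) comparison-sort bound does not apply.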
In fact, if we divide the y values in the above graph by log(rows) we get something approaching a constant.
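The reasoning behind that normalization can be sketched with an idealized cost model. The constants c_comparison and c_radix below are hypothetical and are not measured timings; the sketch only shows that if one method costs about n log(n) and the other about n, then the ratio of run times divided by log(n) is a constant.

```r
# Idealized model, NOT measured data: all constants here are hypothetical.
rows <- 10^(2:8)
c_comparison <- 5e-8  # assumed cost constant for an n*log(n) sort
c_radix <- 2e-8       # assumed cost constant for a linear-time sort

t_comparison <- c_comparison * rows * log(rows)
t_radix <- c_radix * rows

ratio <- t_comparison / t_radix    # grows like log(rows)
normalized <- ratio / log(rows)    # constant: c_comparison / c_radix

print(data.frame(rows = rows, ratio = ratio, normalized = normalized))
```

Real timings also carry overheads and lower-order terms, so measured values would only approach a constant rather than equal one exactly.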
The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.
Performance like the above is one of the reasons you should strongly consider data.table for your R projects.
All details of the timings can be found here.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
People with a CS background who use R in production environments usually prefer data.table to dplyr. [edited]
Just a note: data.table's timings include converting into and out of data.table's format (this is so microbenchmark can repeat the operation, and it is part of checking for identical results). Obviously an actual data.table task would not do this (or would at most do it once, and not at each operation). So data.table is actually even faster in meaningful tasks than the single-stage test shown here suggests.
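A minimal sketch of that distinction, assuming the data.table and dplyr packages are installed. The column names, problem size, and helper function names are made up for illustration, and system.time stands in for microbenchmark to keep the example short; this is not the code behind the timings above.

```r
library(data.table)
library(dplyr)

set.seed(1)   # illustrative seed
n <- 100000L  # illustrative problem size
d <- data.frame(x = runif(n), id = seq_len(n))

# dplyr: sort a data.frame by column x.
sort_dplyr <- function(df) {
  as.data.frame(arrange(df, x))
}

# data.table, single-stage style: convert in, sort, convert out.
sort_dt_roundtrip <- function(df) {
  dt <- as.data.table(df)  # copies df into data.table format
  setorder(dt, x)          # sorts dt in place
  setDF(dt)                # converts back to a data.frame by reference
  dt
}

# Both approaches should produce the same ordering.
res_dplyr <- sort_dplyr(d)
res_dt <- sort_dt_roundtrip(d)
stopifnot(identical(res_dplyr$x, res_dt$x))

time_dplyr <- system.time(sort_dplyr(d))
time_roundtrip <- system.time(sort_dt_roundtrip(d))

# A longer data.table workflow would convert once and then sort in place.
dt <- as.data.table(d)
time_in_place <- system.time(setorder(dt, x))

print(c(dplyr = time_dplyr[["elapsed"]],
        dt_roundtrip = time_roundtrip[["elapsed"]],
        dt_in_place = time_in_place[["elapsed"]]))
```

The round-trip version corresponds to the single-stage test described in the note, while the in-place version is closer to how the sort would appear inside a multi-step data.table pipeline.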