Menu Home

data.table is Really Good at Sorting

The data.table R package is really good at sorting. Below is a comparison of it versus dplyr for a range of problem sizes.

Present 2

The graph is using a log-log scale (so things are very compressed). But data.table is routinely 7 times faster than dplyr. The ratio of run times is shown below.

Present 3

Notice on the above semi-log plot the run time ratio is growing roughly linearly. This makes sense: data.table uses a radix sort which has the potential to perform in near linear time (faster than the n log(n) lower bound known comparison sorting) for a range of problems (also we are only showing example sorting times, not worst-case sorting times).

In fact, if we divide the y in the above graph by log(rows) we get something approaching a constant.

Present 4

The above is consistent with data.table not only being faster than dplyr, but also having a fundamentally different asymptotic running time.

Performance like the above is one of the reasons you should strongly consider data.table for your R projects.

All details of the timings can be found here.

Categories: Opinion

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

2 replies

  1. Just a note: data.table‘s timings include converting into and out of data.table‘s format (this is so microbenchmark can repeat the operation and part of checking for identical results). Obviously an actual data.table task would not do this (or at most do it once, and not at each operation). So data.table is actually even faster in meaningful tasks than depicted by the single stage test depicted here.

%d bloggers like this: