The graph uses a log-log scale (so differences look very compressed), but
data.table is routinely 7 times faster than
dplyr. The ratio of run times is shown below.
Notice that on the above semi-log plot the run time ratio grows roughly linearly. This makes sense:
data.table uses a radix sort, which can run in near-linear time (faster than the
n log(n) lower bound known for comparison sorting) for a range of problems (also, we are only showing example sorting times, not worst-case sorting times).
In fact, if we divide the
y value (the run time ratio) in the above graph by
log(rows), we get something approaching a constant.
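One way to see why dividing by log(rows) should flatten the curve is the following idealized model. It is an assumption made for illustration, not something measured in the timings: suppose the dplyr sort behaves like a comparison sort and the data.table sort behaves like a linear-time radix sort, with hypothetical constants a and b and n standing for the number of rows.

```latex
% Idealized model (assumption): a, b are hypothetical constants, n = rows.
\begin{align*}
T_{\text{dplyr}}(n) &\approx a \, n \log n, \\
T_{\text{data.table}}(n) &\approx b \, n, \\
\frac{T_{\text{dplyr}}(n)}{T_{\text{data.table}}(n)} &\approx \frac{a}{b} \log n, \\
\frac{1}{\log n} \cdot \frac{T_{\text{dplyr}}(n)}{T_{\text{data.table}}(n)} &\approx \frac{a}{b}.
\end{align*}
```

Under this model the ratio is a straight line when plotted against log(rows), and the ratio divided by log(rows) settles to the constant a/b, which matches the shape of both observations.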
The above is consistent with
data.table not only being faster than
dplyr, but also having a fundamentally different asymptotic running time.
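To make the asymptotic point concrete, here is a minimal sketch of a least-significant-digit radix sort. It is written in Python purely for illustration and is not data.table's implementation; the function name `radix_sort` and the `base` parameter are hypothetical, and the sketch assumes non-negative integer keys. Note that it never compares two keys to each other, which is why the n log(n) comparison-sort bound does not apply to it.

```python
def radix_sort(values, base=256):
    """Illustrative LSD radix sort for non-negative integers.

    Each pass distributes the keys into `base` buckets by one digit,
    so a pass costs O(n + base). The number of passes depends on the
    size of the largest key, not on the number of keys.
    """
    out = list(values)
    if not out:
        return out
    if min(out) < 0:
        raise ValueError("this sketch assumes non-negative integers")

    max_value = max(out)
    place = 1  # weight of the digit currently being examined
    while place <= max_value:
        buckets = [[] for _ in range(base)]
        for v in out:
            # Appending keeps each pass stable, which LSD radix sort requires.
            buckets[(v // place) % base].append(v)
        # Concatenate the buckets in digit order to form the next pass's input.
        out = [v for bucket in buckets for v in bucket]
        place *= base
    return out
```

For keys of bounded size the number of passes is fixed, so the total work grows roughly linearly with the number of rows, which is the kind of behavior the timings above are consistent with.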
Performance like the above is one of the reasons you should strongly consider
data.table for your
All details of the timings can be found here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.