If your R or dplyr work is taking what you consider to be too long (seconds instead of instant, or minutes instead of seconds, or hours instead of minutes, or a day instead of an hour), then try data.table.
For some tasks data.table is routinely faster than alternatives at pretty much all scales (example timings here).
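To give a sense of what such a port looks like, here is a minimal sketch (the data, column names, and aggregation are illustrative, not taken from the timings linked above) of the same grouped summary written first with dplyr and then with data.table:

```r
library(dplyr)
library(data.table)

# illustrative data: 10 million rows, 100 groups
d <- data.frame(group = sample.int(100, 1e7, replace = TRUE),
                value = rnorm(1e7))

# dplyr version: mean of value per group
res_dplyr <- d %>%
  group_by(group) %>%
  summarize(mean_value = mean(value))

# data.table version of the same aggregation
dt <- as.data.table(d)
res_dt <- dt[, .(mean_value = mean(value)), by = group]
```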
If your project is large (millions of rows, hundreds of columns) you really should rent an Amazon EC2 r4.8xlarge (244 GiB RAM) machine for an hour for about $2.13 (quick setup instructions here) and experience speed at scale.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
If I have a need for a 244 GiB RAM machine, that means I have, say, 100-400 gigabytes of data? How do I get that much data up to an AWS machine? It might actually be faster to FedEx AWS an SD card or USB stick than to try to upload it to them over the net. Can I do that?
There are bigger instances (such as the r4.16xlarge, which is 488 GiB for about $4.26/hour). The usual way to deal with data is to stage it into an Amazon S3 bucket (which is fairly reasonably priced storage).
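If you want to do the staging from R itself, one option (a sketch only, assuming the cloudyr aws.s3 package is installed and AWS credentials are already configured in the environment; the bucket and file names are placeholders) looks like this:

```r
library(aws.s3)  # assumes AWS credentials are set in environment variables

# upload a local file to an S3 bucket (names are placeholders)
put_object(file   = "local_data.csv",
           object = "staged/local_data.csv",
           bucket = "my-project-bucket")
```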
However, at some point it does get hard to work within a single machine's memory, so you need out-of-core methods or cluster solutions. In particular, that is when you give Apache Spark a whirl. R can talk to Spark through sparklyr/dplyr, through sparklyr/rquery, or even SparkR/rquery.
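A minimal sketch of the sparklyr/dplyr route (using a local Spark instance and the built-in mtcars data purely for illustration; in practice master would point at your cluster and the data would already live there):

```r
library(sparklyr)
library(dplyr)

# connect to Spark (local mode here just for illustration)
sc <- spark_connect(master = "local")

# copy a small R data.frame to Spark and run ordinary dplyr verbs on it
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarize(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```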
Right now I've got the data on a local server: 64 GB RAM, 2x1 TB SSDs running RAID 0 (so effectively a 2 TB SSD), with data backed up to 8 TB of spinning-rust disk. I do most of the modelling in kdb (actually raw k), and most of the statistical analysis of the model results in R. Both kdb and R like lots of memory, and I am often memory constrained at 64 GB.
It's not the monthly cost of AWS S3 storage or the hourly cost of a big-memory instance that concerns me, it's the 36 to 60 hours that it would take to upload 400 GB over the net … there's got to be a better way to upload than the net.
The upload is indeed going to be a big problem, and kdb is a very good system.
My simulations unfortunately do not experience the upload time as they use synthetic data built at the server. But they do show the relative speed of several R solutions once one “gets into steady state” (has the data and machine in the same place).
My advice is mostly for R users who do not yet have a working large data strategy. Not trying to move people away from functioning workflows.
Hi John,
Thanks for your post!
I routinely push 120GB txt files through R. Came across your post and converted my dplyr wrangle to data.table.
GREAT performance boost!
That is great news! Thanks for sharing!
For everybody else:
In general I suggest first porting a small portion of your workflow to confirm whether the tasks you are performing are going to speed up. Definitely do not perform laborious porting without a pilot study to estimate the benefit. This is part of why I use synthetic examples in my benchmarking (that, plus client data tends to be larger and confidential). As a rule: if you are using any sort of grouped calculation, sorting, or window functions you may see a significant speedup with data.table. And the usual reminder: optimizing without profiling can be risky. You probably want to know where your time is going before porting.
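As a concrete pattern for such a pilot, here is a minimal sketch with synthetic data (the sizes, column names, and grouped summary are just illustrative): time the same step both ways before committing to a full port.

```r
library(dplyr)
library(data.table)

# synthetic pilot data (sizes are illustrative)
n <- 5e6
d <- data.frame(group = sample.int(1000, n, replace = TRUE),
                value = runif(n))
dt <- as.data.table(d)

# time the same grouped summary in dplyr and in data.table
system.time(
  d %>% group_by(group) %>% summarize(total = sum(value))
)
system.time(
  dt[, .(total = sum(value)), by = group]
)
```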
More benchmarks here!