According to a KDD poll, fewer respondents (by rate) used only R in 2017 than in 2016. At the same time, more respondents (by rate) used only Python in 2017 than in 2016.
Let’s take this as an excuse to take a quick look at what happens when we try a task in both systems.
For our task we picked the painful exercise of directly reading a 50,000,000 row by 50 column data set into memory on a machine with only 8 GB of RAM.
In Python the Pandas package takes around 6 minutes to read the data, and then one is ready to work. In R both utils::read.csv() and readr::read_csv() fail with out of memory messages. So if your view of R is “base R only”, or “base R plus tidyverse only”, or “tidyverse only”: reading this file is a “hard task.”
With the above narrow view one would have no choice but to move to Python if one wants to get the job done. Or, we could remember data.table. While data.table is obviously not part of the tidyverse, data.table has been a best practice in R for around 12 years. It can read the data and be ready to work in R in under a minute.
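For the curious, a minimal sketch of the data.table read (the file name here is a stand-in for the actual data set, not a file from the post):

library(data.table)
# stand-in path for the 50,000,000 row by 50 column file described above
d <- fread("big_file.csv")
dim(d)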
In conclusion, to get things done in a pinch: learn Python or learn data.table. And, in my opinion, “tidyverse first teaching” (commonly code for “tidyverse only teaching”) may not serve the R community well in the long run.
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Related note: R on track to have fewer new packages in 2018 than in 2017.
First off, it’s nice to see that there are still people flying the banner of data.table in the R community. Thanks!
Since pandas is similar to the tidyverse in terms of CPU and memory (in)efficiency, it would be nice to see how data.table for Python fares in your test. Would that be possible?
I love what you are doing here :) When I started with R, I was confused like hell because the course on Coursera (Data Science with R specialization) gave a basic intro to R and then introduced dplyr. In my opinion it is a bad idea to teach R this way. dplyr’s syntax is so far away from R that trying to learn both at the same time leads to severe confusion. I also think it is better to learn base R first to really understand the ways data can be manipulated. When I did the dplyr stuff, many of the transformations were pure magic of which I had no understanding. When I do stuff in base R, it might be more verbose and sometimes difficult to do, but at least I know exactly what is going on.
I really like the dplyr syntax (with the caveat of using explicit dot notation). I just think one must teach R when teaching R, and the tidyverse isn’t the universe.
We had an eye-opening experience teaching R to scientists. We got them up and working with R and solving their science/statistics problems with some skill. We started with base R as, when handled correctly, it is quite teachable, very comprehensible, and easy to reason about (little magic).
Then on day 2 we tried to introduce dplyr to reduce some pain points, and the audience didn’t connect at all. They thought the base solutions were good enough and, since they were not interested in programming as a primary topic (a legitimate position), didn’t think the benefits were worth touching the related material twice. We were actually surprised by the reaction. Every programming audience blitzes through the base-R material (as it is familiar) and then likes the dplyr material.
Sounds familiar!
I learned and used Matlab as my first language for quite some time before I switched to R after leaving academia. Early in my R journey I actually disliked selecting columns in a data frame by name, since I was used to selecting them by their indices. Similarly, I still kind of prefer to select rows in a df by explicitly computing the relevant indices. Frequently I find myself writing code like this:
idcs = mydf$colx == some_value  # some_value is a stand-in for whatever condition applied
result = doSomething(mydf[idcs, c("colx", "coly")])
For me, this is a much more natural way to think about data, i.e. as sub-blocks in a data frame. I don’t like to just filter(). Somewhat related, I use pyspark from time to time and find it cumbersome to emulate the things I do in base-R and data.table because it does not have explicit row indices. Maybe this is what I don’t like about dplyr, that it (as far as I know it) abstracts too much away so I lose the feeling for the data.
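To make the contrast concrete, here is a rough sketch of the same selection in both styles (tiny invented data; mydf, colx, coly and some_value mirror the placeholders above):

library(dplyr)

# a small stand-in for the commenter's mydf
mydf <- data.frame(colx = c(1, 2, 2, 3), coly = c("a", "b", "c", "d"))
some_value <- 2

idcs <- mydf$colx == some_value                  # explicit logical index vector
result_base <- mydf[idcs, c("colx", "coly")]     # base R: subset by computed indices

result_dplyr <- mydf %>%                         # dplyr: the indices stay implicit
  filter(colx == some_value) %>%
  select(colx, coly)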
I totally agree with you here. Most programmers agree it’s simpler to reason about data stored in tables (in most cases). But some tidyverse teachers skip to “the good stuff” way too soon.
filter makes more sense when users know about logical indexing. mutate will only make sense when users understand vector manipulation. Because there’s no function like mutate_subset, which would transform only some rows of a table, users must combine ideas of indexing (logical or otherwise) and manipulation.
Yes, sometimes data.frame-manipulating code written with base R only can look arcane, and that is a bad thing. But I’ve seen people create elaborate dplyr chains when a base R function slipped into a mutate would’ve done the trick.
I’ve seen sad posts on Stack Overflow and RStudio Community of the form “I got this working in base, how do I translate it into dplyr?” I presume this is to avoid the “why didn’t you do it in dplyr?” criticism that rains down fairly freely.
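For example, a small sketch of transforming only some rows (toy invented data; dplyr has no mutate_subset, so the condition has to move inside mutate()):

library(dplyr)

d <- data.frame(x = c(1, 5, 10), y = c(2, 4, 6))

# base R: logical indexing plus assignment touches only the selected rows
sel <- d$x > 3
d$y[sel] <- d$y[sel] * 100

# dplyr: the row condition goes inside mutate(), e.g. via ifelse()
d2 <- data.frame(x = c(1, 5, 10), y = c(2, 4, 6)) %>%
  mutate(y = ifelse(x > 3, y * 100, y))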
https://h2oai.github.io/db-benchmark/
I guess you haven’t seen these stats?
I came here to see a comparison in performance between R and Python, but unfortunately the whole post just seems to be crafted to make a point against the tidyverse.
Firstly, the fact that data.table solves this tedious task quickly and base R or the tidyverse does not is exactly the reason why there are different packages. data.table is very good performance-wise but is considered less user-friendly, and hence it is less popular. You can use whatever you want based on your use case.
Secondly, the task seems to be designed to be difficult, but it is not a real-life example: in which circumstances would you have a 50 million row x 50 column CSV? What software would create this as one file (and not fail) rather than split it into a few files? How would it be transferred? Why not use a more efficient storage or compression standard, like Parquet or gzip? Why not skip loading the file altogether and connect R/Python to the database directly? And finally, if you work with data sets of this size, why would you have only 8 GB of RAM? Why not use a virtual machine that can scale?
It was an actual task (confidential data shared as a “.csv” by a client when we didn’t happen to have a good machine at hand) that came up in the last week. It isn’t what is now considered a big file (around 8GB uncompressed), and it is the kind of work typically transferred (compressed) on systems like Box or on thumb drives. It isn’t a hard problem. The data is not in fact hard to create or hard to ingest.
Of course there are ways to solve it (database, bigger machine, out-of-core tools), but the idea was to try and convey what might happen at first glance if a “tidyverse only R” user and a “Pandas Python” user tried the same task in similar circumstances. The answer is: the “tidyverse only” R user would be explaining that this is a hard problem and spinning up a database or virtual machine, while the Python user would have loaded the data and likely finished the task. The outcome is that the client might not engage additional R consultants, given the (false) impression that R could not handle the task with the same small tools.
It is possible that bad performance has unreasonably lowered expectations of what a small machine can in fact easily do.
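As a rough back-of-the-envelope check (the per-field byte count below is an assumption, not a measurement from the post):

rows  <- 5e7   # 50,000,000 rows
cols  <- 50
bytes <- 3     # assumed average field width, delimiter included
rows * cols * bytes / 1e9   # roughly 7.5 GB of text, before any in-memory parsing overhead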
Or a follow-up question: why did the tidyverse organizers add readr to the tidyverse instead of using or adapting data.table? My impression is: it is for reasons of credit and control, and not for reasons of user experience.
User-friendliness is a very subjective thing to claim. data.table has a very consistent syntax, dt[i, j, by], while there are tons of functions to be learned to work with dplyr. The same things are typically done with less typing in dt than in dplyr, they run faster, and they use less memory. Yes, a few things need to be learned, but it pays off very quickly. It is like touch-typing for data manipulation.
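As a rough illustration of the typing comparison (a made-up grouped summary, not an example from the post):

library(data.table)
library(dplyr)

dt <- data.table(g = c("a", "a", "b"), v = c(1, 2, 3))

# data.table: one dt[i, j, by] expression
dt[v > 0, .(mean_v = mean(v)), by = g]

# dplyr: the same grouped summary spelled out verb by verb
as.data.frame(dt) %>%
  filter(v > 0) %>%
  group_by(g) %>%
  summarize(mean_v = mean(v))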
Finally, data.table is dependency-free, so prototypes can easily be moved to production without dragging a dozen tidyverse packages along. This makes dplyr suitable only for a subset of prototyping tasks on moderately sized datasets.
Thus, there are two competing systems: one is fast and portable, the other is slow and imports tons of dependencies. Maybe promoting the better alternative is not a “point against tidyverse”, but a point for using R?
Great post. Agree with all points. data.table is very underappreciated. I also never understood the stigma that base R is hard to learn and that one must use the tidyverse (I think this stigma was created as a way to promote the tidyverse). My experience with teaching new users has been more aligned with yours.
To Marta’s comment: I deal with files this size all the time; it is not uncommon in 2018. Using the tidyverse on AWS or something similar is a pain due to all the dependencies, especially when you want to spin up many jobs at once for a short time period.
Thank you for the post!
There are two kinds of R users: those who are very fluent and comfortable with programming and those who are less so (maybe because they are data analysts with no computer science background).
I belong to the second category: I started with base R and did a few projects, always with a few months between them. It felt as if I had to relearn the whole thing every single time.
I would probably have switched to Python, but then came dplyr and it was a real game changer: yes, I did have to learn the 5 dplyr verbs, but it felt just like SQL, so easy and plain that just going over the vignette examples was enough.
Not only can I start a new project after a few months away from data science work, I can actually read and understand my own code from a couple years ago! ;-)
Sometimes I have had speed issues: some function took too long for my taste. Often it was in the data wrangling phase, and it would only run a few times until I got it right, and then I cached the object in an RDS file. No worries there, especially as I usually build a sample tibble to get what I need before running it on the bigger files.
But sometimes the speed issue is in a function that I need to run more often. Then I am really happy to fall back on data.table, and yes, I often get significant improvements. But I struggle to get there. I know it’s just dt[i, j, by], but it takes me forever to get the i, j, by, on, .SD, etc. right. It’s worth it when I know the function will run often enough so that it compensates for the time it takes me to get everything right. If not, I feel I’m better off with the plain, simple dplyr code, even if it is slower!
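For reference, a tiny reminder-style sketch of those pieces on invented data:

library(data.table)

dt <- data.table(g = c("a", "a", "b"), x = 1:3, y = 4:6)

# i = row filter, j = what to compute, by = grouping
dt[x > 1, .(total = sum(y)), by = g]

# .SD = the per-group subset of columns; .SDcols picks which ones
dt[, lapply(.SD, mean), by = g, .SDcols = c("x", "y")]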
Like Marta, I seldom analyze files with 1M rows. The datasets I usually work with are at most in the 100k range, and results, even if they are 2, 3 or 5 times slower, appear about the same to me.
So I feel we’re lucky to be in a community where each user can get the tools he/she needs, and we need to thank both Matt Dowle and Hadley Wickham for their relentless efforts to lead the development of the tools that allow R to continue to grow and us to become more and more productive.
Completely agree.
Also wholeheartedly agree.
I have colleagues that heavily use data.table, whereas I happen to have a personal preference for tidyverse.
There’s no stigma around either approach, and as long as everyone is sensible enough to consider the pros and cons of each for specific circumstances, that’s no different from a sensible discussion about the pros and cons of using a different language or database or anything else for a specific problem.
Indeed, it’s worth noting that performance and dependencies aren’t the only criteria to consider. Familiarity and the availability of support to maintain the work in future are also crucial to take into account: there’s no point using data.table or the tidyverse, or Python or R, if there aren’t people around who are able to support what’s been done.
As someone who has strongly encouraged the adoption of R at my work, I think the most important thing that has helped with pushing that adoption has been the R community as a whole.