I’ve been seeing a lot of hot takes on whether one should do data science in R or in Python. I’ll comment generally on the topic, and then add my own myopic gear-head micro-benchmark.
I’ll jump in. If learning the language is the big step, then you are a beginner in the data science field. So the right choice is to work with others and use the tools they are most able to teach you.
After that there are other considerations: what or whom are you working or integrating with? If you are working with statisticians, they will likely want R. If you are working with software engineers, they will likely want Python. If you are actually adding value in terms of translating business needs, picking machine learning models, choosing methods for organizing data, designing experiments, controlling for bias, and reducing variance, then programming is the least of your worries.
And the part I really wanted to write about from the start: what about the tooling?
The answer is: both Python and R are slow interpreted languages. In neither one can you expect to efficiently execute non-trivial code per row or per instance of data. This is in contrast to Java, which has a fast JIT-compiled runtime and can specify operations on instances instead of relying on vectorized notation. Vectorized methodology means specifying operations on columns, which are then applied uniformly to all rows by the underlying data system.
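To make the per-row versus per-column distinction concrete, here is a minimal sketch using only the Python standard library. The column names (`price`, `qty`) and both function names are made up for illustration. This is not a benchmark, and real data frame systems such as Pandas or Polars push the idea much further than the built-in `map()` does.

```python
import operator

# Hypothetical toy data, stored column-wise (one list per column).
columns = {
    "price": [2.0, 3.5, 1.25, 4.0],
    "qty": [3, 2, 8, 1],
}


def total_per_row(cols):
    """Row-at-a-time style: interpreted Python code runs once per row."""
    out = []
    for i in range(len(cols["price"])):
        # Build a row "instance", then apply the logic to it.
        row = {name: values[i] for name, values in cols.items()}
        out.append(row["price"] * row["qty"])
    return out


def total_vectorized(cols):
    """Column-at-a-time style: the operation is named once, and the
    loop over rows happens inside the built-in map()."""
    return list(map(operator.mul, cols["price"], cols["qty"]))


row_result = total_per_row(columns)
vec_result = total_vectorized(columns)
print(row_result == vec_result)  # True
```

Both functions compute the same totals. The difference is where the loop over rows lives: in user-written interpreted code, or inside the underlying system.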
So one can ask: is vectorized Python fast or slow compared to vectorized R?
In my opinion the default implementations of both languages are in fact slow, even when vectorized. However each language has access to a very fast specialized data frame system. In R, it is hands-down data.table. In Python, I am coming to think it may be Polars.
Some benchmarks I trust on this are here.
I’ve only recently started experimenting with building a Polars engine for the data algebra (the data algebra being a system for manipulating data in memory or generating the equivalent SQL for databases). The data algebra was always intended to have multiple data engines. I started and abandoned a few other data algebra engines, as the target systems were in fact slower and less expressive than the current Pandas engine. Many of the “we are great at scale” data engines choke on common application tasks such as computing summaries over groups, and are fast only on uninteresting benchmark-only tasks.
My new Polars adapter is not complete, but developing against Polars has been rapid, enjoyable, and quite impressive.
Here are my latest results on a simple grouped application task in Python and in R:
Most importantly: the same speed is available both in R and in Python, if you use the appropriate package.
Notice that in R, of the packages tested, only those based on data.table (data.table, dtplyr, and rqdatatable) are fast (achieve sub-second task timings). In Python things are a bit better, but notice Polars and packages using Polars pulling ahead of the group. (And I have seen Pandas perform much slower on non-trivial tasks with more columns.)
The purpose of the data algebra is to have a unified Codd relational style notation for data manipulation that can be used both on common data frame engines (such as Pandas in memory) and to generate SQL. It was always intended as a system for cleanly specifying data transforms, and is a language for composing transforms. Pandas, Polars, and SQL are realizations of such transforms over data. Being able to switch the implementation at run time is a big advantage: one can run the same code locally or at scale in a remote database. Adding Polars as a data algebra engine is an exciting opportunity: a fast realization of what I consider to be a very convenient and powerful notation.
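The "specify once, realize on several engines" idea can be sketched in a few lines of standard-library Python. To be clear, this is not the data algebra API. The `GroupedSum` class and the `run_in_memory` and `to_sql` helpers are hypothetical names invented for this illustration, and SQLite merely stands in for a remote database.

```python
import sqlite3
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupedSum:
    """Hypothetical transform description: sum `value` within each `group`."""
    table: str
    group: str
    value: str


def run_in_memory(spec, rows):
    """Realize the transform directly over a list of dict rows."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[spec.group]] += row[spec.value]
    return sorted(totals.items())


def to_sql(spec):
    """Realize the same transform as a SQL query string."""
    return (
        f'SELECT "{spec.group}", SUM("{spec.value}") '
        f'FROM "{spec.table}" GROUP BY "{spec.group}" ORDER BY "{spec.group}"'
    )


# Hypothetical example data and one transform specification.
rows = [
    {"g": "a", "x": 1.0},
    {"g": "b", "x": 2.0},
    {"g": "a", "x": 3.0},
]
spec = GroupedSum(table="d", group="g", value="x")

# Engine 1: plain Python, in memory.
local_result = run_in_memory(spec, rows)

# Engine 2: generated SQL, run here against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE d (g TEXT, x REAL)")
conn.executemany("INSERT INTO d VALUES (:g, :x)", rows)
db_result = conn.execute(to_sql(spec)).fetchall()
conn.close()

print(local_result == db_result)  # True
```

The transform is described as data (`spec`) rather than as code tied to one engine, which is what lets the choice of engine be deferred to run time.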
(Python benchmark code here, R benchmark code here)
Thanks for this interesting post. Didn’t know about Polars yet, but then I am more of an R expert than a Python expert.
There is an RPolars in development (ref: https://rpolars.github.io ). However, for R I strongly recommend data.table.
Thanks, interesting write-up! Nice to see the code, and to see the dtplyr approach included. For timings, I now prefer bench over microbenchmark, as it is more explicit about garbage collection. I found that often there are noticeable outliers.