One of the things I like about R: because it is not used for systems programming, you can install your own current version of R without interference from a system copy that is deliberately held back at an older version (for reasons of script compatibility). R is conveniently distributed as a single package (with automated installation of additional libraries).
Want to do some data analysis? Install R, load your data, and go. You don’t expect to spend hours on system administration just to get back to your task.
Python, being a popular general-purpose language, does not have this advantage, but thanks to Anaconda from Continuum Analytics you can skip (or at least delegate) a lot of the pain imposed by the system environment. With Anaconda, trying out Python packages (Jupyter, scikit-learn, pandas, numpy, sympy, cvxopt, bokeh, and more) becomes safe and pleasant.
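As a quick check (a minimal sketch, assuming a stock Anaconda install), the scientific stack should import cleanly from a single environment:

# confirm the Anaconda scientific stack is present and consistent
import numpy, pandas, sklearn, sympy, bokeh

for m in (numpy, pandas, sklearn, sympy, bokeh):
    print(m.__name__, m.__version__)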
Software engineers are forced to decide early whether their tools are designed to support complexity or to enforce clarity (single install, only the current version, and so on). Unfortunately for the end users, complexity almost always wins. Seemingly clever ideas like late binding and dependency injection move support costs from developers to users, and everything becomes a variation of DLL hell.
It gets to the point where it is considered normal and acceptable to require provisioning of images, virtual machines, or Docker containers just to try a piece of software. Things that were once linked now struggle to find each other over perverse message buses.
Lt. Gen. Brian Horrocks: “We will install all of the packages at once by air-dropping a combination of pre-prepared Amazon Machine Images, Docker containers, VMware virtual machines, Heroku apps, and BSD jails. I’m not saying that this will be the easiest party that we’ve ever attended, but I still wouldn’t miss it for the world.”
However, these containers have to come from somewhere, and I feel it makes sense to try to figure out whether a piece of software can be successfully installed or used at all (versus being unmaintainable but artificially kept alive through container services). This is why I have taken the trouble to run down references and actually work out how to install things (such as how to install Caffe deep neural nets from scratch, or R on EC2, or even basic Hadoop on EC2). Yes, there are ready-made containers and images for these things, but at some point you accumulate enough of these incompatible sub-universes that you have come “a bullshit too far”, and the machine stops.
This is why I was initially loath to use Anaconda for my Python installs. There is already too much learned helplessness in software engineering, and Anaconda would represent one more dependency that comes with its own costs and could break. However, after seeing Peter Wang’s talk at the 2015 Data Science Summit & DATO Conference, I decided to give Anaconda a try.
Summary: right now it works. My long instructions involving Python, pip, and many package repositories boil down to something as short as the following (this example is on Apple OS X):
# Install Anaconda from https://store.continuum.io/cshop/anaconda/

# add a few more packages
anaconda/bin/conda install -c https://conda.binstar.org/r rpy2
anaconda/bin/conda install cvxopt

# launch Launcher
open anaconda/Launcher.app

Then inside a Jupyter worksheet:

# enable the %%R magic (needed before %%R cells will run)
%load_ext rpy2.ipython

# then, in a new cell:
%%R
install.packages('ggplot2', repos='http://cran.us.r-project.org')
You now have your own copy of all the common Python scientific libraries and a second install of R (so you do have to re-install any R libraries you want to use), ready to work together through the rpy2 library. With minor changes (I decided to take the time to jump from Python 2 to 3) my old worksheets worked (ex1, ex2).
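For a taste of what that coupling looks like (a minimal rpy2 sketch, not taken from those worksheets):

# call R from Python through rpy2
import rpy2.robjects as robjects
from rpy2.robjects import FloatVector

# push a Python vector into R and evaluate R code on it
robjects.globalenv['x'] = FloatVector([1.0, 2.0, 3.0, 4.0])
print(robjects.r('mean(x)')[0])  # prints 2.5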
I would say definitely give Anaconda a try. Anaconda is responsible for installing the entire ecosystem (including the copy of R it wants to use), so the Anaconda developers directly experience “integration debt” (and presumably act in their own interest and continuously reduce this debt).
jmount: Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Anaconda is the only way to do DS in Python. I think DS in Python is generally a mistake for most of what you need in commercial work, but, for example, its random forest is superior to the ones I have used in R, and scikit is generally a good library with a lot of well-written code in it. Python does have things which are not available in R, usable neural net code for example, and its strengths as a development language could make it a good choice for blue-sky research (probably why there is usable NN code).
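For instance, the scikit random forest in question is only a few lines (a sketch on a toy dataset; settings are illustrative):

# fit scikit-learn's random forest on a toy dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(n_estimators=100)
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))  # in-sample accuracy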
Why I think Python is potentially bad for DS projects: it is weak at statistics, pandas is not as good as data.table on any axis of comparison, numpy is weak compared to R’s native array types, there is no real shiny equivalent (Bokeh … meh), plotting is generally not as good, etc. Basically, Python is a good language which is jury-rigged to do DS; R is designed to do it, and nothing else. Python DS is often like eating stew with a fork: doable, but not as efficient as the R spoon.
That said, Anaconda makes it possible. Using ports or the system Python is a nightmare.
Hi Scott, sorry to disagree on some points. Pandas: it was inspired by R’s data frame and made huge improvements; it is possibly one of the strongest points of Python DS. Plotting: not sure what you used, but matplotlib, seaborn, plotly, bokeh, and ggplot can do almost everything. Agreed that ggplot2 is slightly better, but that’s it. There are some exotic statistics packages in R that have no Python equivalent, but Python provides good enough basic functionality, and if you understand basic MLEs, filters, and optimization you can code the exotic stuff yourself, and in this way have a lot more control over your code.
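For example, hand-rolling a maximum likelihood estimate is only a few lines (a sketch fitting a normal distribution with scipy.optimize; the model is illustrative):

# hand-rolled MLE: fit a normal by minimizing the negative log-likelihood
import numpy as np
from scipy.optimize import minimize

np.random.seed(0)
data = np.random.normal(loc=2.0, scale=1.5, size=1000)

def neg_log_lik(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)  # exp() keeps sigma positive
    return np.sum(0.5 * np.log(2 * np.pi) + log_sigma
                  + (data - mu) ** 2 / (2 * sigma ** 2))

fit = minimize(neg_log_lik, x0=[0.0, 0.0])
print(fit.x[0], np.exp(fit.x[1]))  # estimates near 2.0 and 1.5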