
New series: R and big data (concentrating on Spark and sparklyr)

Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with big-data using R, share some hints, and even admit to some of the warts found in this combination of systems.

The ability to perform sophisticated analyses and modeling on “big data” with R is rapidly improving, and this is the time for businesses to invest in the technology. Win-Vector can be your key partner in methodology development and training (through our consulting and training practices).

[“We Can Do It!” poster, J. Howard Miller, 1943.]

The field is exciting, rapidly evolving, and even a touch dangerous. We invite you to start using Spark through R and are starting a new series of articles tagged “R and big data” to help you produce production quality solutions quickly.

Please read on for a brief description of our new articles series: “R and big data.”


R is a best of breed in-memory analytics platform. R allows the analyst to write programs that operate over their data and bring in a huge suite of powerful statistical techniques and machine learning procedures. Spark is an analytics platform designed to operate over big data that exposes some of its own statistical and machine learning capabilities. R can now be operated “over Spark”. That is: R programs can delegate tasks to Spark clusters and issue commands to Spark clusters. In some cases the syntax for operating over Spark is deliberately identical to working over data stored in R.
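As a minimal sketch of what this looks like (assuming a local Spark install, for example via sparklyr::spark_install(); the table name "iris_spark" is our own choice for illustration):

```r
library(sparklyr)
library(dplyr)

# connect to a local Spark instance (assumes Spark has been installed locally)
sc <- spark_connect(master = "local")

# copy a local R data.frame to Spark; the return value is a remote table handle
iris_tbl <- copy_to(sc, iris, "iris_spark")

# the same dplyr verbs work on the remote handle as on a local data.frame;
# sparklyr translates them to Spark SQL (note: "." in column names becomes "_")
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length))

spark_disconnect(sc)
```

The point of the sketch is the middle pipeline: it is ordinary dplyr code, but the work happens on the Spark side.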

Why R and Spark

The advantages are:

  • Spark can work at a scale and speed far beyond native R. The ability to send work to Spark increases R’s capabilities.
  • R has machine learning and statistical capabilities that go far beyond what is available on Spark or any other “big data” system (many of which are descended from report generation or basic analytics). The ability to use specialized R methods on data samples yields additional capabilities.
  • R and Spark can share code and data.
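One common pattern combining these advantages (sketched here under the assumption of an established sparklyr connection and a remote handle we call big_tbl; the model columns y, x1, and x2 are hypothetical) is to down-sample in Spark and apply a specialized in-memory R method to the sample:

```r
library(sparklyr)
library(dplyr)

# down-sample on the cluster, then collect the (now small) sample into R
local_sample <- big_tbl %>%
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 2017) %>%
  collect()

# any in-memory R method now applies; a linear model is shown as a stand-in
fit <- lm(y ~ x1 + x2, data = local_sample)
summary(fit)
```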

The R/Spark combination is not the only show in town, but it is a powerful capability that may not be safe to ignore. We will also talk about additional tools that can be brought into the mix, such as the powerful large-scale machine learning capabilities from h2o.

The warts

Frankly, a lot of this is very new and still on the “bleeding edge.” Spark 2.x has only been available in stable form since July 26, 2016 (just under a year). Spark 2.x is much more capable than the Spark 1.x series in terms of both data manipulation and machine learning, so we strongly suggest clients insist on Spark 2.x clusters from their infrastructure vendors (such as Cloudera, Hortonworks, MapR, and others), despite Spark 2.x having only recently become available in these packaged solutions. The sparklyr adapter itself first appeared on CRAN on September 24, 2016. And SparkR only began shipping with Spark as of Spark 1.4 in June 2015.

While R/Spark is indeed a powerful combination, nobody seems to be sharing many production experiences or best practices with it yet.

Some of the problems are sins of optimism. A lot of people still confuse successfully standing up a cluster with effectively using it. Others confuse the statistical and machine learning procedures available in in-memory R (which are very broad and often quite mature) with those available in Spark (which are less numerous and less mature).

Our goal

What we want to do with the “R and big data” series is:

  • Give a taste of some of the power of the R/Spark combination.
  • Share a “capabilities and readiness” checklist you should apply when evaluating infrastructure.
  • Start to publicly document R/Spark best practices.
  • Describe some of the warts and how to work around them.
  • Share fun tricks and techniques that make working with R/Spark much easier and more effective.

The start

Our next article in this series will be up soon; it will discuss the nature of data handles in sparklyr (one of the R/Spark interfaces) and how to manage your data inventory neatly.

Please follow “R and Big Data” for additional articles as they come out.


Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

5 replies

  1. hi,
    I’m looking forward to your guides :)

    I have already started working with R and sparklyr (Cloudera) and I have a certain problem:

    1. when I upload a grouped-by dataframe to Spark and try to work with do() (dplyr) I get empty data.
    Do you know how to work on a Spark dataframe group by group?


    1. Great question, thank you very much for asking it here.

      The issue is that a lot of dplyr/tidyr functions currently are only implemented for local data.frames. What happens is: the function (say summary for example, even though that is actually an S3-generic from base) ends up getting applied to the sparklyr handle, and not to the remote data the handle is referring to (please see here for an example).

      The replyr package (available on CRAN and in a development version; right now we suggest the development version) supplies some replacement functions (such as a split that works on Spark data). I have an example solution here based on your desired workflow. The punchline is that you can write working code that looks like the following:

      diris %>% 
        replyr_split('Species') %>%
        lapply(f2)

      I am going to be writing a lot on topics like this in this series.

  2. thanks :) it works.

    but what if I want to work on a big number of groups?


    1. Then you must design the function you want to apply to be “group compatible” and apply it to grouped data instead of using do(). In this case, replacing head(2) with filter(between(row_number(), 1, 2)) does the trick, but such translations may not always be possible. I’ve extended the example to show some variations.
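      A minimal sketch of that translation (iris_tbl is an assumed sparklyr table handle; note that on Spark, row_number() is a window function and typically needs an explicit ordering, supplied here with arrange()):

```r
library(dplyr)

# group-compatible alternative to do(head(., 2)):
# take the first two rows per group via a windowed filter
iris_tbl %>%
  group_by(Species) %>%
  arrange(Petal_Length) %>%
  filter(between(row_number(), 1, 2)) %>%
  ungroup()
```

Because filter() and row_number() translate to Spark SQL window operations, this runs entirely on the cluster, with no per-group round trips to R.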