While working on a large client project using Sparklyr and multinomial regression we recently ran into a problem: Apache Spark chooses the order of multinomial regression outcome targets, whereas R users are used to choosing the order of the targets (please see here for some details). So to make things […]
Recently I noticed that the R package sparklyr had the following odd behavior: suppressPackageStartupMessages(library("dplyr")) library("sparklyr") packageVersion("dplyr") #>  ‘0.7.2.9000’ packageVersion("sparklyr") #>  ‘0.6.2’ packageVersion("dbplyr") #>  ‘22.214.171.12400’ sc <- spark_connect(master = ‘local’) #> * Using Spark: 2.1.0 d <- dplyr::copy_to(sc, data.frame(x = 1:2)) dim(d) #>  NA ncol(d) #>  […]
In our latest R and Big Data article we discuss replyr. Why replyr replyr stands for REmote PLYing of big data for R. Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or […]
In our latest installment of “R and big data” let’s again discuss the task of left joining many tables from a data warehouse using R and a system called "a join controller" (last discussed here). One of the great advantages to specifying complicated sequences of operations in data (rather than […]
This note describes a useful replyr tool we call a "join controller" (and is part of our "R and Big Data" series, please see here for the introduction, and here for one our big data courses).
In our latest “R and big data” article we show how to manage intermediate results in non-trivial Apache Spark workflows using R, sparklyr, dplyr, and replyr.
Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”): There should be one– and preferably only one –obvious way to do it. Frankly in R (especially once you add many packages) there is usually more than one way. As an example we will talk […]
Our next "R and big data tip" is: summarizing big data. We always say "if you are not looking at the data, you are not doing science"- and for big data you are very dependent on summaries (as you can’t actually look at everything). Simple question: is there an easy […]
When working with big data with R (say, using Spark and sparklyr) we have found it very convenient to keep data handles in a neat list or data_frame. Please read on for our handy hints on keeping your data handles neat.
Win-Vector LLC has recently been teaching how to use R with big data through Spark and sparklyr. We have also been helping clients become productive on R/Spark infrastructure through direct consulting and bespoke training. I thought this would be a good time to talk about the power of working with […]