A big thank you to Databricks for working with us and sharing: rquery: Practical Big Data Transforms for R-Spark Users How to use rquery with Apache Spark on Databricks rquery on Databricks is a great data science tool.
Estimated reading time: 19 seconds
We here at Win-Vector LLC have some really big news we would please like the R-community’s help sharing. vtreat version 1.2.0 is now available on CRAN, and this version of vtreat can now implement its data cleaning and preparation steps on databases and big data systems such as Apache Spark. […]
Estimated reading time: 1 minute
Trick question: is a 10,000 cell numeric data.frame big or small? In the era of “big data” 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box). The joking answer is: it is small when they are selling you […]
Estimated reading time: 6 minutes
We are excited to announce the rquery R package. rquery is Win-Vector LLC‘s currently in development big data query tool for R. rquery supplies set of operators inspired by Edgar F. Codd‘s relational algebra (updated to reflect lessons learned from working with R, SQL, and dplyr at big data scale […]
Estimated reading time: 2 minutes
For some time we have been teaching R users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis." The idea behind the advice is: working with fewer columns makes for quicker queries. photo: Jacques Henri […]
Estimated reading time: 4 minutes
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working R and big data I think the seplyr package can be a valuable tool.
Estimated reading time: 1 minute
Win-Vector LLC is proud to introduce two important new tool families (with documentation) in the 0.5.0 version of seplyr (also now available on CRAN): partition_mutate_se() / partition_mutate_qt(): these are query planners/optimizers that work over dplyr::mutate() assignments. When using big-data systems through R (such as PostgreSQL or Apache Spark) these planners […]
Estimated reading time: 2 minutes
Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R. From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.
Estimated reading time: 2 minutes
Recently I noticed that the R package sparklyr had the following odd behavior: suppressPackageStartupMessages(library("dplyr")) library("sparklyr") packageVersion("dplyr") #> [1] ‘0.7.2.9000’ packageVersion("sparklyr") #> [1] ‘0.6.2’ packageVersion("dbplyr") #> [1] ‘1.1.0.9000’ sc <- spark_connect(master = ‘local’) #> * Using Spark: 2.1.0 d <- dplyr::copy_to(sc, data.frame(x = 1:2)) dim(d) #> [1] NA ncol(d) #> [1] […]
Estimated reading time: 5 minutes
In our latest R and Big Data article we discuss replyr. Why replyr replyr stands for REmote PLYing of big data for R. Why should R users try replyr? Because it lets you take a number of common working patterns and apply them to remote data (such as databases or […]
Estimated reading time: 15 minutes