Trick question: is a 10,000-cell numeric data.frame big or small? In the era of “big data” 10,000 cells is minuscule. Such data could fit on fewer than 1,000 punched cards (or less than half a box). The joking answer is: it is small when they are selling you […]
For some time we have been teaching R users "when working with wide tables on Spark or on databases: narrow to the columns you really want to work with early in your analysis." The idea behind the advice is: working with fewer columns makes for quicker queries. photo: Jacques Henri […]
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data, I think the seplyr package can be a valuable tool.
Practical Data Science with R errata update: Java SQLScrewdriver replaced by R procedures and article
We have updated the errata for Practical Data Science with R to reflect that it is no longer worth the effort to use the Java version of SQLScrewdriver as described. We are very sorry for any confusion, trouble, or wasted effort bringing in Java software (something we are very familiar […]
It is a bit of a shock when R dplyr users switch from using a tbl implementation based on R in-memory data.frames to one based on a remote database or service. A lot of the power and convenience of the dplyr notation is hard to maintain with these more restricted […]
I have previously written on using containerized PostgreSQL with R. This shows the steps for using containerized MySQL with R.
A great number of readers reacted very positively to Nina Zumel’s article Using PostgreSQL in R: A quick how-to. Part of the reason is she described an incredibly powerful data science pattern: using a formerly expensive permanent system infrastructure as a simple transient tool. In her case the tools were […]