We have a new rquery vignette here: Working with Many Columns. This is an attempt to get back to writing about how to use the package to work with data (versus the other day’s discussion of package design/implementation). Please check it out.

Estimated reading time: 21 seconds

Introduction I would like to talk about some of the design principles underlying the data_algebra package (and also in its sibling rquery package). The data_algebra package is a query generator that can act on either Pandas data frames or on SQL tables. This is discussed on the project site and […]

Estimated reading time: 31 minutes

Our goal has been to make rquery the best query generation system for R (and to make data_algebra the best query generator for Python). Let’s see what rquery is good at, and what new features are making rquery better.

Estimated reading time: 10 minutes

Introduction rquery is a data wrangling system designed to express complex data manipulation as a series of simple data transforms. This is in the spirit of R’s base::transform(), or dplyr’s dplyr::mutate() and uses a pipe in the style popularized in R with magrittr. The operators themselves follow the selections in […]

Estimated reading time: 14 minutes

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and […]

Estimated reading time: 25 minutes

This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement. The original published timings were as follows: With performance metrics: measurements are marketing. So let’s dig into the above a bit.

Estimated reading time: 3 minutes

I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem. Please read on for a brief benchmark comparing these methods/solutions.

Estimated reading time: 6 minutes

A big thank you to Databricks for working with us and sharing “rquery: Practical Big Data Transforms for R-Spark Users: How to use rquery with Apache Spark on Databricks”. rquery on Databricks is a great data science tool.

Estimated reading time: 19 seconds

rquery and rqdatatable are new R packages for data wrangling; either at scale (in databases, or big data systems such as Apache Spark), or in-memory. The packages speed up both execution (through optimizations) and development (through a good mental model and up-front error checking) for data wrangling tasks. Win-Vector LLC’s […]

Estimated reading time: 1 minute

Introduction In this note we will show how to speed up work in R by partitioning data and process-level parallelization. We will show the technique with three different R packages: rqdatatable, data.table, and dplyr. The methods shown will also work with base-R and other packages. For each of the above […]

Estimated reading time: 16 minutes