
rquery: Fast Data Manipulation in R

Win-Vector LLC recently announced the rquery R package, an operator-based query generator.

In this note I want to share some exciting and favorable initial rquery benchmark timings.

Note: we have now (1-16-2018) re-run this benchmark with a faster, better-tuned version of the data.table solution (same package, just better use of it).

Let’s take a look at rquery’s new “ad hoc” mode (made convenient through wrapr’s new “wrapr_applicable” feature). In this mode rquery works on in-memory data.frame data by sending it to a database, processing it on the database, and then pulling the result back. We concede this is a strange way to process data, and not rquery’s primary purpose (the primary purpose being generation of safe, high-performance SQL for big data engines such as Spark and PostgreSQL). However, our experiments show that it is in fact a competitive technique.
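
To make the round trip concrete, here is a minimal sketch of what the ad hoc mode does, written directly against DBI (the connection details, the temporary table name, and the in-memory data.frame d are placeholders; in actual use rquery generates and manages the SQL for you):

library("DBI")
library("RPostgres")

# connect to a local PostgreSQL instance (connection details are placeholders)
db <- dbConnect(RPostgres::Postgres(), dbname = "testdb")

# ship the in-memory data.frame to a temporary database table
dbWriteTable(db, "d_tmp", d, temporary = TRUE, overwrite = TRUE)

# run the (rquery-generated) SQL on the database and pull the result back
res <- dbGetQuery(db, "SELECT * FROM d_tmp")  # stand-in for the real query

dbDisconnect(db)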

We’ve summarized the results of several experiments (experiment details here) in the following graph (graphing code here). The benchmark task was hand-implementing logistic regression scoring. This is an example query we have been using for some time.
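
Concretely, the scoring is a per-subject softmax over the assessment totals. Writing $a_{i,j}$ for the assessmentTotal of subject $i$ on survey category $j$, and scale for the constant used in the experiment:

$$p_{i,j} = \frac{\exp(\mathrm{scale} \cdot a_{i,j})}{\sum_{j'} \exp(\mathrm{scale} \cdot a_{i,j'})}$$

The category with the highest $p_{i,j}$ is reported as the subject’s diagnosis.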

[Figure: distribution of repeated run times for each in-memory method]

The graph above shows the distribution of repeated run times for:

  • data.table in memory: How long it takes data.table to complete the task starting from an in-memory data.frame. We take data.table’s timing as the definition of “fast for R”, as data.table is commonly seen as the fastest system for in-memory data manipulation in R (one straightforward data.table phrasing of the task is sketched just after this list).
  • rquery in memory: How long it takes rquery to complete the task starting from an in-memory data.frame and returning to an in-memory data.frame. The actual implementation is performed by moving data to a PostgreSQL database for processing and then collecting the result back. The observed average runtime is slower than data.table. However, as we will see, that is actually a good result.
  • dplyr tbl in memory: A standard dplyr in-memory pipeline working on a data.frame that is first converted into a tbl structure (to see if this is a significant speedup).
  • dplyr in memory no grouped filter: A re-structured dplyr pipeline avoiding filtering inside a grouped context.
  • dplyr from memory to db and back: Move the data to a database, perform the dplyr steps there, and then copy the data back.
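
For reference, here is one straightforward data.table phrasing of the task (a sketch, not the tuned solution used in the re-run benchmark; d and scale are as in the experiment materials):

library("data.table")

dt <- as.data.table(d)

# per-subject softmax over assessment totals
dt[, probability := exp(assessmentTotal * scale) /
     sum(exp(assessmentTotal * scale)),
   by = subjectID]

# order so the highest-probability row is last within each subject
setorder(dt, subjectID, probability, surveyCategory)
res <- dt[, .SD[.N], by = subjectID]

setnames(res, "surveyCategory", "diagnosis")
res <- res[, .(subjectID, diagnosis, probability)]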

As detailed in the experiment’s backing materials, the task is processing 40,000 records through either of the following non-trivial pipelines (the first for dplyr, the second for rquery):

library("dplyr")

# 'scale' is a numeric constant defined in the experiment materials
dplyr_pipeline <- . %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale) /
           sum(exp(assessmentTotal * scale), na.rm = TRUE)) %>%
  arrange(probability, surveyCategory) %>%
  filter(row_number() == n()) %>%
  ungroup() %>%
  rename(diagnosis = surveyCategory) %>%
  select(subjectID, diagnosis, probability) %>%
  arrange(subjectID)
library("wrapr")   # supplies the dot pipe %.>% and :=
library("rquery")

rquery_pipeline <- . := {
  extend_nse(.,
             probability :=
               exp(assessmentTotal * scale) /
               sum(exp(assessmentTotal * scale)),
             count := count(1),
             partitionby = 'subjectID') %.>%
  extend_nse(.,
             rank := rank(),
             partitionby = 'subjectID',
             orderby = c('probability', 'surveyCategory')) %.>%
  rename_columns(., 'diagnosis' := 'surveyCategory') %.>%
  select_rows_nse(., rank == count) %.>%
  select_columns(., c('subjectID',
                      'diagnosis',
                      'probability')) %.>%
  orderby(., 'subjectID')
}
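
For orientation, applying the two pipelines looks roughly like the following (a sketch: d is the in-memory data.frame, and the ad hoc database handle is configured as described in the experiment materials):

# dplyr: the pipeline is a function of the data.frame
res_dplyr <- dplyr_pipeline(d)

# rquery ad hoc mode: wrapr's %.>% pipes the data.frame into the
# pipeline, and the work is routed through the configured database
res_rquery <- d %.>% rquery_pipeline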

The primary purpose of rquery is pure in-database processing; in-memory processing is only a convenience. So let’s look at database-only timings. In these tasks the data starts in the database (as is typical for large data projects) and is either consumed (by a row count, called “count”) or saved as a database table (called “land”). In neither case does data move in or out of the database, so those overheads are skipped.

[Figure: distribution of run times for the database-only “count” and “land” tasks]
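
For concreteness, here is a rough sketch of the two database-only endpoints (d_db is a table description for the data already in PostgreSQL, res is a placeholder table name, and the helper names follow current rquery and may differ across versions):

sql <- to_sql(rquery_pipeline(d_db), db)

# "count": consume the result with a row count; no data leaves the database
n <- dbGetQuery(db, paste0("SELECT COUNT(1) FROM ( ", sql, " ) q"))

# "land": save the result as a database table; again no data movement
dbExecute(db, paste0("CREATE TABLE res AS ", sql))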

And, as we demonstrated earlier, rquery’s query generator has additional features (not shown here) that can yield additional speedup in production environments (and also increase query safety and query power).

I think results this good, this early in the project, are very promising. They are the product of design decisions that I think will continue to pay off, and of some intentional scope restrictions (rquery targets only SQL engines, and initially only a few of them: Spark, PostgreSQL, and, with some caveats, SQLite). More work needs to be done to add all the SQL translation features needed for general production use, but we think we have demonstrated a good foundation to build on.

Categories: Computer Science, Pragmatic Data Science

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
