Base R can be Fast

“Base R” (call it “Pure R” or “Good Old R”; just don’t call it “Old R”, or late for dinner) can be fast for in-memory tasks. This is despite the commonly repeated claim that “packages written in C/C++ are (edit: “always”) faster than R code.”

The benchmark results of “rquery: Fast Data Manipulation in R” really called out for follow-up timing experiments. This note is one such set of experiments, this time concentrating on in-memory (non-database) solutions.

Below is a graph summarizing our new results for a number of in-memory implementations, a range of data sizes, and two different machine types.

[Figure: throughput (rows per millisecond) versus data size for each in-memory method, on two machine types]

The graph summarizes the performance of four solutions to the “scoring logistic regression by hand” problem:

  • Optimized Base R: a specialized “pre-allocate and work with vectorized indices” method. This is fast because it expresses our particular task in a small number of purely base R vectorized operations. We are hoping to build some teaching materials about this methodology.
  • Idiomatic Base R (shown dashed): an idiomatic R method using stats::aggregate() to solve the problem (a minimal sketch of this style of solution appears just after this list). This method is re-plotted in both graphs as a dashed line and serves as a good dividing line between what is fast and what is slow.
  • data.table: a straightforward data.table solution (another possible demarcation between fast and slow).
  • dplyr (no grouped filter): a dplyr solution (tuned to work around some known issues).
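
For orientation, here is a minimal sketch of the stats::aggregate() style of solution referenced in the “Idiomatic Base R” bullet above. This is not the exact benchmarked code; it assumes a data.frame d with the example’s subjectID, surveyCategory, and assessmentTotal columns and a numeric scale, as in the article’s setup.

    # sketch only: normalize per-subject exp() scores, then pick the winning category
    d$probability <- exp(d$assessmentTotal * scale)
    totals <- aggregate(probability ~ subjectID, data = d, FUN = sum)
    d <- merge(d, totals, by = "subjectID", suffixes = c("", "_total"))
    d$probability <- d$probability / d$probability_total
    winners <- aggregate(probability ~ subjectID, data = d, FUN = max)
    result <- merge(winners, d)[, c("subjectID", "surveyCategory", "probability")]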

This benchmarking series reveals a number of surprises. It suggests we should trust conventional wisdom a bit less and budget more time for benchmarking pilot experiments in future R projects. Contrary to claims otherwise: base R code can be good code, and with some care it can sometimes perform better than package alternatives. There is no need to apologize for writing R code when using R.

Benchmarking details can be found here and here, and plotting details here.

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

14 replies

    1. It is rows per millisecond (1/1000th of a second). I chose data.table as the definition of fast in my last article on this topic, as it outperformed all the database methods.

      For example, on the EC2 machine data.table processes 2 million rows (1 million instances, as each instance generates 2 rows) in 19.4 seconds, yielding a rate of 2000000/(1000 * 19.393562) ≈ 103 rows/millisecond, which agrees with the presented graph. We vary the size from 2 rows to 2 million rows by multiples of 10.
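
      For concreteness, that rate arithmetic as a quick check:

      rows <- 2e6              # the 2 million row case on the EC2 machine
      seconds <- 19.393562     # measured data.table run time
      rows / (1000 * seconds)  # about 103 rows per millisecond, matching the graph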

      A possibility is that I did a poor job using data.table (and if so, I apologize). The code is here. Perhaps I need to call a specialized rank() operator (I assumed data.table re-mapped rank() to frank() internally)? The main thing I would do differently is try to avoid multiple grouping by using the multi-assignment operator.
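
      For readers who want to see what those two ideas look like, here is a hedged sketch (not the benchmarked code; it assumes a data.table d with the example’s subjectID and probability columns): an explicit frank() call, plus a single grouped multi-assignment so the group is only formed once.

      library(data.table)
      # sketch only: explicit frank() and one grouped multi-assignment
      d[, c("count", "rank") := list(.N, frank(probability)), by = subjectID]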

      I don’t think it is measurement overhead, as some methods do show up faster than others (and measurement overhead should crush them all together).

      1. It seems like a convoluted presentation of results to me. The simple absolute timing of 3 consecutive runs is what works for me. Any mention of milliseconds and an x-axis maximum of just 1e6 rows is setting off alarm bells. No mention of GB size and no actual run times. If you’re saying a single run did actually take multiple seconds (19.4s you write), then why use a divisor of milliseconds rather than seconds? That choice conveyed the wrong message to me. I did click through to your results, and I thought I saw the timings were in milliseconds there too (results from microbenchmark), before you divided, before I commented. Here’s my idea of well-presented benchmark results (that I did): https://github.com/Rdatatable/data.table/wiki/Benchmarks-%3A-Grouping. Here’s another from fst’s author: http://blog.fstpackage.org/2018/01/fst_0.8.0/. In both, simple sizes in GB and simple timings in seconds or minutes are conveyed easily. If you have time to wait for 100 runs, then the test is so quick anyway that nobody may care. Which is why 3 runs, reporting the GB size and the time of a single run in seconds/minutes, is good practice.
        Let’s explore what “1e6 rows” means. We don’t know the row size. Let’s guess at 10 bytes per row: 1e6 * 10 bytes = 1e7 bytes = 9MB. Well, 9MB is only a little bit bigger than my L3 cache (6MB) but easily fits in my L4 cache (128MB). That’s titchy! And that can be why inferences on small data (1e6 rows can be titchy, depending on how wide the rows are!) don’t hold when you scale up to multiple GB. I don’t think the reader has these questions or concerns when they read the two good examples I’ve pointed to. But they do with your presentation.
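
        For anyone wanting to report that size directly, it is a one-liner in R (dLocal standing in for whatever benchmark table is in memory):

        # report the in-memory footprint of the benchmark data
        format(utils::object.size(dLocal), units = "Gb")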

      2. This is a very straightforward presentation. I kept it short for readability. It is a follow-up to the previous article (so that context applies).

        If you are saying it would be better if the graph “x-axis” said “data size in rows” instead of “data_size”, that is a valid point; I’ll make that edit.

        If you mean to say it is not of value to you (it doesn’t answer follow-up questions that you feel are important), that is quite different from it being “convoluted.”

        If you are implying there is a math error: I’ve re-checked a few times and asked others to check. microbenchmark in the configuration we used measures in nanoseconds and graphs in various units. We didn’t double divide. If there is some other math error, I’d be happy to re-plot; that is why we kept the results as “.rds”.
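
        For the record, the conversion in question is just nanoseconds to milliseconds and then a rows-per-time rate (a sketch; the file name timings.rds and the variable n_rows are hypothetical stand-ins for the saved results and the known row count):

        timings <- readRDS("timings.rds")      # saved microbenchmark results
        # microbenchmark records each run’s time in nanoseconds;
        # divide by 1e6 for milliseconds, then form a rows-per-millisecond rate
        rate_rows_per_ms <- n_rows / (timings$time / 1e6)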

        I honestly thought including data.table here was going to be harmless as data.table was going to be fastest. Maybe you feel I wasn’t careful enough with data.table; the reason I was quick with data.table is I honestly believed what everybody keeps telling me: data.table is so many orders of magnitude faster it saves you from having to over-tune.

        If you are asking why I am being so rude as to benchmark: I’ll answer that. I am frankly tired of apologizing for writing R code when using R.

        I kept the experiment to only 3 or 4 columns and tens to hundreds of thousands of rows. Many R users think in terms of rows and columns (both reported), and not bytes. Reader questions and concerns (and there will always be such) can be answered by looking into the complete backing materials (all code, all results machine readable, all graphs, all code to produce the graphs).

        I am using one unit (milliseconds) because I am running at a variety of scales (4 rows to about 4 million rows), and the range was pleasant in milliseconds. Of course that unit is natural for some runs (the 4 rows) and not for others (4 million rows). The graph would look the same no matter what units we used (as long as we use the same units for all runs).

        Yes these are small data sets. But for some unknown reason all but the data.table and the tuned R code take a long time on them (tens of seconds). And some of the methods (again not data.table) manage to get into memory contention on smaller machines (no idea why, and yes that is ridiculous).

        This is a trivial experiment yes, but it seems not a lot of people have done even that much lately.

        Here is one of the raw graphs from the even newer experiment (where I used the newer code suggested here, which I followed up on here). This graph is produced directly by microbenchmark; I didn’t pick the format or the units. I also felt the only way to plot multiple problem sizes on the same graph was to go to rates (otherwise most of the vertical play of the graph is lost to expressing that the experiment size is changing). I felt that running at a variety of sizes was one of the most critical follow-up questions, so that is why I emphasized that. The graph also shows why we stop at only around a million rows: some of the methods we are testing already take far too long at even this scale.

        data.table is fastest at large scales. That’s great; it is good to have such a tool. I included it in the comparison because it is important and fast. My feeling (and exactly what I said in the first article): one can take even getting near data.table performance as “being fast.”

      3. I’ve read my comment again and read yours again. I think I was clear and I was only making the points that I wrote, not any others. I was suggesting that GB size (not number of rows) be on the x-axis and that time be on the y-axis (not a rate). This way readers can more easily see if the range tested is relevant for them. For example, if they really are calling the query many times per second (e.g. as sometimes done in Shiny apps), then response time and overhead in milliseconds may be important, and there it is well expressed as a number of rows per second (or millisecond), because such users usually know the rates they have to manage. However, if the reader has 100GB of data and tends to run single queries that take a long time (e.g. minutes), but your chart only goes up to 10MB, then the reader knows it isn’t really relevant to them. Some users really don’t mind if their queries take over an hour, as they just run them overnight, but they do mind if it takes more than 8 hours because then it doesn’t run within one night. To know quickly and easily from the chart’s axes: i) scale of data size (MB/GB) and ii) time (ms/sec/min), is all I believe I wrote. So, yes, the plot you included in your last comment is better because the reader can now see the scale; i.e. tens of seconds vs under 1 second is now clear. Not many people care about the 2s vs 1s difference, so they’re pretty much the same. This chart is now much more helpful, without needing to click through to the code behind it. However, 2 million rows would be better stated as the data size in MB or GB so the reader can instantly know whether the data fits in cache (<10MB) or in RAM, and if so whether we’re talking laptop RAM or server RAM, for example.

  1. Your data.table solution is maybe a bit too “straightforward”. Not only do you have an issue with the ties in probability (which you have documented), but the code is also not very efficient.

    Still relatively straightforward, but a factor of 4 faster for 1e6 observations – and without the tie issues – is the following code:

    data.table_improved <- function() {
      dDT <- data.table::data.table(dLocal)
      dDT <- dDT[, list(diagnosis = surveyCategory,
                        probability = exp(assessmentTotal * scale) /
                          sum(exp(assessmentTotal * scale))),
                 subjectID]
      setorder(dDT, subjectID, probability, -diagnosis)
      dDT <- dDT[, .SD[.N], subjectID]
      setorder(dDT, subjectID)
    }
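
    A rough timing comparison along these lines (a sketch only; it assumes dLocal and scale are set up as in the linked code, and that the article’s straightforward solution is available as data.table_local()):

    microbenchmark::microbenchmark(
      original = data.table_local(),
      improved = data.table_improved(),
      times = 5L
    )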
    
      1. The solution I posted before followed the same straightforward approach as you had in the original solution (and the dplyr solutions), just to increase the efficiency of that line of thinking.

        If you optimise it further, for example by pre-calculating exp(assessmentTotal*scale) – as you have also done in your cframe solution – then speed increases further.

        A further improvement – which still uses only out-of-the-box techniques – is as follows:

        data.table_improved2 <- function() {
          dDT <- data.table::data.table(dLocal)
          setnames(dDT, "surveyCategory", "diagnosis")
          dDT[, expaTs := exp(assessmentTotal * scale)]
          # pre-calculate the grouped sum -> this uses gsum internally
          dDT[, sum_expaTs := sum(expaTs), subjectID]
          dDT[, probability := expaTs / sum_expaTs]
          dDT[, c("assessmentTotal", "expaTs", "sum_expaTs") := NULL]
          setorder(dDT, subjectID, -probability, diagnosis)
          dDT[, .SD[1], subjectID]
        }
        

        This runs faster for me than your cframe solution for nrep > 10000.

        Throughputs:

                 method/data_size 2e+04 2e+05 1e+06 2e+06 4e+06
             data.table_improved2  1606  2142  1888  1570  1332
        base R cframe calculation  2472  1788  1013   837   792
        
      2. Thanks. data.table is definitely starting to pull ahead at large scales. For the range of data sizes I originally plotted I reproduce the result as follows:

        Details of this work here.

  2. Very nice! There is an inbuilt constant `.N` you can use instead of adding `one` and calculating `sum`, analogous to `n()` in dplyr (see the sketch after the timing output below). `javrucebo`’s solution is faster than yours, but I noticed something else: your table has a lot of ‘irrelevant columns’. If you exclude these up front, it cuts the time down to 2s.

    dLocal <-
      data.table(assessmentTotal = sample.int(10, 1e6, replace = TRUE),
                 surveyCategory = sample(c('withdrawal behavior', 'positive re-framing'),
                                         size = 2e6,
                                         replace = TRUE),
                 subjectID = sample(1:1e5, size = 2e6, replace = TRUE))

    some_strings <- sample(letters, size = 1e6, replace = TRUE)
    for (irrelevant_col in 1:1000) {
      set(dLocal, j = paste0("V", irrelevant_col), value = some_strings)
    }

    scale <- 0.236
    data.table_local <- function() {
      dDT <- dLocal[, .(assessmentTotal, surveyCategory, subjectID)]
      dDT[
        , one := 1 ][
        , probability := exp(assessmentTotal * scale) /
            sum(exp(assessmentTotal * scale)), subjectID ][
        , count := sum(one), subjectID ][
        , rank := rank(probability), subjectID ][
        rank == count ][
        , diagnosis := surveyCategory ][
        , c('subjectID', 'diagnosis', 'probability') ][
        order(subjectID) ]
    }

    system.time(data.table_local())
    ##    user  system elapsed
    ##    2.28    0.02    2.30
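
    As mentioned above, the one/count bookkeeping can be replaced by data.table’s built-in `.N` (a sketch of just that substitution):

    # sketch: .N is the per-group row count, so this one grouped assignment
    # replaces one := 1 followed by count := sum(one)
    dDT[, count := .N, subjectID]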

    1. Thanks for your help. The examples used in this article purposely no longer had the irrelevant columns:

      From https://github.com/WinVector/rquery/blob/master/extras/QTimingFollowup/QTiming.md :

      nrep <- 10
      
      dLocal <- mkData(nrep)
      head(dLocal)
      ##   subjectID      surveyCategory assessmentTotal
      ## 1        s1 withdrawal behavior               7
      ## 2        s1 positive re-framing               1
      ## 3       s10 withdrawal behavior               7
      ## 4       s10 positive re-framing               3
      ## 5        s2 withdrawal behavior               0
      ## 6        s2 positive re-framing               2
      

      From https://github.com/WinVector/rquery/blob/master/extras/QTiming.md :

      head(dLocal)
      ##   subjectID      surveyCategory assessmentTotal
      ## 1       0_1 withdrawal behavior               5
      ## 2       0_1 positive re-framing               2
      ## 3       0_2 withdrawal behavior               3
      ## 4       0_2 positive re-framing               4
      ## 5       1_1 withdrawal behavior               5
      ## 6       1_1 positive re-framing               2
      

      I had already removed them, as they were supposed to only be part of the article on irrelevant columns. I most definitely did not want them in these measurements, as they might obscure the details we are interested in. They are not supposed to be in the articles comparing different systems.

  3. It’s certainly not surprising that it’s fast: after all, you are avoiding things which are slow in R and delegating the heavy lifting to optimised vector operations (which are implemented in C in the first place). dplyr will almost always be slower, since it is an abstract interface which is more difficult to translate to efficient R, but the point of dplyr is expressivity and ease of maintenance (I mean, compare the length and complexity of the base R and dplyr code).

    R is slow because its underlying implementation is sometimes woefully inefficient (and its interpreter is rather slow). It should really use data structures like immutable arrays, for instance. Large data structures coupled with R’s bulk copy-on-write mechanism can’t be efficient… Of course, for straightforward manipulation/aggregation it doesn’t matter much.
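
    A small illustration of the copy-on-write cost being referred to (a sketch; the exact tracemem() output varies by R version):

    x <- runif(1e7)
    tracemem(x)   # report whenever x gets duplicated
    y <- x        # no copy yet: x and y share the same memory
    x[1] <- 0     # modifying the shared vector forces a full copy of all 1e7 elements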

    1. This is close to my point: base R can be fast (fast relative to what is fast or slow within R itself). One achieves this by staying in a sub-dialect that has a good underlying implementation. The base R code is not complex:

      # base-R function sequencing across categories
      base_r_calculate_cframe <- function(d) {
        d$probability <- exp(d$assessmentTotal * scale)
        cframe <- build_cframe(d$subjectID, d$surveyCategory)
        selections <- grouped_arg_max(cframe, d$probability)
        totals <- grouped_sum(cframe, d$probability)
        res <- data.frame(subjectID = rownames(cframe),
                          diagnosis = d$surveyCategory[selections],
                          probability = d$probability[selections]/totals,
                          stringsAsFactors = FALSE)
        res
      }
      

      It delegates the work to three functions: build_cframe(), grouped_arg_max(), and grouped_sum(). If we take these functions as “known” (pretend they were already in R and had sufficient documentation, examples, and training), the code is fairly concise and straightforward. Then, as a plus, when one digs in one finds that all three of those functions have pure R implementations that are fast enough to compete with the tuned data manipulation packages.
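
      For intuition only (the real implementations are in the linked materials), here are pure base R sketches of the flavor of two such helpers; the names with the _sketch suffix are mine, not the article’s:

      # illustrative grouped sum: one vectorized call to rowsum()
      grouped_sum_sketch <- function(group, x) {
        rowsum(x, group)
      }

      # illustrative grouped arg-max: order so the largest x comes first within each group,
      # then keep the first (max) row index per group
      grouped_arg_max_sketch <- function(group, x) {
        o <- order(group, -x)
        o[!duplicated(group[o])]
      }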
