Menu Home

R Tip: Consider radix Sort

R tip: consider using radix sort.

The “method = "radix"” option can greatly speed up sorting and ordering tables in R.

For a 1 million row table the speedup is already as much as 35 times (around 9.6 seconds versus 3 tenths of a second). Below is an excerpt from an experiment sorting showing default settings and showing radix sort (full code here).


timings <- microbenchmark(
  order_default = d[order(d$col_a, d$col_b, d$col_c, d$col_x), , 
                    drop = FALSE],
  order_radix = d[order(d$col_a, d$col_b, d$col_c, d$col_x,
                        method = "radix"), ,
                  drop = FALSE],
  check = my_check,
  times = 10L)

print(timings)
## Unit: milliseconds
##           expr       min        lq      mean    median        uq
##  order_default 9531.2865 9653.6827 9759.8929 9690.6702 9833.2170
##    order_radix  262.1377  263.3226  278.2547  265.1452  274.2476
##         max neval
##  10329.3520    10
##    382.2544    10
Unnamed chunk 1 1

This speedup is possible because Matt Dowle and Arun Srinivasan of the data.table team generously ported their radix sorting code into base-R! Please see help(sort) for details. So data.table is not only the best data manipulation package in R, the team actually works to improve R itself. This is what is meant by "R community" and what is needed to keep R vibrant and alive.

Edit/Note: Iñaki Úcar shared at least 2 good points in a follow-up article: if you are using factors you get radix sort for free (for technical reasons I tend to delay/disable conversion to factors), and I didn’t mention the loss of control of collation order. Because of that I am changing the article title from “R tip: Use Radix Sort” to “R Tip: Consider radix Sort”.

Categories: Coding Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

5 replies

      1. Thanks for that. I assume ZJ has good intent, but after I while I too started to feel uncomfortable with “men” (Win-Vector LLC itself is a 50% women owned company).

        Like

  1. Thanks for this. I had a question, I thought that the default was “radix” at least for short vectors (less than 2^31 elements as “short”)

    Like

    1. I myself had not read the help carefully enough at first (and did not think about this point until you raised it, thanks!).

      It turns out: the default method selection (via match.arg()) is "auto" and "auto" will switch to "radix" for numerics, factors, and logicals ; but not for character types. So an explicit setting of “method = "radix"” is not a no-op as that avoids the following bit of code (from print(order)):

       if (method == "auto") {
          useRadix <- all(vapply(z, function(x) {
            (is.numeric(x) || is.factor(x) || is.logical(x)) &&
              is.integer(length(x))
          }, logical(1L)))
          method <- if (useRadix)
            "radix"
          else "shell"
        }
      

      I am guessing the reason is radix sort doesn’t obey the collating sequence of the locale, so using it on character types without asking the user could be an issue. For an application where you are using sorting to move identically keyed rows together this isn’t going to be a problem. For our example the two orders were the same for the keys we had as we did confirm identical results.

      Like

%d bloggers like this: