Menu Home

Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.

Let’s say we have some example data:

df <- data.frame(x = c(1, 2, 3, 4),
                 y = c(5, 6, 7, 8),
                 choice = c("x", "y", "x", "z"),
                 stringsAsFactors = FALSE)

Suppose our goal is to build a new column called “derived” that for each row chooses from the column named by the names in the column called “choice”. In our case the result would be “[1, 6, 3, NA]” (the last value being “NA” as “z” is not a column of df).

How does one do this?

In base-R we can use a matrix-index to specify the set of data cells we want (please see “help([)” for details):

dtmp <- df[, 
           intersect(df$choice, colnames(df)), 
           drop = FALSE]
dtmp[
  cbind(seq_len(nrow(df)),
        match(df$choice, colnames(dtmp)))]
# [1]  1  6  3 NA

data.table has some known notation and solutions for this (please see “help(.SD)” for details):

library("data.table")
dt <- as.data.table(df)

dt[, derived := .SD[[choice]], by = choice][]

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA

What we are doing is grouping by “choice”. In each group the choice column is a constant, and data.table supplies the name of the group as a scalar in the variable named choice. This lets us assign values from that column all at once (.SD, being data.table’s notation for the sub-data.table representing the group). Some discussion of the ideas can be found here.

What is going on: is we are reducing the choice column to a series of scalars- so we can then use the scalar column selection operators (such as “[[]]”) to access columns.

There is also a dplyr solution:

df %>% 
  mutate(c2 = ifelse(choice %in% colnames(df),
                     choice,
                     colnames(df)[[1]])) %>%
  group_by(c2) %>%
  mutate(derived = .data[[ c2[[1]] ]]) %>%
  ungroup() %>%
  mutate(derived = ifelse(choice %in% colnames(df),
                          derived,
                          NA)) %>%
  select(-c2)
# # A tibble: 4 x 4
#       x     y choice derived
#   <dbl> <dbl> <chr>    <dbl>
# 1     1     5 x            1
# 2     2     6 y            6
# 3     3     7 x            3
# 4     4     8 z           NA

It is essentially the same trick as the data.table solution: break the problem into groups where the choice is not varying (allowing us to use scalar column specifiers to do the work). There is some extra care to deal with mis-matches. Also, the “[[1]]” is converting the column of choice values (all of which are the same due to the grouping by choice) into a scalar. We call this trick a “pseudo aggregator”, and we often teach the technique in SQL (where it is often min() or max()).

We have wrapped base-R and data.table implementations into method a in rquery/rqdatatable here. It works like this:


library("rqdatatable")

lookup_by_column(df, "choice", "derived")

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA

Also, rquery can perform this operation on database tables! (In database terms this looks a lot like a variation of a controlled coalesce.)

These are some ideas that will eventually be expanded and make it into our theory of coordinatized data (please also see here) and Practical Data Science in R 2nd Edition.

Edit: I know nobody asked: but we now have timings!

Categories: Coding Tutorials

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

15 replies

    1. This won’t work if the column names aren’t valid names for R variables.

      df %>%
      rename("1x" = x, "1y" = y) %>%
      mutate(choice = paste0(1, choice)) %>%
      rowwise() %>%
      mutate(derived = ifelse(choice %in% colnames(df),
      eval(parse(text = choice)),
      NA))

      Also, I shudder whenever I see non-standard evaluation (eval, parse) inside unrelated non-standard evaluation (mutate, %>%). Not saying it’s a bad habit, but it’s one that’s bit me hard in the past.

      1. I try to avoid NSE unless I really need something from it that I can’t get any other way. I also consider setting the column names to something safe an important early data ingest step, so I have no problem with code that expects safe names.

  1. The function get0 could be useful here, we could use it to simplify George’s answer :

    df %>% rowwise() %>%
    mutate(derived = get0(choice, ifnotfound = NA))

    but unfortunately it’s not robust because z could exist in the global environment. get0 has an inherits Boolean argument that we can use here but environments are tricky under mutate, so better use pmap as it’s recommended over rowwise anyway.

    df$derived <-
    pmap_dbl(df, ~with(list(...), get0(choice, ifnotfound = NA,inherits = FALSE)))

    # or explicitly :

    df$derived <-
    pmap_dbl(df, function(x,y,choice) get0(choice, ifnotfound = NA,inherits = FALSE))

    For an idiomatic tidyverse way I would do:

    df <- rowid_to_column(df)
    df %>%
    gather(,derived,x,y) %>%
    filter(choice==key) %>%
    left_join(df3,.) %>%
    select(-key, -rowid)

    if there was not this annoying NA we could just do

  2. I’m sure there’s better ways but my thought with dplyr was to treat it as a lookup problem with joins like so:

    df %>%
        mutate(id = row_number()) %>%
        {
             x <- .
    
             left_join(x, 
             x %>%
                 select(-choice) %>%
                 tidyr::gather(choice, outcome, x:y),
             by = c('choice', 'id')
         )
    
    }
    
  3. This is my dplyr solution.

    df %>%
    rowwise() %>%
    mutate(derived = c(x, y)[match(choice, colnames(.))])

      1. Actually it is interesting to look at what comes out with the rowwise() removed. I get the following.

        It is wrong, but I am unclear where it gets the 2 from.

        library("dplyr")
        packageVersion("dplyr")
        # [1] ‘0.7.6’
        
        df <- data.frame(x = c(1, 2, 3, 4),
                         y = c(5, 6, 7, 8),
                         choice = c("x", "y", "x", "z"),
                         stringsAsFactors = FALSE)
        df %>%
          mutate(derived = c(x, y)[match(choice, colnames(.))])
        
        #   x y choice derived
        # 1 1 5      x       1
        # 2 2 6      y       2
        # 3 3 7      x       1
        # 4 4 8      z      NA
        
%d bloggers like this: