# Using a Column as a Column Index

We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.

Let’s say we have some example data:

```df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
```

Suppose our goal is to build a new column called “derived” that for each row chooses from the column named by the names in the column called “choice”. In our case the result would be “[1, 6, 3, NA]” (the last value being “NA” as “z” is not a column of df).

How does one do this?

In base-R we can use a matrix-index to specify the set of data cells we want (please see “help(`[`)” for details):

```dtmp <- df[,
intersect(df\$choice, colnames(df)),
drop = FALSE]
dtmp[
cbind(seq_len(nrow(df)),
match(df\$choice, colnames(dtmp)))]
#   1  6  3 NA
```

data.table has some known notation and solutions for this (please see “help(.SD)” for details):

```library("data.table")
dt <- as.data.table(df)

dt[, derived := .SD[[choice]], by = choice][]

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA
```

What we are doing is grouping by “choice”. In each group the choice column is a constant, and data.table supplies the name of the group as a scalar in the variable named choice. This lets us assign values from that column all at once (.SD, being data.table’s notation for the sub-data.table representing the group). Some discussion of the ideas can be found here.

What is going on: is we are reducing the choice column to a series of scalars- so we can then use the scalar column selection operators (such as “[[]]”) to access columns.

There is also a dplyr solution:

```df %>%
mutate(c2 = ifelse(choice %in% colnames(df),
choice,
colnames(df)[])) %>%
group_by(c2) %>%
mutate(derived = .data[[ c2[] ]]) %>%
ungroup() %>%
mutate(derived = ifelse(choice %in% colnames(df),
derived,
NA)) %>%
select(-c2)
# # A tibble: 4 x 4
#       x     y choice derived
#   <dbl> <dbl> <chr>    <dbl>
# 1     1     5 x            1
# 2     2     6 y            6
# 3     3     7 x            3
# 4     4     8 z           NA
```

It is essentially the same trick as the data.table solution: break the problem into groups where the choice is not varying (allowing us to use scalar column specifiers to do the work). There is some extra care to deal with mis-matches. Also, the “[]” is converting the column of choice values (all of which are the same due to the grouping by choice) into a scalar. We call this trick a “pseudo aggregator”, and we often teach the technique in SQL (where it is often min() or max()).

We have wrapped base-R and data.table implementations into method a in rquery/rqdatatable here. It works like this:

```
library("rqdatatable")

lookup_by_column(df, "choice", "derived")

#    x y choice derived
# 1: 1 5      x       1
# 2: 2 6      y       6
# 3: 3 7      x       3
# 4: 4 8      z      NA
```

Also, rquery can perform this operation on database tables! (In database terms this looks a lot like a variation of a controlled coalesce.)

These are some ideas that will eventually be expanded and make it into our theory of coordinatized data (please also see here) and Practical Data Science in R 2nd Edition.

Edit: I know nobody asked: but we now have timings!

Categories: Coding Tutorials ### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 15 replies ›

1. George Moroz says:

My dplyr solution will be:

```df %>%
rowwise() %>%
mutate(derived = ifelse(choice %in% colnames(df),
eval(parse(text = choice)),
NA))
```
1. Nathan says:

This won’t work if the column names aren’t valid names for R variables.

```df %>% rename("1x" = x, "1y" = y) %>% mutate(choice = paste0(1, choice)) %>% rowwise() %>% mutate(derived = ifelse(choice %in% colnames(df), eval(parse(text = choice)), NA)) ```

Also, I shudder whenever I see non-standard evaluation (`eval`, `parse`) inside unrelated non-standard evaluation (`mutate`, `%>%`). Not saying it’s a bad habit, but it’s one that’s bit me hard in the past.

1. John Mount says:

I try to avoid NSE unless I really need something from it that I can’t get any other way. I also consider setting the column names to something safe an important early data ingest step, so I have no problem with code that expects safe names.

2. Antoine Fabri says:

The function `get0` could be useful here, we could use it to simplify George’s answer :

```df %>% rowwise() %>% mutate(derived = get0(choice, ifnotfound = NA)) ```

but unfortunately it’s not robust because `z` could exist in the global environment. `get0` has an `inherits` Boolean argument that we can use here but environments are tricky under mutate, so better use `pmap` as it’s recommended over `rowwise` anyway.

```df\$derived <- pmap_dbl(df, ~with(list(...), get0(choice, ifnotfound = NA,inherits = FALSE)))```

``` # or explicitly : ```

```df\$derived <- pmap_dbl(df, function(x,y,choice) get0(choice, ifnotfound = NA,inherits = FALSE)) ```

For an idiomatic `tidyverse` way I would do:

```df <- rowid_to_column(df) df %>% gather(,derived,x,y) %>% filter(choice==key) %>% left_join(df3,.) %>% select(-key, -rowid) ```

if there was not this annoying `NA` we could just do

1. Antoine Fabri says:

ooops sorry for last of first post, I submitted too fast :)

1. Antoine Fabri says:

And not df3 but df…. sorry for the cluttered comments

2. John Mount says:

Not your fault, the comment system is awful and causes a lot of problems. Thanks for your points!

3. Antoine says:

Thanks to you, I’m enjoying this blog a lot

3. Antoine Fabri says:

In base R that would be:

```df\$derived <- do.call(mapply,c(df,FUN = function(...) with(list(...),get0(choice,ifnotfound = NA, inherits = FALSE)))) ```

4. Tyler Rinker says:

I’m sure there’s better ways but my thought with dplyr was to treat it as a lookup problem with joins like so:

```df %>%
mutate(id = row_number()) %>%
{
x <- .

left_join(x,
x %>%
select(-choice) %>%
tidyr::gather(choice, outcome, x:y),
by = c('choice', 'id')
)

}
```
1. John Mount says:

Looks like the comment system damaged your code (sorry). But this looks like the type of solution people have been discussing.

5. Erich Studerus says:

This is my dplyr solution.

df %>%
rowwise() %>%
mutate(derived = c(x, y)[match(choice, colnames(.))])

1. Erich Studerus says:

Actually, the rowwise is not even necessary for my solution.

1. Erich Studerus says:

Oops. Sorry, it is necessary. Ignore my previous comment

2. John Mount says:

Actually it is interesting to look at what comes out with the `rowwise()` removed. I get the following.

It is wrong, but I am unclear where it gets the 2 from.

```library("dplyr")
packageVersion("dplyr")
#  ‘0.7.6’

df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
choice = c("x", "y", "x", "z"),
stringsAsFactors = FALSE)
df %>%
mutate(derived = c(x, y)[match(choice, colnames(.))])

#   x y choice derived
# 1 1 5      x       1
# 2 2 6      y       2
# 3 3 7      x       1
# 4 4 8      z      NA
```