We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
Let’s say we have some example data:
df <- data.frame(x = c(1, 2, 3, 4), y = c(5, 6, 7, 8), choice = c("x", "y", "x", "z"), stringsAsFactors = FALSE)
Suppose our goal is to build a new column called “derived” that for each row chooses from the column named by the names in the column called “choice”. In our case the result would be “[1, 6, 3, NA]” (the last value being “NA” as “z” is not a column of df).
How does one do this?
In base-R we can use a matrix-index to specify the set of data cells we want (please see “help([
)” for details):
dtmp <- df[, intersect(df$choice, colnames(df)), drop = FALSE] dtmp[ cbind(seq_len(nrow(df)), match(df$choice, colnames(dtmp)))] # [1] 1 6 3 NA
data.table has some known notation and solutions for this (please see “help(.SD)” for details):
library("data.table") dt <- as.data.table(df) dt[, derived := .SD[[choice]], by = choice][] # x y choice derived # 1: 1 5 x 1 # 2: 2 6 y 6 # 3: 3 7 x 3 # 4: 4 8 z NA
What we are doing is grouping by “choice”. In each group the choice column is a constant, and data.table supplies the name of the group as a scalar in the variable named choice. This lets us assign values from that column all at once (.SD, being data.table’s notation for the sub-data.table representing the group). Some discussion of the ideas can be found here.
What is going on: is we are reducing the choice column to a series of scalars- so we can then use the scalar column selection operators (such as “[[]]”) to access columns.
There is also a dplyr solution:
df %>% mutate(c2 = ifelse(choice %in% colnames(df), choice, colnames(df)[[1]])) %>% group_by(c2) %>% mutate(derived = .data[[ c2[[1]] ]]) %>% ungroup() %>% mutate(derived = ifelse(choice %in% colnames(df), derived, NA)) %>% select(-c2) # # A tibble: 4 x 4 # x y choice derived # <dbl> <dbl> <chr> <dbl> # 1 1 5 x 1 # 2 2 6 y 6 # 3 3 7 x 3 # 4 4 8 z NA
It is essentially the same trick as the data.table solution: break the problem into groups where the choice is not varying (allowing us to use scalar column specifiers to do the work). There is some extra care to deal with mis-matches. Also, the “[[1]]” is converting the column of choice values (all of which are the same due to the grouping by choice) into a scalar. We call this trick a “pseudo aggregator”, and we often teach the technique in SQL (where it is often min() or max()).
We have wrapped base-R and data.table implementations into method a in rquery/rqdatatable here. It works like this:
library("rqdatatable") lookup_by_column(df, "choice", "derived") # x y choice derived # 1: 1 5 x 1 # 2: 2 6 y 6 # 3: 3 7 x 3 # 4: 4 8 z NA
Also, rquery can perform this operation on database tables! (In database terms this looks a lot like a variation of a controlled coalesce.)
These are some ideas that will eventually be expanded and make it into our theory of coordinatized data (please also see here) and Practical Data Science in R 2nd Edition.
Edit: I know nobody asked: but we now have timings!
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
My dplyr solution will be:
This won’t work if the column names aren’t valid names for R variables.
df %>%
rename("1x" = x, "1y" = y) %>%
mutate(choice = paste0(1, choice)) %>%
rowwise() %>%
mutate(derived = ifelse(choice %in% colnames(df),
eval(parse(text = choice)),
NA))
Also, I shudder whenever I see non-standard evaluation (
eval
,parse
) inside unrelated non-standard evaluation (mutate
,%>%
). Not saying it’s a bad habit, but it’s one that’s bit me hard in the past.I try to avoid NSE unless I really need something from it that I can’t get any other way. I also consider setting the column names to something safe an important early data ingest step, so I have no problem with code that expects safe names.
The function
get0
could be useful here, we could use it to simplify George’s answer :df %>% rowwise() %>%
mutate(derived = get0(choice, ifnotfound = NA))
but unfortunately it’s not robust because
z
could exist in the global environment.get0
has aninherits
Boolean argument that we can use here but environments are tricky under mutate, so better usepmap
as it’s recommended overrowwise
anyway.df$derived <-
pmap_dbl(df, ~with(list(...), get0(choice, ifnotfound = NA,inherits = FALSE)))
# or explicitly :
df$derived <-
pmap_dbl(df, function(x,y,choice) get0(choice, ifnotfound = NA,inherits = FALSE))
For an idiomatic
tidyverse
way I would do:df <- rowid_to_column(df)
df %>%
gather(,derived,x,y) %>%
filter(choice==key) %>%
left_join(df3,.) %>%
select(-key, -rowid)
if there was not this annoying
NA
we could just doooops sorry for last of first post, I submitted too fast :)
And not df3 but df…. sorry for the cluttered comments
Not your fault, the comment system is awful and causes a lot of problems. Thanks for your points!
Thanks to you, I’m enjoying this blog a lot
In base R that would be:
df$derived <-
do.call(mapply,c(df,FUN = function(...)
with(list(...),get0(choice,ifnotfound = NA, inherits = FALSE))))
I’m sure there’s better ways but my thought with dplyr was to treat it as a lookup problem with joins like so:
Looks like the comment system damaged your code (sorry). But this looks like the type of solution people have been discussing.
This is my dplyr solution.
df %>%
rowwise() %>%
mutate(derived = c(x, y)[match(choice, colnames(.))])
Actually, the rowwise is not even necessary for my solution.
Oops. Sorry, it is necessary. Ignore my previous comment
Actually it is interesting to look at what comes out with the
rowwise()
removed. I get the following.It is wrong, but I am unclear where it gets the 2 from.