R is a powerful data science language because, like Matlab, numpy, and Pandas, it exposes vectorized operations. That is, a user can perform operations on hundreds (or even billions) of cells by merely specifying the operation on the column or vector of values.
Of course, sometimes it takes a while to figure out how to do this. Please read for a great R matrix lookup problem and solution.
In R we can specify operations over vectors. For arithmetic this is easy, but some more complex operations you “need to know the trick.”
Patrick Freeman (@PTFreeman) recently asked: what is the idiomatic way to look up a bunch of values from a matrix by row and column keys? This is actually easy to do if we first expand the data matrix into RDF-triples. If our data were in this format we could merge/join it against our desired column indices.
Let’s start with an example data matrix.
# example matrix data
m <- matrix(1:9, nrow = 3)
row.names(m) <- c('R1' ,'R2', 'R3')
colnames(m) <- c('C1', 'C2', 'C3')
knitr::kable(m)
C1 | C2 | C3 | |
---|---|---|---|
R1 | 1 | 4 | 7 |
R2 | 2 | 5 | 8 |
R3 | 3 | 6 | 9 |
And our data-frame containing the indices we want to look-up.
# row/columns we want
w <- data.frame(
i = c('R1', 'R2', 'R2'),
j = c('C2', 'C3', 'C2'))
knitr::kable(w)
i | j |
---|---|
R1 | C2 |
R2 | C3 |
R2 | C2 |
That is: we want to know the matrix values from [R1, C2], [R2, C3], and [R2, C2].
The trick is: how do we convert the matrix into triples? digEmAll, has a great solution to that here.
# unpack into 3-column format from:
# https://stackoverflow.com/a/9913601
triples <- data.frame(
i = rep(row.names(m), ncol(m)),
j = rep(colnames(m), each = nrow(m)),
v = as.vector(m))
knitr::kable(triples)
i | j | v |
---|---|---|
R1 | C1 | 1 |
R2 | C1 | 2 |
R3 | C1 | 3 |
R1 | C2 | 4 |
R2 | C2 | 5 |
R3 | C2 | 6 |
R1 | C3 | 7 |
R2 | C3 | 8 |
R3 | C3 | 9 |
What the above code has done is: write each entry of the original matrix as a separate row with the original row and column ids landed as new columns. This data format is very useful.
The above code is worth saving as a re-usable snippet, as getting it right is a clever step.
Now we can vectorize our lookup using the merge command, which produces a new joined table where the desired values have been landed as a new column.
res <- merge(w, triples, by = c('i', 'j'), sort = FALSE)
knitr::kable(res)
i | j | v |
---|---|---|
R1 | C2 | 4 |
R2 | C3 | 8 |
R2 | C2 | 5 |
And that is it: we have used vectorized and relational concepts to look up many values from a matrix very quickly.
Categories: Tutorials
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
You can also use the special case of subsetting an array by another array:
m[as.matrix(w)]
# [1] 4 8 5
This is the syntax that inspired the subsetting syntax for the data.table package (https://rdatatable.gitlab.io/data.table/reference/data.table.html).
That is really neat. Also I see it preserves row order better than my merge solution (when I forgot the
sort = FALSE
option, now fixed). Thanks!