Another R tip. Get in the habit of using drop = FALSE when indexing (using [ , ] on) data.frames.
Prince Rupert’s drops (img: Wikimedia Commons)
In R, single column data.frames are often converted to vectors when manipulated. For example:
d <- data.frame(x = seq_len(3))
print(d)
#>   x
#> 1 1
#> 2 2
#> 3 3

# not a data frame!
d[order(-d$x), ]
#> [1] 3 2 1
We were merely trying to re-order the rows, and the result was converted to a vector. This happened because the rules for [ , ] change if there is only one result column. This happens even if there had been only one input column to begin with. Another example: d[,] is also a vector in this case.
The issue is: if we are writing re-usable code, we are often programming before we know the complete contents of a variable or argument. For a data.frame named “g” supplied as an argument, g[vec, ] can be a data.frame or a vector (or even possibly a list). However, we do know that if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame (assuming vec is a vector of valid row indices or a logical vector; note: NA induces some special cases).
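A minimal sketch of the re-usable code situation described above. The helper names select_rows_unsafe() and select_rows_safe() and the example data are hypothetical, chosen only for illustration:

```r
# Hypothetical helpers: keep only the rows of g selected by vec.
select_rows_unsafe <- function(g, vec) {
  g[vec, ]                 # may drop to a vector when g has one column
}
select_rows_safe <- function(g, vec) {
  g[vec, , drop = FALSE]   # stays a data.frame when g is a data.frame
}

one_col <- data.frame(x = c(10, 20, 30))
two_col <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"))
keep <- c(TRUE, FALSE, TRUE)

is.data.frame(select_rows_unsafe(two_col, keep))  # TRUE
is.data.frame(select_rows_unsafe(one_col, keep))  # FALSE
is.data.frame(select_rows_safe(one_col, keep))    # TRUE

# An NA index does not change the result class,
# but it does add a row of NA values.
with_na <- select_rows_safe(one_col, c(1, NA))
is.data.frame(with_na)  # TRUE
nrow(with_na)           # 2
```

The same function body gives different result classes depending only on how many columns the caller happened to pass in, which is the kind of surprise drop = FALSE is meant to rule out.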
We care as vectors and data.frames have different semantics, so are not fully substitutable in later code.
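As a small illustration of those differing semantics (the variable names as_frame and as_vector are made up for this sketch), common follow-on operations behave differently on the two results:

```r
d <- data.frame(x = seq_len(3))

as_frame  <- d[order(-d$x), , drop = FALSE]
as_vector <- d[order(-d$x), ]

nrow(as_frame)    # 3
nrow(as_vector)   # NULL: a plain vector has no dim attribute
names(as_frame)   # "x"
names(as_vector)  # NULL: the column name is gone

# Column access by name works on the data.frame but is an error on the vector.
as_frame$x
res <- tryCatch(as_vector$x, error = function(e) "error")
res               # "error"
```

Code such as seq_len(nrow(result)) or result$x written with a data.frame in mind will therefore fail, or quietly misbehave, when it is handed the vector.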
The fix is to include drop = FALSE as a third argument to [ , ].
# is a data frame.
d[order(-d$x), , drop = FALSE]
#>   x
#> 3 3
#> 2 2
#> 1 1
To pull out a column I suggest using one of the many good extraction notations (all using the fact a data.frame is officially a list of columns):
d[["x"]]
#> [1] 1 2 3

d$x
#> [1] 1 2 3

d[[1]]
#> [1] 1 2 3
My overall advice is: get in the habit of including drop = FALSE when working with [ , ] and data.frames. I say do this even when it is obvious that the result does in fact have more than one column.
For example, write “mtcars[, c("mpg", "cyl"), drop = FALSE]” instead of “mtcars[, c("mpg", "cyl")]”. It is clear that for data.frames both forms should work the same (either selecting a data frame with two columns, or throwing an error if we have mentioned a non-existent column). But the longer drop = FALSE form is safer (it goes further towards ensuring type-stable code) and, more importantly, documents intent (that you wanted a data.frame result).
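A quick check of this claim on the built-in mtcars data. The variable cols is a hypothetical stand-in for a column selection that might arrive as an argument:

```r
cols <- c("mpg", "cyl")

# With two columns, the two forms give identical results.
same_two <- identical(mtcars[, cols], mtcars[, cols, drop = FALSE])
same_two                                      # TRUE

# If the selection is ever reduced to a single column,
# only the drop = FALSE form still returns a data.frame.
cols <- "mpg"
is.data.frame(mtcars[, cols])                 # FALSE
is.data.frame(mtcars[, cols, drop = FALSE])   # TRUE
```

So the extra argument costs nothing in the two-column case and keeps the code working if the column list later shrinks.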
One can also try base::subset(), as it has non-dropping defaults.
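A brief sketch of the base::subset() alternative, using the same small d as above and the built-in mtcars data (the result names rows_kept and one_column are made up here):

```r
d <- data.frame(x = seq_len(3))

# Row selection on a single column data.frame: still a data.frame.
rows_kept <- subset(d, x >= 2)
is.data.frame(rows_kept)    # TRUE
nrow(rows_kept)             # 2

# Selecting a single column: also still a data.frame.
one_column <- subset(mtcars, select = mpg)
is.data.frame(one_column)   # TRUE
```

One caveat: the help page for subset() describes it as a convenience function intended for interactive use, since its arguments are evaluated non-standardly, so inside functions the explicit [ , , drop = FALSE] form may be the more predictable choice.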
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
R (like it or hate it) is a very irregular language (things change a lot based on context, and there are many corner cases). Most of these tips are trying to show how to try to retreat to a more regular subset of the language. This is for safety and later readability, at some sacrifice of convenience. Often there will be exceptions that defeat even the attempt to retreat. Even in these cases the suggestions are no worse than the more common notations, so they really do not fully fail.

For pedants. R has two notions of what is generically called object type: class() and typeof(). The above is unstable in both senses, as the class changes from data.frame to integer and the type changes from list (remember data.frames are lists of columns) to integer. Obviously there is a strong relation between vectors and lists in R, but they do not behave the same.

Overall the tips are attempting to be short and clear, whereas this comment is attempting to be more complete (at a great cost to length and clarity).
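For anyone who wants to see both notions of type change at once, here is a short check using the d from the post (the names dropped and kept are only for this sketch):

```r
d <- data.frame(x = seq_len(3))

dropped <- d[order(-d$x), ]
kept    <- d[order(-d$x), , drop = FALSE]

class(d)         # "data.frame"
typeof(d)        # "list"

class(dropped)   # "integer"
typeof(dropped)  # "integer"

class(kept)      # "data.frame"
typeof(kept)     # "list"
```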
The Prince Rupert’s drop image is supposed to evoke a feeling of a system under pressure experiencing a catastrophic phase-change, or failure, when you break one small thing (the tail of the drop, or the expected object class).
This is probably the most useful bit of R code I have learned all year. Thank you for that!
I’d like to point out that tibbles behave with drop = FALSE by default (also stringsAsFactors = FALSE by default). This behavior prevented some bugs in my code. Furthermore, you can’t specify drop = TRUE, you have to use one of the alternative notations.

tibbles have some nice features (such as implicit drop = FALSE, stringsAsFactors = FALSE, and smaller row limits on printing). But there are some costs: idiosyncratic pillar number formatting (which seems to be still re-debating basic issues of significant digits, and rounding), no row names (a matter of opinion, but a problem if you actually needed such), and direct C implementation (so you are depending on both the R interpreter and the tibble C to be safe, correct, and in sync with each other).

When doing a subset of columns from a data frame, I tend to prefer the mtcars[c("am", "mpg")] approach (notice the lack of [,] notation). This circumvents the need for drop = FALSE, but is only applicable when subsetting only on columns.

If subsetting on rows, this is a very sound tip.
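A small check of the single-bracket, list-style column selection on the built-in mtcars data, compared with the [ , ] form (the result names are made up for this sketch):

```r
two_cols <- mtcars[c("am", "mpg")]   # list-style selection, two columns
one_col  <- mtcars["mpg"]            # list-style selection, one column
dropped  <- mtcars[, "mpg"]          # matrix-style selection, one column

is.data.frame(two_cols)   # TRUE
is.data.frame(one_col)    # TRUE: no drop = FALSE needed
is.data.frame(dropped)    # FALSE: default drop applies
```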
That is darn good advice, thanks!
I had to try this:
Sure enough, the second statement gives you a vector. A very ugly corner case, and one that I know has caused bugs even for seasoned “full-time” R programmers.
I apologize, WordPress mangled your comment a bit (and I tried to fix it). I assume you mean this in the context of something like df <- data.frame(x = 1:10).