
R Tip: Use drop = FALSE with data.frames

Another R tip. Get in the habit of using drop = FALSE when indexing (using [ , ] on) data.frames.


Prince Rupert’s drops (img: Wikimedia Commons)

In R, single column data.frames are often converted to vectors when manipulated. For example:

d <- data.frame(x = seq_len(3))
print(d)
#>   x
#> 1 1
#> 2 2
#> 3 3
# not a data frame!
d[order(-d$x), ]
#> [1] 3 2 1

We were merely trying to re-order the rows, and the result was converted to a vector. This happened because the rules for [ , ] change if there is only one result column. This happens even if there had been only one input column. Another example: d[,] is also a vector in this case.
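A minimal sketch of the rule in action. The two-column data.frame d2 is a hypothetical addition for contrast and is not part of the original example:

```r
d <- data.frame(x = seq_len(3))
# d2 is a hypothetical two-column frame, added only for contrast
d2 <- data.frame(x = seq_len(3), y = c("a", "b", "c"))

# one input column: even a "select everything" index drops to a vector
class(d[, ])
#> [1] "integer"

# two input columns, two result columns: stays a data.frame
class(d2[order(-d2$x), ])
#> [1] "data.frame"

# two input columns, one result column: drops to a vector
class(d2[order(-d2$x), "x"])
#> [1] "integer"
```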

The issue is: when writing re-usable code we are often programming before we know the complete contents of a variable or argument. For a data.frame named “g” supplied as an argument, g[vec, ] can be a data.frame or a vector (or possibly even a list). However, we do know that if g is a data.frame then g[vec, , drop = FALSE] is also a data.frame (assuming vec is a vector of valid row indices or a logical vector; note that NA induces some special cases).
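As a sketch of what this looks like in re-usable code, here is a small helper. The name select_rows and the two test frames are hypothetical, chosen only for illustration:

```r
# Hypothetical helper (not from the original post): keep the rows selected by vec.
select_rows <- function(g, vec) {
  g[vec, , drop = FALSE]
}

one_col <- data.frame(x = seq_len(3))
two_col <- data.frame(x = seq_len(3), y = c("a", "b", "c"))

# without drop = FALSE the result class depends on the number of columns
class(one_col[c(3, 1), ])
#> [1] "integer"
class(two_col[c(3, 1), ])
#> [1] "data.frame"

# with drop = FALSE both calls return a data.frame
class(select_rows(one_col, c(3, 1)))
#> [1] "data.frame"
class(select_rows(two_col, c(3, 1)))
#> [1] "data.frame"
```

The caller of select_rows() never has to check how many columns the argument had.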

We care because vectors and data.frames have different semantics, so they are not fully substitutable in later code.
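A few of the differences, sketched with the same d as above. The variable names v and f are only illustrative:

```r
d <- data.frame(x = seq_len(3))
v <- d[order(-d$x), ]                 # silently a vector
f <- d[order(-d$x), , drop = FALSE]   # still a data.frame

# nrow() is only meaningful for the data.frame
nrow(f)
#> [1] 3
nrow(v)
#> NULL

# length() counts columns for a data.frame, elements for a vector
length(f)
#> [1] 1
length(v)
#> [1] 3

# `$` works on the data.frame but is an error on the atomic vector
f$x
#> [1] 3 2 1
inherits(try(v$x, silent = TRUE), "try-error")
#> [1] TRUE
```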

The fix is to include drop = FALSE as a third argument to [ , ].

# is a data frame.
d[order(-d$x), , drop = FALSE]
#>   x
#> 3 3
#> 2 2
#> 1 1

To pull out a column I suggest using one of the many good extraction notations (all using the fact a data.frame is officially a list of columns):

d[["x"]]
#> [1] 1 2 3

d$x
#> [1] 1 2 3

d[[1]]
#> [1] 1 2 3

My overall advice is: get in the habit of including drop = FALSE when working with [ , ] and data.frames. I say do this even when it is obvious that the result does in fact have more than one column.

For example, write “mtcars[, c("mpg", "cyl"), drop = FALSE]” instead of “mtcars[, c("mpg", "cyl")]”. It is clear that for data.frames both forms should work the same (either selecting a data frame with two columns, or throwing an error if we have mentioned a non-existent column). But the longer drop = FALSE form is safer (it goes further towards ensuring type-stable code) and, more importantly, documents intent (that you wanted a data.frame result).
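A quick check of that claim, as a sketch. The variables two and one, and the column name no_such_column, are hypothetical and stand in for a selection that is only known at run time:

```r
two <- c("mpg", "cyl")
identical(mtcars[, two, drop = FALSE], mtcars[, two])
#> [1] TRUE

# the two forms diverge as soon as the selection shrinks to one column
one <- "mpg"
class(mtcars[, one, drop = FALSE])
#> [1] "data.frame"
class(mtcars[, one])
#> [1] "numeric"

# a non-existent column is an error in either form
bad <- c("mpg", "no_such_column")
inherits(try(mtcars[, bad, drop = FALSE], silent = TRUE), "try-error")
#> [1] TRUE
inherits(try(mtcars[, bad], silent = TRUE), "try-error")
#> [1] TRUE
```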

One can also try base::subset(), as it has non-dropping defaults.
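A minimal sketch with the single-column d from earlier. Note that subset() evaluates its arguments non-standardly, so this form is best suited to interactive work:

```r
d <- data.frame(x = seq_len(3))

# row filtering keeps the data.frame, even with a single column
subset(d, x >= 2)
#>   x
#> 2 2
#> 3 3

# so does selecting a single column
class(subset(d, select = x))
#> [1] "data.frame"
```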

Categories: Coding Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

8 replies

  1. R (like it or hate it) is a very irregular language (things change a lot based on context, and there are many corner cases). Most of these tips are trying to show how to try to retreat to a more regular subset of the language. This is for safety and later readability, at some sacrifice of convenience. Often there will be exceptions that defeat even the attempt to retreat. Even in these cases the suggestions are no worse than the more common notations, so they really do not fully fail.

    For pedants. R has two notions of what is generically called object type: class() and typeof(). The above is unstable in both senses as the class changes from data.frame to integer and the type changes from list (remember data.frames are lists of columns) to integer. Obviously there is a strong relation between vectors and lists in R, but they do not behave the same.

    Overall the tips are attempting to be short and clear, whereas this comment is attempting to be more complete (at a great cost to length and clarity).

    The Prince Rupert’s drop image is supposed to evoke the feeling of a system under pressure experiencing a catastrophic phase-change, or failure, when you break one small thing (the tail of the drop, or the expected object class).

    1. This is probably the most useful bit of R code I have learned all year. Thank you for that!

  2. I’d like to point out that tibbles behave with drop = FALSE by default (also stringsAsFactors = FALSE by default). This behavior prevented some bugs in my code. Furthermore, you can’t specify drop = TRUE; you have to use one of the alternative notations.

    1. tibbles have some nice features (such as implicit drop = FALSE, stringsAsFactors = FALSE, and smaller row limits on printing). But there are some costs: idiosyncratic pillar number formatting (which seems to be still re-debating basic issues of significant digits, and rounding), no row names (a matter of opinion, but a problem if you actually needed such), and direct C implementation (so you are depending on both the R interpreter and the tibble C to be safe, correct, and in sync with each other).

      > data.frame(x = 10^(-1:2))
            x
      1   0.1
      2   1.0
      3  10.0
      4 100.0
      > tibble:::tibble(x = 10^(-1:2))
      # A tibble: 4 x 1
              x
          <dbl>
      1   0.100
      2   1.00 
      3  10.0  
      4 100    
      
  3. When doing a subset of columns from a data frame, I tend to prefer the mtcars[c("am", "mpg")] approach (notice the lack of [,] notation). This circumvents the need for drop = FALSE, but is only applicable when subsetting only on columns.

    If subsetting on rows, this is a very sound tip.

      1. I had to try this:

        df[5,]
        

        Sure enough, the second statement gives you a vector. A very ugly corner case, and one that I know has caused bugs even for seasoned “full-time” R programmers.

      2. I apologize, WordPress mangled your comment a bit (and I tried to fix it). I assume you mean this in the context of something like df <- data.frame(x = 1:10).
