
R Tip: Use stringsAsFactors = FALSE

R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.

[Image: Sigmund Freud, photograph by Max Halberstadt (cropped)]

It is often claimed Sigmund Freud said “Sometimes a cigar is just a cigar.”

To avoid problems, delay the re-encoding of strings by passing stringsAsFactors = FALSE when creating data.frames.

Example (with the data.frame() default of stringsAsFactors = TRUE, as in R versions before 4.0.0):

d <- data.frame(label = rep("tbd", 5))

d$label[[2]] <- "north"
#> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
#> invalid factor level, NA generated

print(d)
#>   label
#> 1   tbd
#> 2  <NA>
#> 3   tbd
#> 4   tbd
#> 5   tbd

Notice our new value was not copied in! "north" is not one of the factor's existing levels, so R stored NA instead.
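If you did want to keep the column as a factor, base R expects the new level to be declared before the assignment. The following is a minimal sketch of that workaround, not a recommendation; stringsAsFactors = TRUE is passed explicitly here (an addition to the example above) so the behavior does not depend on the R version's default.

```r
# Build the factor column explicitly so the result does not depend on the R version.
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = TRUE)

# The column is a factor whose only level is "tbd".
levels(d$label)
#> [1] "tbd"

# Declare the new level first, then the assignment succeeds without a warning.
levels(d$label) <- c(levels(d$label), "north")
d$label[[2]] <- "north"

as.character(d$label)
```

This is more bookkeeping than most string columns deserve, which is why delaying the conversion is usually simpler.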

The fix is easy: use stringsAsFactors = FALSE.

d <- data.frame(label = rep("tbd", 5),
                stringsAsFactors = FALSE)

d$label[[2]] <- "north"

print(d)
#>   label
#> 1   tbd
#> 2 north
#> 3   tbd
#> 4   tbd
#> 5   tbd
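A quick way to see the difference is to check the column class under each setting. This is a small sketch; the names d_factor and d_string are illustrative and not from the example above, and stringsAsFactors is passed explicitly in both calls.

```r
# Same data built both ways; stringsAsFactors is set explicitly in each call.
d_factor <- data.frame(label = rep("tbd", 5), stringsAsFactors = TRUE)
d_string <- data.frame(label = rep("tbd", 5), stringsAsFactors = FALSE)

class(d_factor$label)
#> [1] "factor"
class(d_string$label)
#> [1] "character"

# An existing factor column can be converted back to plain strings,
# after which new values can be assigned freely.
d_factor$label <- as.character(d_factor$label)
d_factor$label[[2]] <- "north"
```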

As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than some claim.

Note: the above pattern of pre-building a data.frame and filling in values by addressing row/column index sets is a very effective (and underappreciated) way to build up data. It is often easier and quicker than binding rows or columns.
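The pre-build-and-fill pattern can be sketched as follows. The data here is made up for illustration: n, result, and the column names id, label, and value are all hypothetical.

```r
# Hypothetical example: pre-allocate every row up front with typed NA placeholders.
n <- 4
result <- data.frame(
  id = seq_len(n),
  label = rep(NA_character_, n),
  value = rep(NA_real_, n),
  stringsAsFactors = FALSE
)

# Fill cells by row/column index instead of growing the frame with rbind().
for (i in seq_len(n)) {
  result$label[[i]] <- paste0("item_", i)
  result$value[[i]] <- i^2
}

# Whole index sets can also be assigned in one step.
result[c(1, 3), "label"] <- "odd"

print(result)
```

Because label is a plain character column, every assignment above goes through without any factor-level bookkeeping.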

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. Great article. Completely agree.
    This comment is just as an aside that the new factor level is automatically added for you in data.table.
    As you know, stringsAsFactors=FALSE has been the default in data.table for 10 years. So to demonstrate this feature of a factor column, we first need to set it to TRUE:

      library(data.table)

      DT = data.table(label = rep("tbd", 5), stringsAsFactors=TRUE)
      DT
          label
         <fctr>
      1:    tbd
      2:    tbd
      3:    tbd
      4:    tbd
      5:    tbd

      DT[2, label:="north"]
      DT
          label
         <fctr>
      1:    tbd
      2:  north
      3:    tbd
      4:    tbd
      5:    tbd

      DT$label
      [1] tbd   north tbd   tbd   tbd
      Levels: tbd north

    The point is just that it added the new factor level automatically for you, whereas in base R that produces a warning and an NA. I agree that most of the time the plain character type is probably best. I’m just adding the minor point that if you do have a factor (sometimes a factor is better when modelling, and ordered factors are also sometimes useful), then := in data.table copes with new factor levels.

    It’s one convenience/ease-of-use feature of data.table that has nothing to do with size or speed.

    1. Thanks, Matt. My impression has been that data.table is definitely designed with actual production use very much in mind.

  2. And for anyone who has ever had to deal with the frustration of factors, a very cathartic way to implement this tip is

    devtools::install_github("nutterb/sillylogic")

    d <- data.frame(label = rep("tbd", 5),
                    stringsAsFactors = HELLNO)
