R often uses a concept of factors to re-encode strings. This can be too early and too aggressive. Sometimes a string is just a string.
It is often claimed Sigmund Freud said “Sometimes a cigar is just a cigar.”
To avoid problems, delay re-encoding of strings by using stringsAsFactors = FALSE when creating data.frames (before R 4.0.0 the default was TRUE). Without it:
    d <- data.frame(label = rep("tbd", 5))
    d$label[[2]] <- "north"
    #> Warning in `[[<-.factor`(`*tmp*`, 2, value = structure(c(1L, NA, 1L, 1L, :
    #> invalid factor level, NA generated
    print(d)
    #>   label
    #> 1   tbd
    #> 2  <NA>
    #> 3   tbd
    #> 4   tbd
    #> 5   tbd
Notice our new value was not copied in!
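One way to see why the assignment failed is to inspect the column itself. The following is a minimal sketch, not part of the original example: it passes stringsAsFactors = TRUE explicitly (an assumption made so the behavior is the same on R 4.0.0 and later, where the default is FALSE) and suppresses the warning to keep the output short.

```r
# Force factor conversion explicitly so the result does not depend
# on the R version's default for stringsAsFactors.
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = TRUE)

# The column is a factor whose only known level is "tbd".
class(d$label)
#> [1] "factor"
levels(d$label)
#> [1] "tbd"

# "north" is not among the levels, so assigning it yields NA
# (the warning is suppressed here only to keep the sketch quiet).
suppressWarnings(d$label[[2]] <- "north")
is.na(d$label[[2]])
#> [1] TRUE
```

The factor fixed its set of allowed values when the data.frame was built, which is the "too early" re-encoding described above.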
The fix is easy: use stringsAsFactors = FALSE.
    d <- data.frame(label = rep("tbd", 5), stringsAsFactors = FALSE)
    d$label[[2]] <- "north"
    print(d)
    #>   label
    #> 1   tbd
    #> 2 north
    #> 3   tbd
    #> 4   tbd
    #> 5   tbd
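Delaying the re-encoding does not mean giving up factors. If a factor is needed later (for modeling, say), one possible approach is to convert the column once its values are final. This is a sketch using base R's factor(); the choice of when to convert is an assumption for illustration.

```r
d <- data.frame(label = rep("tbd", 5), stringsAsFactors = FALSE)
d$label[[2]] <- "north"

# Convert only once the values are final; the levels are taken from
# the values actually present (sorted by default).
d$label <- factor(d$label)
levels(d$label)
#> [1] "north" "tbd"
```

Converting at the end means the levels reflect the finished data rather than the placeholder values used during construction.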
As is often the case: base R works okay in default mode and works very well if you judiciously change a few defaults. There is much less need to whole-hog replace R functionality than some claim.
Note: the above pattern of pre-building a data.frame and filling values by addressing row/column index sets is a very effective (and underappreciated) way to build up data (often easier and quicker than binding rows or columns).
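The pre-build-and-fill pattern might look like the following sketch. The column names id and score and the values assigned are hypothetical, chosen only to show a few ways of addressing index sets.

```r
# Pre-allocate the frame with placeholder values of the right type.
n <- 5
d <- data.frame(
  id = seq_len(n),
  label = rep("tbd", n),
  score = rep(NA_real_, n),
  stringsAsFactors = FALSE
)

# Fill cells by addressing row and column index sets.
d$label[c(1, 3)] <- "north"                     # several rows of one column
d[d$id > 3, "label"] <- "south"                 # rows chosen by a condition
d[2, c("label", "score")] <- list("east", 0.5)  # one row, two columns
d$score[d$label == "north"] <- c(1.5, 2.5)      # values for the matching rows
print(d)
```

Because label is a plain character column, every assignment above succeeds without any factor-level bookkeeping.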
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.