Menu Home

For loops in R can lose class information

Did you know R‘s for() loop control structure drops class annotations from vectors?

Consider the following code R code demonstrating three uses of a for-loop that one would expect to behave very similarly.

dates <- c(as.Date('2015-01-01'),as.Date('2015-01-02'))

for(ii in seq_along(dates)) {
  di <- dates[ii]
  print(di)
}
## [1] "2015-01-01"
## [1] "2015-01-02"

for(di in as.list(dates)) {
  print(di)
}
## [1] "2015-01-01"
## [1] "2015-01-02"

for(di in dates) {
  print(di)
}
## [1] 16436
## [1] 16437

Notice in the third for loop the di print as numbers. This is because running through the dates in this way loses the class annotations. To me this is a huge undesirable surprise (given that indexing does not lose class information). Remember with the class information missing many more behaviors that just printing may be broken. The third loop is the “most natural” as it doesn’t introduce an index or re-process the vector prior to iterating. But the third loop seems to not be safe to use with code that depends on class annotations being preserved.

The work arounds are shown prior to the failure (introducing an index or converting the vector into a list).

Also notice vapply() also loses the class info, even when you explicitly supply it:

vapply(dates,function(x) {x+0},as.Date(0))
## [1] 16436 16437

I understand the vapply() case, as is.vector(dates) is false (due to the class annotation) and one would expect to return something that has is.vector() true. But the for-loop behavior is a real head-scratcher that took a while to believe was actually happening when a partner ran into it.

For a lot of languages for(di in dates) { ... } is roughly syntactic sugar for for(ii in seq_along(dates)) { di <- dates[ii]; ... } (assuming ii is not used elsewhere in the code, i.e. we have a so-called “hygienic” substitution). So it is a big surprise that these two code fragments behave so differently in R. It also make R a bit harder to teach.

I didn’t anticipate these issues from R’s basic documentation which says:

For each element in vector the variable name is set to the value of that element and statement1 is evaluated.

I am guessing this is to be read with the additional understanding that dates[i] is not the just value in the i-th position of dates but in fact a length-1 vector with the same class annotation as dates containing that value (so think of [] is performing extra work to copy over this information). Of course the for loop can’t be working only through “naked” values as R doesn’t expose scalars to user code, therefore any values a for-loop is iterating over must also appear re-wrapped in a vector structure.

If anybody has some good teaching material on this I’d love to see it.

Categories: Uncategorized

Tagged as:

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

9 replies

  1. The index has to be numeric and you are correct that it will not just assume that you really meant to start with 1 and go to the last one, which actually is quite useful at times such as when you don’t want to do that. So it is converting the dates to a number. I guess it doesn’t warn you because people might be doing this on purpose.

    for(di in dates) {
    # Convert it back to a date.
    a <- as.Date(di, origin = "1960-01-02")
    print(a)
    }

    or you could not treat the date value as the index and instead treat the dates index as the index.

    for(di in 1:length(dates)) {
    print(dates[di])
    }

    1. Thanks Elin. I thought the “seq_along()” would be a safer way to move to indexes (as 1:length(dates) doesn’t work when dates is zero length). The converting back seems too dangerous to rely on in production.

  2. I am guessing all of this is an artifact of the S3 class system entering the S/R ecosystem after the for-loop was already established. Its funny in Java the collection type for-loop entered later, so it was defined to emulate the index style for-loop semantics.

  3. ifelse() has the same problem, I believe:
    ifelse(dates == as.Date(“2015-01-01”), dates, dates + 1)
    [1] 16436 16438

    1. Thanks, ifelse() is pretty interesting. If you look at the code (print(ifelse)) you can see it goes to great lengths to stuff the answers into a vector that is built from the test-vector using explicit commands (such as as.logical()) to strip off attributes (such as class).

  4. Some explicit “for”{} overcomes this problem. For some case, the problem is tha class procedure itself! Then a wrapped function is evident.

  5. I am guessing the issue is that in the S3 object system the value c(as.Date('2015-01-01'),as.Date('2015-01-02')) “should” be interpreted as a single object of class “Date” that happens to contain two values in its implementation, and not as two Date values in a vector (similar for POSIXct).

    For example in defining a class such “PlaneCoordinate” as a class over numeric vectors it might make sense to expect such objects are always implemented over vectors of length 2, and splitting them up doesn’t “isn’t sensible.”

    This is sort of following the principle that in object oriented programming objects tend to preserve invariants (you tend not to rip into their member slots directly).

    However, the c(), [], and [[]] operators clearly all treat such vectors as vectors of objects (not as a single object encapsulating a vector of values). So it is a bit inconsistent for for() to have the opposite interpretation.

    Likely some code breaks if we were to wish in a new behavior for for().

  6. Oddly enough, this doesn’t happen with factors, which should have the same problem:

    > tmp = as.factor(c('a', 'b', 'c'))
    > for(t in tmp) print(t)
    [1] "a"
    [1] "b"
    [1] "c"

    > class(tmp)
    [1] "factor"
    > typeof(tmp)
    [1] "integer"

    Likely, it caused enough problems that they special-cased factors.

%d bloggers like this: