
Don’t use stats::aggregate()

When working with an analysis system (such as R) there are usually good reasons to prefer functions from the “base” system over functions from extension packages. However, base functions are sometimes locked into unfortunate design compromises that can now be avoided. In R’s case I would say: do not use stats::aggregate().

Read on for our example.

For our example we create a data frame. The issue is this: I am working in the Pacific time zone on Saturday, October 31st, 2015, and I have some time data, recorded in an Asian time zone, that I want to work with.

print(date())
## [1] "Sat Oct 31 08:14:38 2015"
d <- data.frame(group='x',
 time=as.POSIXct(strptime('2006/10/01 09:00:00',
   format='%Y/%m/%d %H:%M:%S',
   tz="Etc/GMT+8"),tz="Etc/GMT+8"))  # I'd like to say UTC+8 or CST
print(d)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d$time)
## [1] "2006-10-01 09:00:00 GMT+8"
str(d$time)
##  POSIXct[1:1], format: "2006-10-01 09:00:00"
print(unclass(d$time))
## [1] 1159722000
## attr(,"tzone")
## [1] "Etc/GMT+8"

Suppose I try to aggregate the data to find the earliest time for each group. I have a problem: aggregate() loses the time zone and gives a bad answer.

d2 <- aggregate(time~group,data=d,FUN=min)
print(d2)
##   group                time
## 1     x 2006-10-01 10:00:00
print(d2$time)
## [1] "2006-10-01 10:00:00 PDT"

This is bad. Our time has lost its time zone and changed from 09:00:00 to 10:00:00. This violates John M. Chambers’ “Prime Directive” that:

computations can be understood and trusted.
Software for Data Analysis, John M. Chambers, Springer 2008, page 3.

The issue is that a POSIXct time is essentially a numeric array carrying its time zone around as an attribute. Most base R code has problems if there are extra attributes on a numeric array, so R stats code tends to have a habit of dropping attributes when it can. It is odd that the class() is kept (which is itself an attribute-style structure) while the time zone is lost, but R is full of hand-specified corner cases.
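
A quick check (a sketch of my own, continuing from the d and d2 above; t2 is just a scratch name): the underlying number should be untouched, with only the tzone attribute dropped, which is why the value gets re-rendered in my local Pacific time zone. If that is right, putting the attribute back repairs the display.

print(unclass(d2$time))           # expect the same 1159722000, but without the "Etc/GMT+8" tzone attribute
t2 <- d2$time                     # work on a copy so d2 itself stays as aggregate() left it
attr(t2, "tzone") <- "Etc/GMT+8"  # restore the dropped attribute by hand
print(t2)                         # should display as "2006-10-01 09:00:00 GMT+8" again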

dplyr gets the right answer.

library('dplyr')
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
by_group <- group_by(d,group)
d3 <- summarize(by_group,min(time))
print(d3)
## Source: local data frame [1 x 2]
## 
##   group           min(time)
## 1     x 2006-10-01 09:00:00
print(d3[[2]])
## [1] "2006-10-01 09:00:00 GMT+8"

And plyr also works.

library('plyr')
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
d4 <- ddply(d,.(group),summarize,time=min(time))
print(d4)
##   group                time
## 1     x 2006-10-01 09:00:00
print(d4$time)
## [1] "2006-10-01 09:00:00 GMT+8"

Categories: Pragmatic Data Science Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

7 replies

  1. It’s definitely nice to know about that issue! However, I’d say to just avoid this particular case with aggregate, mainly because aggregate is usually much faster than dplyr. In fact, we’ve compared aggregate with dplyr and data.table and found that data.table has the best performance, closely followed by aggregate and then a much slower dplyr. Yes, dplyr is probably the most comprehensive and yields correct results, but for large datasets it’s just a hassle to use. My advice: try data.table.

    1. Thanks! Always happy to get some free data.table advice. Below is some code showing it working perfectly:

      library('data.table')
      ## 
      ## Attaching package: 'data.table'
      ## 
      ## The following objects are masked from 'package:dplyr':
      ## 
      ##     between, last
      d5 <- as.data.frame(data.table(d)[,min(time),by = group])
      print(d5)
      ##   group                  V1
      ## 1     x 2006-10-01 09:00:00
      str(d5)
      ## 'data.frame':    1 obs. of  2 variables:
      ##  $ group: Factor w/ 1 level "x": 1
      ##  $ V1   : POSIXct, format: "2006-10-01 09:00:00"
      print(unclass(d5[[2]]))
      ## [1] 1159722000
      ## attr(,"tzone")
      ## [1] "Etc/GMT+8"
  2. Still surprised about the claim that aggregate is faster than dplyr. Do you have any benchmark?
    In my own experience it’s quite the opposite; maybe there is something I am doing wrong with aggregate. Here is a small example:

    ```{r}
    library(dplyr)
    n <- 1e8
    m <- 1e4
    set.seed(1)
    d <- data.frame(x = sample(m, n, replace=TRUE), y = runif(n))

    system.time(d1 <- d %>% group_by(x) %>% summarize(y = mean(y)))
    ## user system elapsed
    ## 6.463 0.156 6.619

    system.time(d2 <- aggregate(y ~ x, FUN = mean, data = d))
    ## user system elapsed
    ## 128.854 11.056 140.064

    all.equal(as.data.frame(d1), d2)
    ## [1] TRUE
    ```

    1. Now that you give the example, I believe I might have confused plyr with dplyr. We did some tests internally at a company I used to work at last year, so I don’t have the code, but I’m almost sure I mixed up plyr and dplyr. Thanks for sharing the benchmark!

  3. If you are using a time zone different from your default, you need to call something like `Sys.setenv(TZ="Etc/GMT+8")`, after which `aggregate` will give you the right answer.
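
    Something along those lines (just a sketch, not run here; d2b is a scratch name):

    Sys.setenv(TZ = "Etc/GMT+8")                          # match the session zone to the data
    d2b <- aggregate(time ~ group, data = d, FUN = min)
    print(d2b$time)                                       # should now display as "2006-10-01 09:00:00 GMT+8"
    Sys.unsetenv("TZ")                                    # or restore your previous TZ setting afterwards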

    1. It may be as you say, but the execution time zone should not be confounded with the data time zone. There should be an invariance: all calculations should work the same regardless of the frame of view. Though, yes, the numbers are the same and a lot of this is in the print formatting. The whole point of classes and types is to allow some encapsulation.

      Also, the classic purpose of object-oriented classes (which the time types are) is to preserve invariants. If aggregate is going to lose the time zone it might as well lose the class annotation too (fully separating the numbers from a default time interpretation).

      And it isn’t just the print method that is damaged. as.character(d2$time) (which might be used in calculations) is also damaged.
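
      For instance (a small check of my own, using the d and d2 from the post):

      print(as.character(d$time))   # formats in the data's own zone: "2006-10-01 09:00:00"
      print(as.character(d2$time))  # expect the shifted "2006-10-01 10:00:00" instead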
