
Use Pseudo-Aggregators to Add Safety Checks to Your Data-Wrangling Workflow

One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify (a small illustration follows the list):

  • Which columns are keys (together they identify rows or records).
  • Which columns are data/payload (considered free-varying data).
  • Which columns are "derived" (functions of the keys).
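
For instance, in a small (hypothetical) purchases table the three roles might look like the following. The table and column names here are ours, purely for illustration.

purchases <- data.frame(
  user_id   = c(1, 1, 2),              # key: identifies the record
  amount    = c(5.0, 2.5, 7.0),        # payload: free-varying data
  user_name = c("ann", "ann", "bob"),  # derived: a function of user_id
  stringsAsFactors = FALSE)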

In this note we will show how to use some of these ideas to write safer data-wrangling code.

In mathematics the statement "y is a function of x" merely means there is an ideal lookup table such that if you knew the value of x, then you could in principle look up the value of y.

For example, in the following R data.frame y is a function of x, as all rows that have the same value for x also have the same value for y:

d1 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 20, 20), 
  z = c(1, 2, 1, 1, 1),
  stringsAsFactors = FALSE)
print(d1)
##   x  y z
## 1 a 10 1
## 2 a 10 2
## 3 b 20 1
## 4 b 20 1
## 5 c 20 1

Notice that if we know the value of x then we, in principle, know the value of y. In the same example z is not a function of x, as it does not have this property.
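
We can check this property directly. Here is a minimal sketch of such a check in base R (the helper name is_function_of is our own, not from any package): y is a function of x exactly when every x-group contains a single distinct y value.

is_function_of <- function(d, key, col) {
  all(tapply(d[[col]], d[[key]],
             function(v) length(unique(v)) == 1))
}

is_function_of(d1, "x", "y")
## [1] TRUE
is_function_of(d1, "x", "z")
## [1] FALSE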

A more concrete example would be: user-name is a function of user-ID. If you know an individual's user-ID, then you also (given the right lookup table) know the individual's user-name.

We first taught these concepts in the context of SQL and SQL grouped aggregation. In SQL, once you aggregate on one column, all other columns in your query must either be grouping columns or also aggregated. This is easiest to show in code; we will use the dplyr package for our example.

library("dplyr")

d1 %>%
  group_by(x) %>%
  summarize(y = max(y)) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##   <chr> <dbl>
## 1 a        10
## 2 b        20
## 3 c        20

Notice only grouping columns and columns passed through an aggregating calculation (such as max()) appear in the result (the column z does not). Because y is a function of x, no substantial aggregation is going on; we call this situation a "pseudo-aggregation", and we have taught it before. This is also why we made the seemingly strange choice of keeping the variable name y (instead of picking a new name such as max_y): we expect the y values coming out to be the same as those coming in, just with a change of length. Pseudo-aggregation (using the projection y[[1]]) was also used in the solutions of the column indexing problem.
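
For completeness: the projection-style pseudo-aggregation mentioned above looks like the following in dplyr (a sketch: y[[1]] just picks the first y in each group, silently assuming y is constant per group).

d1 %>%
  group_by(x) %>%
  summarize(y = y[[1]]) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##   <chr> <dbl>
## 1 a        10
## 2 b        20
## 3 c        20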

Our wrapr package now supplies a special-case pseudo-aggregator (or, in a mathematical sense, a projection): psagg(). It works as follows.

library("wrapr")

d1 %>%
  group_by(x) %>%
  summarize(y = psagg(y)) %>%
  ungroup()
## # A tibble: 3 x 2
##   x         y
##   <chr> <dbl>
## 1 a        10
## 2 b        20
## 3 c        20

psagg() works pretty much the same as the earlier max(). However, it documents our belief that y is a function of x (that nothing interesting is going on in this column during aggregation). Where psagg() differs is when our assumption that y is a function of x is violated.

d2 <- data.frame(
  x = c("a", "a", "b", "b", "c"),
  y = c(10, 10, 20, 23, 20), 
  stringsAsFactors = FALSE)
print(d2)
##   x  y
## 1 a 10
## 2 a 10
## 3 b 20
## 4 b 23
## 5 c 20
d2 %>%
  group_by(x) %>%
  summarize(y = psagg(y)) %>%
  ungroup()
## Error in summarise_impl(.data, dots): Evaluation error: wrapr::psagg argument values are varying.

The code caught that our assumption was false and raised an error. This sort of checking can save a lot of time and prevent erroneous results.
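
When the aggregation is one step of a larger workflow, the failed check can also be caught and re-signaled with a message in the problem's own vocabulary. A small sketch using base R's tryCatch() (the message wording is ours):

result <- tryCatch(
  d2 %>%
    group_by(x) %>%
    summarize(y = psagg(y)) %>%
    ungroup(),
  error = function(e) {
    stop(paste("y was assumed to be a function of x, but:",
               conditionMessage(e)))
  })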

And that is part of what we teach:

  • document intent,
  • check assumptions,
  • double-check results,
  • use composable small helper tools.

psagg() also works well with data.table.

library("data.table")

as.data.table(d1)[
  , .(y = psagg(y)), by = "x"]
##    x  y
## 1: a 10
## 2: b 20
## 3: c 20
as.data.table(d2)[
  , .(y = psagg(y)), by = "x"]
## Error in psagg(y): wrapr::psagg argument values are varying

Of course we don't strictly need psagg(), as we could insert checks by hand (though this would become burdensome if we had many derived columns).

as.data.table(d1)[
  , .(y = max(y), 
      y_was_const = min(y)==max(y)), 
  by = "x"]
##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 20        TRUE
## 3: c 20        TRUE
as.data.table(d2)[
  , .(y = max(y), 
      y_was_const = min(y)==max(y)), 
  by = "x"]
##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 23       FALSE
## 3: c 20        TRUE
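
If we had many derived columns, the by-hand checks could be wrapped up once into a helper. A sketch of such a helper in base R (psagg_all is our own name, not part of wrapr or data.table):

psagg_all <- function(d, key, cols) {
  for (ci in cols) {
    const_per_group <- tapply(d[[ci]], d[[key]],
                              function(v) length(unique(v)) == 1)
    if (!all(const_per_group)) {
      stop(paste("column", ci, "is not a function of", key))
    }
  }
  # safe to de-duplicate: every column in cols is constant per key-group
  unique(d[, c(key, cols), drop = FALSE])
}

psagg_all(d1, "x", "y")
##   x  y
## 1 a 10
## 3 b 20
## 5 c 20
psagg_all(d2, "x", "y")
## Error in psagg_all(d2, "x", "y"): column y is not a function of x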

Unfortunately, this sort of checking does not currently work for dplyr.

packageVersion("dplyr")
## [1] '0.7.7'
d1 %>%
  group_by(x) %>%
  summarize(y = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x         y y_was_const
##   <chr> <dbl> <lgl>      
## 1 a        10 TRUE       
## 2 b        20 TRUE       
## 3 c        20 TRUE
d2 %>%
  group_by(x) %>%
  summarize(y = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x         y y_was_const
##   <chr> <dbl> <lgl>      
## 1 a        10 TRUE       
## 2 b        23 TRUE       
## 3 c        20 TRUE

Notice the per-group variation in y was not detected. This appears to be an un-caught result-corruption issue in dplyr. I fully get that it is odd to run into an error during a checking step, but the checking step did not in fact introduce problems: it merely failed to catch them. We run checks like this because the data (possibly from an external source) may not be structured quite the way we were told it is.

If we don’t attempt to re-use the variable name we get the correct result.

d2 %>%
  group_by(x) %>%
  summarize(y_for_group = max(y),
            y_was_const = min(y)==max(y)) %>%
  ungroup()
## # A tibble: 3 x 3
##   x     y_for_group y_was_const
##   <chr>       <dbl> <lgl>      
## 1 a              10 TRUE       
## 2 b              23 FALSE      
## 3 c              20 TRUE

However, being forced to rename variables during aggregation is a needless user burden.

Our seplyr package (a very thin adapter on top of dplyr) does issue a warning in the failing situation.

library("seplyr")

d2 %>%
  group_by_se("x") %>%
  summarize_nse(y = max(y),
                y_was_const = min(y)==max(y)) %>%
  ungroup()
## Warning in summarize_se(res, summarizeTerms, warn = summarize_nse_warn, :
## seplyr::summarize_se possibly confusing column name re-use c('y' =
## 'max(y)', 'y_was_const' = 'min(y) == max(y)')

## # A tibble: 3 x 3
##   x         y y_was_const
##   <chr> <dbl> <lgl>      
## 1 a        10 TRUE       
## 2 b        23 TRUE       
## 3 c        20 TRUE

The result is still wrong, but at least in this situation the user has a better chance of noticing and working around the issue.

The problem may not be common (it may or may not be in any of your code, or code you use), and is of course easy to avoid (once you know the nature of the issue).

Categories: Tutorials

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. The dplyr issue is probably not a user misunderstanding of what values are available when. Notice in this example we see a value that is not the expected value of the x column under any notion of before/after:

    dplyr::summarize(data.frame(x = 1), x = max(x), min_x = min(x))
    #   x min_x
    # 1 1     0
    

    One can apparently fix things by making the name re-use last. However, with no clear statement of what the summarize() column semantics are, that seems risky. I believe mutate() is documented as having left-to-right semantics on in-memory data.frames (perhaps implied by the "Note that you can refer to columns that you've just created" section of R for Data Science), but it currently does not have such semantics on databases.

    library("dplyr")
    
    db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    
    d_local <- data.frame(x = 1)
    
    d_local %>% 
      mutate(y = 1, y = y + 1, y = y + 1)
    #  x y
    #1 1 3
    
    d_remote <- dplyr::copy_to(db, d_local, "d_remote")
    
    d_remote %>% 
      mutate(y = 1, y = y + 1, y = y + 1)
    # # Source:   lazy query [?? x 2]
    # # Database: sqlite 3.22.0 [:memory:]
    #       x     y
    #   <dbl> <dbl>
    # 1     1     2
    
    library("seplyr")
    
    d_remote %>% 
      mutate_nse(y = 1, y = y + 1, y = y + 1)
    # Source:   lazy query [?? x 2]
    # Database: sqlite 3.22.0 [:memory:]
    #       x     y
    #   <dbl> <dbl>
    # 1     1     3
    # Warning message:
    # In mutate_se(res, mutateTerms, splitTerms = mutate_nse_split_terms,  :
    #   seplyr::mutate_se possible name conflicts in assigment
    
    DBI::dbDisconnect(db)
    

    Notice how for dplyr the local and remote answers differ. In this case seplyr not only re-factors the mutate to have reliable left-to-right semantics, it also issues a warning.

  2. Sequence matters!!

    packageVersion("dplyr")
    # [1] '0.7.6'

    d1 <- data.frame(
      x = c("a", "a", "b", "b", "c"),
      y = c(10, 10, 20, 23, 20),
      z = c(1, 2, 1, 1, 1),
      stringsAsFactors = FALSE)

    d1 %>%
      group_by(x) %>%
      summarize(y_was_const = min(y)==max(y),
                y = max(y)) %>%
      ungroup()
    # # A tibble: 3 x 3
    #   x     y_was_const     y
    #   <chr> <lgl>       <dbl>
    # 1 a     TRUE           10
    # 2 b     FALSE          23
    # 3 c     TRUE           20

    Just creating y_was_const before summarizing max(y) gets the check done. :)

    1. Yes, but a reliable package would warn a new user away from such situations (i.e., if the code in the "wrong" order is incorrect, it would reject that code). Neither SQL nor data.table has this problem, as they both define all expression terms to unambiguously refer to columns prior to any transform.
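
       To illustrate with a small example of our own (mirroring the dplyr one in the first reply):

       library("data.table")

       # every term in j refers to the original columns, so re-using
       # the name x does not change what min(x) sees:
       data.table(x = 1)[, .(x = max(x), min_x = min(x))]
       #    x min_x
       # 1: 1     1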
