One of the concepts we teach in both Practical Data Science with R and in our theory of data shaping is the importance of identifying the roles of columns in your data.

For example, to think in terms of multi-row records it helps to identify:

• Which columns are keys (together identify rows or records).
• Which columns are data/payload (are considered free varying data).
• Which columns are "derived" (functions of the keys).

In this note we will show how to use some of these ideas to write safer data-wrangling code.

In mathematics the statement "y is a function of x" merely means there is an ideal lookup table with which if you knew the value of `x`, then you could in principle know the value of `y`.

For example in the following `R` `data.frame:` `y` is a function of `x`, as all rows that have the same value for `x` also have the same value for `y`.

``````d1 <- data.frame(
x = c("a", "a", "b", "b", "c"),
y = c(10, 10, 20, 20, 20),
z = c(1, 2, 1, 1, 1),
stringsAsFactors = FALSE)
print(d1)``````
``````##   x  y z
## 1 a 10 1
## 2 a 10 2
## 3 b 20 1
## 4 b 20 1
## 5 c 20 1``````

Notice if we know the value of `x` we then, in principle know the value of `y`. In the same example `z` is not a function of `x`, as it does not have this property.

A more concrete example would be: user-name is a function of user-ID. If you know the an individual’s user-ID, then you also (if you have the right lookup table) know the individual’s user-name.

We first taught these concepts in the context of `SQL` and `SQL` grouped aggregation. In `SQL` once you aggregate on one column, then all other columns in your query must either be the grouping columns, or also aggregated. This is easiest to show in code, but we will use the `dplyr` package for our example.

``````library("dplyr")

d1 %>%
group_by(x) %>%
summarize(y = max(y)) %>%
ungroup()``````
``````## # A tibble: 3 x 2
##   x         y
##   <chr> <dbl>
## 1 a        10
## 2 b        20
## 3 c        20``````

Notice only grouping columns and columns passed through an aggregating calculation (such as `max()`) are passed through (the column `z` is not in the result). Now because `y` is a function of `x` no substantial aggregation is going on, we call this situation a "pseudo aggregation" and we have taught this before. This is also why we made the seemingly strange choice of keeping the variable name `y` (instead of picking a new name such as `max_y`), we expect the `y` values coming out to be the same as the one coming in- just with changes of length. Pseudo aggregation (using the projection `y[]`) was also used in the solutions of the column indexing problem.

Our `wrapr` package now supplies a special case pseudo-aggregator (or in a mathematical sense: projection): `psagg()`. It works as follows.

``````library("wrapr")

d1 %>%
group_by(x) %>%
summarize(y = psagg(y)) %>%
ungroup()``````
``````## # A tibble: 3 x 2
##   x         y
##   <chr> <dbl>
## 1 a        10
## 2 b        20
## 3 c        20``````

`psagg()` pretty much worked the same as the earlier `max()`. However, it documents our belief that `y` is a function of `x` (that nothing interesting is going on for this column during aggregation). Where `psagg()` differs is if our assumption that `y` is a function of `x` is violated.

``````d2 <- data.frame(
x = c("a", "a", "b", "b", "c"),
y = c(10, 10, 20, 23, 20),
stringsAsFactors = FALSE)
print(d2)``````
``````##   x  y
## 1 a 10
## 2 a 10
## 3 b 20
## 4 b 23
## 5 c 20``````
``````d2 %>%
group_by(x) %>%
summarize(y = psagg(y)) %>%
ungroup()``````
``## Error in summarise_impl(.data, dots): Evaluation error: wrapr::psagg argument values are varying.``

The code caught that our assumption was false and raised on error. This sort of checking can save a lot of time and prevent erroneous results.

And that is part of what we teach:

• document intent
• check assumptions
• double-check results
• use composable small helper-tools.

`psagg()` also works well with `data.table`.

``````library("data.table")

as.data.table(d1)[
, .(y = psagg(y)), by = "x"]``````
``````##    x  y
## 1: a 10
## 2: b 20
## 3: c 20``````
``````as.data.table(d2)[
, .(y = psagg(y)), by = "x"]``````
``## Error in psagg(y): wrapr::psagg argument values are varying``

Of course we don’t strictly need `psagg()` as we could insert checks by hand (though this would become burdensome if we had many derived columns).

``````as.data.table(d1)[
, .(y = max(y),
y_was_const = min(y)==max(y)),
by = "x"]``````
``````##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 20        TRUE
## 3: c 20        TRUE``````
``````as.data.table(d2)[
, .(y = max(y),
y_was_const = min(y)==max(y)),
by = "x"]``````
``````##    x  y y_was_const
## 1: a 10        TRUE
## 2: b 23       FALSE
## 3: c 20        TRUE``````

Unfortunately, this sort of checking does not currently work for `dplyr`.

``packageVersion("dplyr")``
``##  '0.7.7'``
``````d1 %>%
group_by(x) %>%
summarize(y = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()``````
``````## # A tibble: 3 x 3
##   x         y y_was_const
##   <chr> <dbl> <lgl>
## 1 a        10 TRUE
## 2 b        20 TRUE
## 3 c        20 TRUE``````
``````d2 %>%
group_by(x) %>%
summarize(y = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()``````
``````## # A tibble: 3 x 3
##   x         y y_was_const
##   <chr> <dbl> <lgl>
## 1 a        10 TRUE
## 2 b        23 TRUE
## 3 c        20 TRUE``````

Notice the per-group variation in `y` was not detected. This appears to be a `dplyr` un-caught result corruption issue. I fully get that it is odd to run into an error during a checking step, but the checking step did not in fact introduce problems- it is merely failing to catch them. We run checks like this because the data (possibly from an external source) may not be quite structured they way we were told it is.

If we don’t attempt to re-use the variable name we get the correct result.

``````d2 %>%
group_by(x) %>%
summarize(y_for_group = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()``````
``````## # A tibble: 3 x 3
##   x     y_for_group y_was_const
##   <chr>       <dbl> <lgl>
## 1 a              10 TRUE
## 2 b              23 FALSE
## 3 c              20 TRUE``````

However, being forced to rename variables during aggregation is a needless user burden.

Our `seplyr` package (a package that is a very thin adapter on top of `dplyr`) does issue a warning in the failing situation.

``````library("seplyr")

d2 %>%
group_by_se("x") %>%
summarize_nse(y = max(y),
y_was_const = min(y)==max(y)) %>%
ungroup()``````
``````## Warning in summarize_se(res, summarizeTerms, warn = summarize_nse_warn, :
## seplyr::summarize_se possibly confusing column name re-use c('y' =
## 'max(y)', 'y_was_const' = 'min(y) == max(y)')

## # A tibble: 3 x 3
##   x         y y_was_const
##   <chr> <dbl> <lgl>
## 1 a        10 TRUE
## 2 b        23 TRUE
## 3 c        20 TRUE``````

The result is still wrong, but at least in this situation the user has a better chance at noticing and working around the issue.

The problem may not be common (it may or may not be in any of your code, or code you use), and is of course easy to avoid (once you know the nature of the issue).

Categories: Programming Tutorials

Tagged as: ### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 4 replies ›

1. John Mount says:

The `dplyr` issue is probably not a user mis-understanding of what values are available when. Notice in this example we see a value that is not the expected value of the `x` column for any notion of before/after:

```dplyr::summarize(data.frame(x = 1), x = max(x), min_x = min(x))
#   x min_x
# 1 1     0
```

One can apparently fix things by making the name-reuse last. However, with no clear statement on what the column `summarize()` semantics are that seems risky. I believe `mutate()` is documented as have left-to right semantics on in-memory `data.frame`s (perhaps implied by the “Note that you can refer to columns that you’ve just created” section of R for Data Science), but it currently does not have such in databases.

```library("dplyr")

db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

d_local <- data.frame(x = 1)

d_local %>%
mutate(y = 1, y = y + 1, y = y + 1)
#  x y
#1 1 3

d_remote <- dplyr::copy_to(db, d_local, "d_remote")

d_remote %>%
mutate(y = 1, y = y + 1, y = y + 1)
# # Source:   lazy query [?? x 2]
# # Database: sqlite 3.22.0 [:memory:]
#       x     y
#
# 1     1     2

library("seplyr")

d_remote %>%
mutate_nse(y = 1, y = y + 1, y = y + 1)
# Source:   lazy query [?? x 2]
# Database: sqlite 3.22.0 [:memory:]
#       x     y
#
# 1     1     3
# Warning message:
# In mutate_se(res, mutateTerms, splitTerms = mutate_nse_split_terms,  :
#   seplyr::mutate_se possible name conflicts in assigment

DBI::dbDisconnect(db)
```

Notice how for `dplyr` the local and remote answers differ. In this case `seplyr` not only re-factors the mutate to have reliable left to right semantics, it also issues a warning.

2. Zihao Feng says:

sequence matters!!

packageVersion(“dplyr”)

 ‘0.7.6’

d1 <- data.frame(
x = c(“a”, “a”, “b”, “b”, “c”),
y = c(10, 10, 20, 23, 20),
z = c(1, 2, 1, 1, 1),
stringsAsFactors = FALSE)

d1 %>%
group_by(x) %>%
summarize(y_was_const = min(y)==max(y),
y = max(y)) %>%
ungroup()

# A tibble: 3 x 3

x y_was_const y

1 a TRUE 10

2 b FALSE 23

3 c TRUE 20

just create y_was_const before summarizing max(y) can do the check.
:)

1. John Mount says:

Yes, but a reliable package would warn a new user against such situations (i.e. if the code in the “wrong” order is incorrect it would reject it). Neither SQL nor data.table have this problem as they both define all expression terms to unambiguously refer to columns prior to any transform.

3. Łukasz Deryło says:

Sounds nice, definetely worth trying.