Authors: John Mount and Nina Zumel
In teaching thinking in terms of coordinatized data, we find the hardest operations to teach are joins and pivots.
One thing we commented on is that moving data values into rows, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting”, or “gathering”), is easy to explain, as the operation is a function that takes a single row and builds groups of new rows in an obvious manner. We commented that the inverse operation of moving data into columns, or the “widening” operation (often called “pivoting”, “unstacking”, “casting”, or “spreading”), is harder to explain, as it takes a specific group of rows and maps them back to a single row. However, if we take extra care and factor the pivot operation into its essential sub-operations, we find pivoting can be usefully conceptualized as a simple single-row to single-row mapping followed by a grouped aggregation.
Please read on for our thoughts on teaching pivoting data.
(edit 11-12-2017: the cdata package now supplies these operations with slightly different names and signatures, please see here for updated details.)
In data science, data rows are often considered to be instances. Because of this, the data scientist needs explicit control over which facts fall into a single row. If we are trying to compute the relative prevalence of birth names by year broken down by sex, we probably want both sexes in a single row. If we are trying to graph the same data using the R package ggplot2, we may want each year-plus-sex combination to determine a different row. Our thesis is that these differences are inessential features of data presentation, and are not to be confused with properties of the underlying data.
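To make the two layouts concrete, here is a tiny sketch in R (the values are made up for illustration, not real birth-name counts): the wide form keeps both sexes in a single row per year, while the thin form gives each year-and-sex combination its own row.

```r
# Illustrative only: made-up values, not real birth-name counts.
# Wide form: one row per year, sexes as columns (handy for within-year math).
wide <- data.frame(year   = c(1950, 1951),
                   Male   = c(1.79, 1.92),
                   Female = c(1.71, 1.84))

# Thin form: one row per year-and-sex combination (handy for ggplot2,
# where each row maps to one plotted point).
thin <- data.frame(year     = c(1950, 1950, 1951, 1951),
                   sex      = c("M", "F", "M", "F"),
                   millions = c(1.79, 1.71, 1.92, 1.84),
                   stringsAsFactors = FALSE)
```

The same facts live in both frames; only the coordinatization differs.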
Because we need to move from form to form, we need both terminology to discuss the transforms and tools that implement the transforms.
For example, when we were preparing our recent Strata workshop on Spark/R/Sparklyr, we started with materials from our RStudio partners and found ourselves puzzled by one bit of code:
birthsYearly <- applicants_tbl %>%
  mutate(male = ifelse(sex == "M", n_all, 0),
         female = ifelse(sex == "F", n_all, 0)) %>%
  group_by(year) %>%
  summarize(Male = sum(male) / 1000000,
            Female = sum(female) / 1000000) %>%
  arrange(year) %>%
  collect
One of your authors (Nina Zumel) found this code much easier to understand once she added a comment indicating intent such as:
# by-hand spread on remote data
And the other author (John Mount) noticed that this implementation of “pivot” or “spread” was a better implementation idea than he had previously been toying with to add “pivot” (or “move values to columns”) capabilities to remote data implementations (databases and Spark).
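As a sketch, the same idiom can be run locally by substituting a small in-memory data frame for the remote applicants_tbl (the sample values below are ours, made up for illustration; collect is not needed for local data):

```r
library("dplyr")

# Made-up stand-in for the remote applicants_tbl (illustrative values only).
applicants_local <- data.frame(
  year  = c(1950, 1950, 1951, 1951),
  sex   = c("M", "F", "M", "F"),
  n_all = c(1790000, 1710000, 1920000, 1840000),
  stringsAsFactors = FALSE)

# The same "by-hand spread": widen each row individually, then
# collapse the widened rows with a grouped aggregation.
birthsYearly <- applicants_local %>%
  mutate(male   = ifelse(sex == "M", n_all, 0),
         female = ifelse(sex == "F", n_all, 0)) %>%
  group_by(year) %>%
  summarize(Male = sum(male) / 1000000,
            Female = sum(female) / 1000000) %>%
  arrange(year)
```

The mutate step is the per-row widening and the group_by/summarize pair is the aggregation; this two-stage factoring is the subject of the rest of the article.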
This two stage version of pivot (widening individual rows and then summarizing by groups) is also a great way to teach data shaping techniques, which we will discuss here.
Teaching moving data to rows
Moving data to rows is easy to teach through examples. Suppose we have the following data frame:
d <- data.frame(index = c(1, 2, 3),
                meas1 = c('m1_1', 'm1_2', 'm1_3'),
                meas2 = c('m2_1', 'm2_2', 'm2_3'),
                stringsAsFactors = FALSE)
print(d)
#   index meas1 meas2
# 1     1  m1_1  m2_1
# 2     2  m1_2  m2_2
# 3     3  m1_3  m2_3
We can convert this into a “thin” form with a call such as the following:
library("dplyr")
library("cdata")
d2 <- moveValuesToRows(d,
                       nameForNewKeyColumn = 'meastype',
                       nameForNewValueColumn = 'meas',
                       columnsToTakeFrom = c('meas1', 'meas2')) %>%
  arrange(index)
print(d2)
#   index meastype meas
# 1     1    meas1 m1_1
# 2     1    meas2 m2_1
# 3     2    meas1 m1_2
# 4     2    meas2 m2_2
# 5     3    meas1 m1_3
# 6     3    meas2 m2_3
The idea is that intent is documented through the method name and the verbose argument bindings. As we mentioned in our earlier article, this transform is easy to teach, as you can meaningfully think about it operating on each input row separately:
moveValuesToRows(d[1, , drop = FALSE],
                 nameForNewKeyColumn = 'meastype',
                 nameForNewValueColumn = 'meas',
                 columnsToTakeFrom = c('meas1', 'meas2')) %>%
  arrange(index)
#   index meastype meas
# 1     1    meas1 m1_1
# 2     1    meas2 m2_1
Teaching moving data to columns
As we taught earlier, with the proper pre-conditions, we can consider moving data to columns as an inverse operation to moving data to rows. We can undo the last transform with:
d1p <- d2 %>%
  moveValuesToColumns(columnToTakeKeysFrom = 'meastype',
                      columnToTakeValuesFrom = 'meas',
                      rowKeyColumns = 'index') %>%
  arrange(index)
all.equal(d, d1p)
# [1] TRUE
Teaching moving data to columns at first blush seems harder, as the operation as normally presented takes sets of rows as input. However, this is not an essential feature of moving data to columns; it is just an optimization or convenience so deeply ingrained into implementations that it becomes part of the explanations.
Consider the following “incomplete” implementation of moving data to columns from the development version of replyr.
devtools::install_github("WinVector/replyr")
library("replyr")
d1q <- d2 %>%
  replyr_moveValuesToColumns(columnToTakeKeysFrom = 'meastype',
                             columnToTakeValuesFrom = 'meas',
                             rowKeyColumns = 'index',
                             dosummarize = FALSE,
                             fill = '') %>%
  arrange(index)
print(d1q)
#   index meas1 meas2
# 1     1  m1_1
# 2     1        m2_1
# 3     2  m1_2
# 4     2        m2_2
# 5     3  m1_3
# 6     3        m2_3
This notation makes the motion of values to columns obvious: each row from the original data frame produces a single new row in the result data frame that:
- Has a new column for each possible value seen in “columnToTakeKeysFrom”.
- Populates the column matching the value in “columnToTakeKeysFrom” with the value from “columnToTakeValuesFrom”.
- Populates the other new columns (those taking names from “columnToTakeKeysFrom”) with easy-to-remove placeholder values.
- Copies over all other column values.
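To emphasize that this expansion really is a row-by-row mapping, here is a base-R sketch (independent of replyr; the helper name expand_row is ours) that applies the rules above to each row of d2 separately:

```r
# d2 in thin form, as built earlier in the article.
d2 <- data.frame(index = c(1, 1, 2, 2, 3, 3),
                 meastype = rep(c('meas1', 'meas2'), 3),
                 meas = c('m1_1', 'm2_1', 'm1_2', 'm2_2', 'm1_3', 'm2_3'),
                 stringsAsFactors = FALSE)
keys <- c('meas1', 'meas2')  # possible values of columnToTakeKeysFrom

# expand_row (our name): maps ONE input row to ONE output row,
# filling non-matching key columns with the '' placeholder.
expand_row <- function(row) {
  out <- data.frame(index = row$index, stringsAsFactors = FALSE)
  for (k in keys) {
    out[[k]] <- if (row$meastype == k) row$meas else ''
  }
  out
}

# Apply the single-row mapping to every row and stack the results.
d1q <- do.call(rbind, lapply(seq_len(nrow(d2)),
                             function(i) expand_row(d2[i, ])))
```

No row ever needs to see any other row, which is exactly what makes this stage easy to teach (and easy to distribute).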
Once we see this, it becomes clear that moving values to columns is an operation very much like the expansion of levels in “stats::model.matrix()” or 1-hot encoding (also called “dummy variables” or “indicators”), which places ones in columns instead of arbitrary values.
Dummy or indicator column encoding example from Practical Data Science with R, Zumel, Mount; Manning 2014.
In fact, calling model.matrix() gives us a structure very similar to the “d1q” frame:
model.matrix(~ 0 + index + meastype, data = d2)
#   index meastypemeas1 meastypemeas2
# 1     1             1             0
# 2     1             0             1
# 3     2             1             0
# 4     2             0             1
# 5     3             1             0
# 6     3             0             1
The reason we bring this up is that things are easier to learn when they are in a shared, familiar context, and not treated as unique, “remarkable” occurrences.
To finish the conversion back to the original frame “d” we just have to add back in the neglected aggregation (which was intentionally suppressed by the “dosummarize = FALSE” option):
d1recovered <- d1q %>%
  group_by(index) %>%
  summarize_all("max") %>%
  arrange(index)
print(d1recovered)
# # A tibble: 3 × 3
#   index meas1 meas2
#   <dbl> <chr> <chr>
# 1     1  m1_1  m2_1
# 2     2  m1_2  m2_2
# 3     3  m1_3  m2_3
all.equal(d, data.frame(d1recovered))
# [1] TRUE
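For readers who prefer base R, the same finishing aggregation can be sketched with aggregate() (re-creating the “d1q” expanded frame inline; character max() picks out the non-blank entry in each group because the empty string sorts before any measurement value):

```r
# The expanded frame d1q from the text, re-created here for self-containment.
d1q <- data.frame(index = c(1, 1, 2, 2, 3, 3),
                  meas1 = c('m1_1', '', 'm1_2', '', 'm1_3', ''),
                  meas2 = c('', 'm2_1', '', 'm2_2', '', 'm2_3'),
                  stringsAsFactors = FALSE)

# Group by index and take the per-column max; for these character columns
# max() returns the non-empty value, collapsing each group to one row.
d1recovered <- aggregate(cbind(meas1, meas2) ~ index, data = d1q, FUN = max)
```

The point is the same either way: the aggregation is a separate, ordinary grouped summary, not something intrinsic to the widening itself.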
And we have inverted the operation and recovered “d”! Demonstrating sequences of moving values to columns and moving values to rows is key to building familiarity and trust in these operations. This is why we work through such sequences here and in our previous article (yielding the following strongly connected graph converting between four different scientists’ preferred data representations):
The typical explanation of “pivot” for spreadsheet users contains aggregation as an integral part, and the typical explanations and diagrams used by R teachers also include a hidden aggregation (though only in the weaker sense of coalescing rows). Separating row transforms completely from value aggregation/coalescing makes pivoting (or moving values to columns) much more comprehensible and teachable.
We feel showing the notional intermediate form of the “expanded data frame” we introduced here when moving values to columns (the “d1q” frame) greatly improves learnability and comprehension. We also feel one should consistently use the terms “moving values to columns” and “moving values to rows” instead of insisting new students memorize non-informative technical names. Likely the “expanded data frame” is not usually taught because it is not usually the actual implementation (as it temporarily wastes space).
The development version of replyr now implements a move-values-to-columns operation explicitly in terms of this expansion, and we have demonstrated the method working on top of Spark 2.0. This “be temporarily wasteful” strategy is actually compatible with how one designs high-throughput big-data systems, leaning hard on the aphorism:
“The biggest difference between time and space is that you can’t reuse time.” (Merrick Furst)
Edit 11-12-2017: updated API.