Statisticians and data scientists want a neat world where data is arranged in a table such that every row is an observation or instance, and every column is a variable or measurement. Getting to this state of “ready to model format” (often called a denormalized form by relational algebra types) often requires quite a bit of data manipulation. This is how
data.frames describe themselves (try “
str(data.frame(x=1:2))” in an
R-console to see this) and is part of the tidy data manifesto.
SQL (structured query language) and
dplyr can make the data arrangement process less burdensome, but using them effectively requires “index free thinking” where the data are not thought of in terms of row indices. We will explain and motivate this idea below.
SQL is actually a rich modeling environment with interesting semantics due to its relation to E.F. Codd’s relational model. However, working with
SQL databases presents a number of practical difficulties.
- Analysts are forced to think in relational terms (normalized forms, key columns, constraints, and so on)
- Not being dedicated to transient analysis,
SQLdatabases are often are configured to prohibit the alteration of the data in place (which makes sense, analysis should not change the initial facts).
- “Full” databases discourage copying tables or making a large number of intermediate structures
- Ongoing transactions transactions cause data to change over time.
Working around these difficulties, perhaps by setting up a staging database, or even containerizing the database, still leaves two problems:
- The difficulty of the
- The idea that you are not supposed to identify rows or depend on implicit row order.
These two problems are side-effects of being forced to “think in relational terms.” The first problem can be mitigated with adapters, such as
dplyr. The second problem is one to take to heart: stop thinking about row numbers and implicit row order. Learn to re-phrase your analysis in a row-ID independent manner. Most predictive analytic goals are row-ID independent (example: fitting a linear model doesn’t depend on row order), so it makes sense from a point of purity to insist that the intermediate analysis steps also not depend on row-IDs.
We are not going to directly demonstrate any
SQL queries in this article, but be aware that
SQL concepts and relational model concepts are the foundation of a lot of the user-visible semantics of substantial data stores (such as
Split-Apply-Combine and dplyr
Most transformations that appear to need row-ids (such as subsetting, and computing cumulative sums) can be re-phrased in terms of grouping and ordering. Many ideas for accomplishing this reformulation can be found in Hadley Wickham’s article “The Split-Apply-Combine Strategy for Data Analysis” (link).
Three obvious ways to implement the Split-Apply-Combine strategy are:
Of the three, only the
dplyr::group_by realization is commonly practical on large remote data sources. However, in our own
replyr::gapply we are experimenting with working with including an actual
split based approach.
A proper modern
SQL implementation (such as
Spark SQL) supports all of the operations of Split-Apply-Combine (though pivoting/un-pivoting or gather/spread has fairly shaky database support, despite database documentation claiming otherwise).
SQL just expresses grouping and ordering in an ugly way: with the grouping and ordering both being aspects of the query and not annotations on the data. This makes sense as one of the cases
SQL considers is read-only data with no temporary tables, so there is nowhere for the user to maintain the annotations other than in the query.
SQL comes from the age of data centers when the computer’s time was more valuable than the analysts’ time. Thus, burdens were placed on the analysts (i.e. in the
SQL). However, as mentioned above, this difficulty can be overcome with adapters.
SQL queries can hold all the annotations and constraints needed for sophisticated analysis, and this process can be automated. The analyst does not have to be the one to put them in the query.
dplyr seeks to mitigate creeping
SQL query complexity through adapters and many notation improvements, and includes one major semantic improvement: grouping is a visible annotation on the data. For example we move from writing seemingly convenient index-driven code that looks like:
data(iris) irism <- iris setosaIndexes <- which(irism$Species=='setosa') irism[setosaIndexes,'Sepal.Length'] <- 1 + irism[setosaIndexes,'Sepal.Length']
To writing purer transforms such as:
library('dplyr') data(iris) irism <- iris irism %>% split(irism$Species) -> itable itable$setosa$Sepal.Length <- 1 + itable$setosa$Sepal.Length irism <- dplyr::bind_rows(itable)
At first this seems like a lot of work, but the row-ID free notation eventually pays off. The Split-Apply-Combine approach is much more composable and usable on many remote data sources (many of which deliberately do not have concepts of durable row-order or row-numbers).
Split-Apply-Combine is interesting because it lets us perform arbitrary simultaneous grouped and ordered operations such as: in-group min/max, in-group ranking, in-group cumulative sums, in-group replacement of values, and so on. Split-Apply-Combine achieves this is by maintaining groups by splitting, and imposing order by later sub-queries or per-group operations (
SQL tables are not guaranteed to maintain row-orders, but sub-queries can). However, I have come to think of splitting is an implementation detail that we want to limit the lifetime and visibility of.
I am coming to think the useful Split-Apply-Combine primitive pattern looks like a wrapped function supplying the semantics:
data %>% split(column1) %>% lapply(arrange(column2) %>% f()) %>% bind_rows()
f()” is a user supplied function or
In my next article in this series I will further discuss the specialization of Split-Apply-Combine to this Grouped-Ordered-Apply pattern.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.