I would like to write a bit on the meaning and history of the phrase “tidy data.”
Hadley Wickham has been promoting the term “tidy data.” For example, in an eponymous paper he wrote:
In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
Wickham, Hadley. “Tidy Data.” Journal of Statistical Software, Vol 59, 2014.
Let’s try to apply this definition to the following data set from Wikipedia:
| Tournament | Year | Winner | Winner Date of Birth |
|---|---|---|---|
| Indiana Invitational | 1998 | Al Fredrickson | 21 July 1975 |
| Cleveland Open | 1999 | Bob Albertson | 28 September 1968 |
| Des Moines Masters | 1999 | Al Fredrickson | 21 July 1975 |
| Indiana Invitational | 1999 | Chip Masterson | 14 March 1977 |
This would seem to be a nice “ready to analyze” data set. Rows are keyed by tournament and year, and each row carries additional key-derived observations: the winner’s name and the winner’s date of birth. From such a data set we could look for repeated winners and look at the ages of winners, as sketched below.
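For instance, here is a minimal sketch in R (re-typing the table above; the column names are my own, not canonical):

```r
# The Wikipedia table above, re-typed as a data frame.
d <- data.frame(
  Tournament = c("Indiana Invitational", "Cleveland Open",
                 "Des Moines Masters", "Indiana Invitational"),
  Year = c(1998, 1999, 1999, 1999),
  Winner = c("Al Fredrickson", "Bob Albertson",
             "Al Fredrickson", "Chip Masterson"),
  DateOfBirth = as.Date(c("1975-07-21", "1968-09-28",
                          "1975-07-21", "1977-03-14"))
)

# Repeated winners: winners appearing in more than one row.
wins <- table(d$Winner)
print(wins[wins > 1])

# Approximate age at time of win (tournament year minus birth year).
d$AgeAtWin <- d$Year - as.numeric(format(d$DateOfBirth, "%Y"))
print(d[, c("Winner", "Year", "AgeAtWin")])
```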
A question is: is such a data set “tidy”? The paper itself claims the above definition is “Codd’s 3rd normal form.” So, no, the above table is not “tidy” under that paper’s definition. The winner’s date of birth is a fact about the winner alone, not a fact about the joint row keys (the tournament plus year), as required by the rules of Codd’s 3rd normal form. The critique is: this data presentation does not express the intended data invariant that Al Fredrickson must have the same “Winner Date of Birth” in all rows.
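To make the critique concrete, here is a hedged sketch (table and column names are mine) of the 3rd-normal-form repair: the winner-only fact moves to its own table, and the analysis-ready presentation is rebuilt by a join.

```r
# Results table: facts keyed by tournament plus year.
results <- data.frame(
  Tournament = c("Indiana Invitational", "Cleveland Open",
                 "Des Moines Masters", "Indiana Invitational"),
  Year = c(1998, 1999, 1999, 1999),
  Winner = c("Al Fredrickson", "Bob Albertson",
             "Al Fredrickson", "Chip Masterson")
)

# Winners table: facts about the winner alone. Each date of birth is
# stored exactly once, so "one birth date per winner" is structural.
winners <- data.frame(
  Winner = c("Al Fredrickson", "Bob Albertson", "Chip Masterson"),
  DateOfBirth = as.Date(c("1975-07-21", "1968-09-28", "1977-03-14"))
)

# The original "ready to analyze" presentation is recovered with a join.
print(merge(results, winners, by = "Winner"))
```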
Around January of 2017 Hadley Wickham apparently retconned the “tidy data” definition to be:
Tidy data is data where:
- Each variable is in a column.
- Each observation is a row.
- Each value is a cell.
Notice point 3 is now something possibly more related to Codd’s guaranteed access rule, and now the example table is plausibly “tidy.”
The above concept was already well known in statistics and called a “data matrix.” For example:
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual.
Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.
One must understand that in statistics, “individual” often refers to observations, not people.
The above reference clearly considers “data matrix” to be a noun phrase already in common use in statistics. It is in the book’s index, and often used in phrases such as:
Suppose X is an n × p data matrix …
Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 75.
So statistics not only already had the data organization concept, it already had standard terminology around it. Data engineering often calls this data organization a “de-normalized form.”
As a further example, the statistical system R itself uses variations of the above standard terminology. Take, for instance, the `help()` text from R’s `data.matrix()` method:
```
data.matrix {base}                                   R Documentation

Convert a Data Frame to a Numeric Matrix

Description

Return the matrix obtained by converting all the variables in a data
frame to numeric mode and then binding them together as the columns of
a matrix. Factors and ordered factors are replaced by their internal
codes.
```
What is the extra “Factors and ordered factors are replaced by their internal codes” part going on about? That is also fairly standard; let’s expand the earlier data matrix quote a bit to see this.
A standard method of displaying a multivariate set of data is in the form of a data matrix in which rows correspond to sample individuals and columns to variables, so that the entry in the ith row and jth column gives the value of the jth variate as measured or observed on the ith individual. When presenting data in this form, it is customary to assign a numerical code to the categories of a qualitative variable …
Krzanowski, W. J., F. H. C. Marriott, Multivariate Analysis Part 1, Edward Arnold, 1994, page 1.
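For example, a small sketch (the data are made up) showing `data.matrix()` replacing a factor column with its internal integer codes:

```r
d <- data.frame(
  Tournament = factor(c("Indiana Invitational", "Cleveland Open")),
  Year = c(1998, 1999)
)

# The factor column is replaced by its internal codes
# (levels sort alphabetically, so "Cleveland Open" codes to 1).
print(data.matrix(d))
```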
Note: for many R analyses the `model.matrix()` command is implicitly called in preference to the `data.matrix()` command, as this conversion expands factors into “dummy variables,” a representation often more useful for modeling. The `model.matrix()` documentation starts as follows:
```
model.matrix {stats}                                 R Documentation

Construct Design Matrices

Description

model.matrix creates a design (or model) matrix, e.g., by expanding
factors to a set of dummy variables (depending on the contrasts) and
expanding interactions similarly.
```
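Again a small sketch (made-up data): `model.matrix()` expands a factor into indicator (“dummy”) columns relative to a reference level.

```r
d <- data.frame(
  Winner = factor(c("Al Fredrickson", "Bob Albertson", "Chip Masterson"))
)

# One intercept column plus an indicator column for each non-reference
# level (the first level, "Al Fredrickson", is absorbed into the intercept).
print(model.matrix(~ Winner, data = d))
```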
So, to summarize: this whole time we have been talking about well-understood concepts for organizing data for analysis, concepts that have a long history.
Frankly, it appears “tidy data” is something akin to a trademark or marketing term, especially in its “tidyverse” variation.
Of course, “tidy” is not a dogma. Neither is the Wikipedia data set: e.g., why not also split the date into YYYY, MM, DD, or keep it in the date format YYYYMMDD[hhmmss[n..n][TZ]] and use this as a standard time?
The Wikipedia data is exactly the HTML copied from the Wikipedia table; I haven’t processed the date in any way.
I always find Hadley somewhat… conflicted. On the one hand, he wants ‘rigour’ in the data, but on the other hand, he wants the ‘row’ to be the classic ‘observation’, i.e., what we RDBMS folk call a ‘flat file image’. There is no way to reconcile the two notions. Yes, one can, and should, store data in xNF, but until someone writes stat packs inside SQL databases, IOW automagically doing the joins, RDBMS storage will be NF while stat pack data will be flat (it has to be with current and legacy stat packs).
It would be more useful if Hadley et al. would be more forceful in directing stats folks to do data storage in SQL/RDBMS, using things like PL/R possibly, rather than creating yet another impedance barrier with SQL-veiling syntax in R.
The way I say it is: analysis likes a denormalized image (all facts ready in one row). But your systems of record should not use this, so treat this format as something transitory you build from other sources. The analogy falls apart a bit when collecting data from multiple times in one row: which table is “ready” depends on whether you think of time as a key or not (it often is for the data, but not for the analysis).
As far as syntax goes: SQL is universal, but it can be daunting. Even Codd thought there may be multiple query languages, merely insisting the query language must be strong enough to perform all tasks (including DB management).