# Your Data is Never the Right Shape

One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have the tools to reshape your data.

There is no final “right shape.” In fact, even your data is never right. You will always be called on to re-do your analysis (new variables, new data, corrections), so you should always consider yourself to be on your “penultimate analysis” (there is always one more to come). This is why we insist on using general methods and scripted techniques, as these methods are much easier to reliably reapply to new data than GUI/WYSIWYG techniques.

In this article we will work a small example and call out some R tools that make reshaping your data much easier. The idea is to think in terms of “relational algebra” (as in SQL) and transform your data toward your tools (rather than attempting to adapt your tools to the data in an ad hoc manner).

Take a simple example where you are designing a new score called “`score2`” to predict or track an already known value called “`score1`.” The typical situation is that `score1` is a future outcome (such as the number of dollars profit on a transaction) and `score2` is a prediction (such as the estimated profit before the transaction is attempted). Training data is usually assembled by performing a large number of transactions, recording what was known before each transaction, and then aligning or joining this data with the measured results once they become available. For this example we are not interested in the inputs driving the model (a rare situation, but we are trying to keep the example as simple as possible), only in examining the quality of `score2` (which is defined as how well it tracks `score1`).

All of this example will be in R, but the principles apply more generally. First let us enter some example data:

```
> d <- data.frame(id=c(1,2,3,1,2,3),
+                 score1=c(17,5,6,10,13,7),
+                 score2=c(13,10,5,13,10,5))
> d
```

This gives us our example data. Each row is numbered (1 through 6) and has an `id` and both of our scores:

```  id score1 score2
1  1     17     13
2  2      5     10
3  3      6      5
4  1     10     13
5  2     13     10
6  3      7      5
```

We said our only task was to characterize how well `score2` works at predicting `score1` (or how good a substitute `score2` is for `score1`). We could compute correlation, RMS error, info-gain or some such. But instead let’s look at this graphically. We will prepare a graph showing how well `score1` is represented by `score2`. For this we choose to place `score1` on the y-axis (as it is the outcome) and `score2` on the x-axis (as it is the driver).
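For reference, the quantitative summaries just mentioned are one-liners in R (a sketch of our own, re-entering the example data so the snippet is self-contained):

```r
# example data from above
d <- data.frame(id=c(1,2,3,1,2,3),
                score1=c(17,5,6,10,13,7),
                score2=c(13,10,5,13,10,5))
# linear correlation between the prediction and the outcome
cor(d$score1, d$score2)
# root-mean-square error of score2 as a predictor of score1
sqrt(mean((d$score1 - d$score2)^2))
```

Single numbers like these are easy to report, but (as the article argues) a graph shows structure the summaries hide.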

```
> library(ggplot2)
> ggplot(d) + geom_point(aes(x=score2,y=score1))
```

Figure 1: `score1` as a function of `score2`.

This does not look good. We would have liked to see all of the dots falling on the line “y=x.” This plot shows `score2` is not predicting `score1` very well. Part of the reason is that we missed an important feature of the data (and because we missed it, the feature becomes a problem): the `id`s repeat. First we re-order by `id` to make this more obvious.

```
> dsort <- d[order(d$id),]
> dsort
```

```  id score1 score2
1  1     17     13
4  1     10     13
2  2      5     10
5  2     13     10
3  3      6      5
6  3      7      5
```

This is a very common situation. The original score is not completely a function of the known inputs. We are using “`id`” to abstractly represent all of the inputs: two rows in our example have the same `id` if and only if all known inputs are exactly the same. The repeating `id`s are the same experiment run at different times (a good idea), and the variation in `score1` could be the effect of an un-modeled input that changed value, or something simple like a “noise term” (a random un-modeled effect).

Notice that `score2` is behaving as a function of `id`: all rows with the same `id` have the same value for `score2`. If `score2` is a model then it has to be a function of the inputs (or, more precisely, if it is not a function of the inputs you have done something wrong). So any variation of `score1` between rows with identical `id` is “unexplainable variation” (unexplainable from the point of view of the currently tracked inputs). You should know about, characterize and report this variation (which is why it is good to have some repeated experiments). But this variation is not the model’s fault; if we want to know how good a job we did constructing the model (which we now see can be a slightly different question than how well the model works at prediction) we need to see how much of the explainable variation the model accounts for.

If we assume (as is traditional) the unexplained variation comes from an “unbiased noise source,” then we can lessen the impact of the noise by replacing `score1` with a value averaged over rows with the same `id`. This assumption is traditional because an unbiased noise source is present in many problems, and assuming anything more requires more research into the problem domain. You would eventually fold such research into your model, so your goal is always to have all effects or biases in your model and to hope that what is left over is unbiased. This is usually not strictly true, but not accounting for the unexplained variation at all is in many cases even worse than modeling the unexplained variation as being bias-free.
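To see why averaging helps under this assumption, here is a quick simulation (ours, not part of the original analysis): averaging k independent unbiased noise draws shrinks the noise’s standard deviation by a factor of sqrt(k).

```r
set.seed(5)
# 10000 simulated experiments, each repeated twice with unbiased noise of sd 3
noise <- matrix(rnorm(10000 * 2, mean=0, sd=3), ncol=2)
sd(noise[, 1])        # sd of a single measurement: near 3
sd(rowMeans(noise))   # sd of the two-run average: near 3/sqrt(2), about 2.12
```

With only two repetitions per `id` (as in our example) the averaging already removes a useful fraction of the noise.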

And now we find our data is the “wrong shape.” To replace `score1` with the appropriate averages we need to do some significant data manipulation. We need to group sets of rows and add new columns. We could do this imperatively (write some loops and design some variables to track and manipulate state) or declaratively (find a path of operations from what we have to what we need through R’s data manipulation algebra). Even though the declarative method is more work the first time (you can often write the imperative code in less time than it takes to read the manuals), it is the right way to go (as it is more versatile and powerful in the end).

Luckily we don’t have to use raw R. There are a number of remarkable packages (all by Hadley Wickham, who is also the author of the ggplot2 package we use to prepare our figures) that really improve R’s ability to coherently manage data. The easiest (on us) way to fix up our data is to make the computer work hard and use the powerful melt/cast technique. These functions are found in the reshape and plyr libraries (which were automatically loaded when we loaded the ggplot2 library).

melt is a bit abstract. What it does is convert your data into a “narrow” format where each row is split into many rows, each carrying just one result column of the original row. For example we can melt our data by `id` as follows:

```
> dmelt <- melt(d,id.vars=c('id'))
> dmelt
```

Which yields the following:

```   id variable value
1   1   score1    17
2   1   score1    10
3   2   score1     5
4   2   score1    13
5   3   score1     6
6   3   score1     7
7   1   score2    13
8   1   score2    13
9   2   score2    10
10  2   score2    10
11  3   score2     5
12  3   score2     5
```

Each of the two facts (`score1`, `score2`) from our original row is split into its own row. The `id` column plus the new variable column are now considered to be keys. This format is not used directly but used because it is easy to express important data transformations in terms of it. For instance we wanted our table to have duplicate rows collected and `score1` replaced by its average (to attempt to remove the unexplainable variation). That is now easy:

```
> dmean <- cast(dmelt,fun.aggregate=mean)
> dmean
```

```  id score1 score2
1  1   13.5     13
2  2    9.0     10
3  3    6.5      5
```

We used `cast()` in its default mode, where it assumes all columns other than “variable” and “value” form the keyset (and “variable” supplies the new column names). It then collects all rows with identical keying and combines them back into wide rows, using the mean as the function to deal with duplicates. Notice `score1` is now the desired average, and `score2` is as before (as it was a function of the keys or inputs, it is not affected by averaging). With this new smaller data set we can re-try our original graph:
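If you want to confirm this aggregation without the reshape package, base R’s `aggregate()` computes the same per-`id` means (a cross-check of our own, not the melt/cast technique the article is demonstrating):

```r
# example data from above
d <- data.frame(id=c(1,2,3,1,2,3),
                score1=c(17,5,6,10,13,7),
                score2=c(13,10,5,13,10,5))
# group the score columns by id and take the mean of each group
dmean2 <- aggregate(d[, c('score1','score2')], by=list(id=d$id), FUN=mean)
dmean2  # id 1: 13.5 13; id 2: 9.0 10; id 3: 6.5 5 (matching the cast() result)
```

The melt/cast route pays off when you have many value columns or more complicated key structures; for a single grouped mean either works.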

```
> ggplot(dmean) + geom_point(aes(x=score2,y=score1))
```

Figure 2: `mean(score1)` as a function of `score2`.

This doesn’t look so bad. A lot of the error or variation in the first plot was unexplainable variation. `score2` isn’t bad given its inputs. If you wanted to do better than `score2` you would be advised to find more modeling inputs (versus trying more exotic modeling techniques).

Of course a client or user is not interested in whether `score2` is the “best possible.” They want to know if it is any good. To do this we should show them (either by graph or by quantitative summary statistics like those we mentioned earlier) at least 3 things:

1. How well the model predicts overall (the very first graph we presented).
2. How much of the explainable variation the model predicts (the second graph).
3. The nature of the unexplained variation (which we will explore next).

We said earlier we are hoping the unexplained variation is noise (or, if it is not noise, that it is at least a clue to new important modeling features). So the unexplained variation must not go unexamined. We will finish by showing how to characterize the unexplained variation. As before we will just make a graph, but the data preparation steps would be exactly the same if we were using a quantitative summary (like correlation, or any other). And, of course, our data is still not the right shape for this step. Luckily there is another tool ready to fix this: `join()`.

```
> djoin <- join(dsort,dsort,'id')
> fixnames <- function(cn) {
+   n <- length(cn)
+   for(i in 2:((n+1)/2)) { cn[i] <- paste('a',cn[i],sep='') }
+   for(i in ((n+3)/2):n) { cn[i] <- paste('b',cn[i],sep='') }
+   cn
+ }
> colnames(djoin) <- fixnames(colnames(djoin))
> djoin
```

which produces:

```   id ascore1 ascore2 bscore1 bscore2
1   1      17      13      17      13
2   1      17      13      10      13
3   1      10      13      17      13
4   1      10      13      10      13
5   2       5      10       5      10
6   2       5      10      13      10
7   2      13      10       5      10
8   2      13      10      13      10
9   3       6       5       6       5
10  3       6       5       7       5
11  3       7       5       6       5
12  3       7       5       7       5
```

All of the work was done by the single line “`djoin <- join(dsort,dsort,'id')`”; the rest was just fixing the column names (as self-join is not the central use case of `join()`). What we have now is a table that is exactly right for studying unexplained variation. For each `id` we have every pair of rows with that `id` matched. This blows every `id` up from having 2 rows in `dsort` to 4 rows in `djoin`. Notice this gives us every pair of `score1` values seen for the same `id` (which will let us examine unexplained variation), and `score2` is still constant over all rows with the same `id` (as it has been throughout our analysis). With this table we can now plot how `score1` varies for rows with the same `id`:

```
> ggplot(djoin) + geom_point(aes(x=ascore1,y=bscore1))
```

Figure 3: `score1` as a function of `score1`.

And we can see, as we expected, the unexplained variation in `score1` is about as large as the mismatch between `score1` and `score2` in our original plot. The important thing is this is all about `score1` (`score2` is now literally out of the picture). The analyst’s job would now be to try and tie bits of the unexplained variation to new inputs (that can be folded into a new `score2`) and/or characterize the noise term (so the customer knows how close they should expect repeated experiments to be).
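One way to start characterizing the noise term quantitatively (a sketch of ours, under the unbiased-noise assumption): the residuals of `score1` around its per-`id` mean are exactly the unexplained variation, and a pooled standard deviation summarizes their scale.

```r
# example data from above
d <- data.frame(id=c(1,2,3,1,2,3),
                score1=c(17,5,6,10,13,7),
                score2=c(13,10,5,13,10,5))
# residuals of score1 around the per-id mean: the unexplained variation
resid <- d$score1 - ave(d$score1, d$id)
# pooled noise sd estimate: divide by (rows minus number of groups)
pooled_sd <- sqrt(sum(resid^2) / (nrow(d) - length(unique(d$id))))
pooled_sd
```

A number like this tells the customer how close they should expect repeated experiments to come, even before any new modeling inputs are found.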

What we are trying to encourage with the use of “big hammer tools” is an ability and willingness to look at and transform your data in meaningful steps. It often seems easier and more efficient to build one more piece of data tubing, but a lot of data tubes soon become an unmanageable collection of spaghetti code. The analyst should, in some sense, always be looking at data and not at coding details. For these sorts of analyses we encourage analysts to think in terms of “data shape” and transforms. This discipline leaves more of the analyst’s energy and attention free to think productively about the data and the actual problem domain.

Note:

For the third plot showing the variation of `score1` across different rows (with the same `id`s) it may be appropriate to use a slightly more complicated `join()` procedure than we showed. The join shown produced rows of artificial agreement where both values of `score1` came from the same row (and thus had no chance of being different, so in some sense deserve no credit). This is also the only way any non-duplicated rows could make it into the plot. To eliminate these uninteresting agreements from the plot do the following:

```
> d$rowNumber <- 1:nrow(d)
> djoin <- join(d,d,'id')
> colnames(djoin) <- fixnames(colnames(djoin))
> djoin <- djoin[djoin$arowNumber!=djoin$browNumber,]
> djoin
```

This gives us a table that shows only values of `score1` from different rows:

```   id ascore1 ascore2 arowNumber bscore1 bscore2 browNumber
2   1      17      13          1      10      13          4
4   2       5      10          2      13      10          5
6   3       6       5          3       7       5          6
7   1      10      13          4      17      13          1
9   2      13      10          5       5      10          2
11  3       7       5          6       6       5          3
```

This version only places points on the diagonal if “you have really earned them.” So while the direct `join()` may not be the immediate perfect answer, it is still a good intermediate to form, as what you want is only a simple data transformation away from it.

### jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

### 2 replies

1. Scott Locklin says:

I’ve never managed to get the hang of plyr, probably because I mostly am looking at time series. I can see the utility … but would be afraid of screwing it up without working through an actual tutorial. Know any good ones offhand?

2. jmount says:

@Scott Locklin Scott, I haven’t had good luck with plyr either. Hadley’s article is interesting and plyr is being used here for the `join()` step, but really in this article reshape is doing all of the heavy lifting.