I want to discuss a nice series of figures used to teach relational join semantics in R for Data Science by Garrett Grolemund and Hadley Wickham, O’Reilly 2016. Below is an example from their book illustrating an inner join:
Please read on for my discussion of this diagram and teaching joins.
In the above diagram two tables are laid out at angles, lines are extended from every row in each table, and a subset of the line intersections are marked as obeying the join condition and hence being in the result. It is a great diagram for discussing the meaning of joins. Being able to organize data transforms in terms of joins is a critical data science skill, so there is great value in being able to teach join theory.
I’ve been trying teaching joins as notional expansion followed by selection as shown in my recent Strata Spark Workshop (material developed in cooperation with Garrett Grolemend and others):
However, de-emphasizing the sequence of operations and the rejected join possibilities is an attractive alternative.
Deriving the diagram
The full or outer join operator is denoted as follows:
Many relational joins can be thought of as conditions (often denoted as “theta”) under an outer or full-join operator applied to appropriately augmented tables (adding “no-match” rows to each table before taking the cross product). The join symbol and theory are supposed to evoke the notion of a cross product, so teaching in terms of that seems sensible.
There are many methods of illustrating a set cross product including:
As a menu of possible combinations:
- As a complete bipartite graph:
- As a matrix or grid:
If you flip the grid to an angle where both sets of source nodes have equivalent roles then you are getting back to the diagram of Grolemund and Wickham:
The idea is that the very many pairs induced by the full cross product are illustrated, but they are decorations on the crossing lines. This makes it easy to believe these induced nodes are notional, and (as is the case with real databases) only the ones needed are actually produced.
I looked around for a short while for common SQL diagrams.
Venn diagrams are typically over-promoted as join mnemonics:
(Also discussed here.)
However there were some interesting illustrations trying both the grid and bipartite graph styles.
I like the idea of teaching all joins as filters (or theta-conditions) of the outer join. The Grolemund/Wickham diagrams are a good tool and have a style that reminds me of diagrammatic proofs of classic geometric theorems.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.