Introduction
I would like to talk about some of the design principles underlying the data_algebra
package (and also in its sibling rquery
package).
The data_algebra
package is a query generator that can act on either Pandas
data frames or on SQL
tables. This is discussed on the project site and the examples directory. In this note we will set up some technical terminology that will allow us to discuss some of the underlying design decisions. These are things that when they are done well, the user doesn’t have to think much about. Discussing such design decisions at length can obscure some of their charm, but we would like to point out some features here.
(Note: if there are any blogrendering problems the original sourceworkbook used to create this article can be found here.)
Background
We will introduce a few ideas before trying to synthesize our thesis.
The data_algebra
An introduction to the data_algebra
package can be found here. In this note we will discuss some of the inspirations for the package: Codd’s relational algebra, experience working with dplyr
at scale, sklearn Pipeline, and category theory.
sklearn Pipeline
A major influence on the data_algebra
design is sklearn.pipeline.Pipeline
. sklearn.pipeline.Pipeline
itself presumably become public with Edouard Duchesnay’s Jul 27, 2010 commit: “add pipeline”.
sklearn.pipeline.Pipeline
maintains a list of steps to be applied to data. What is interesting is the steps are not functions. Steps are instead objects that implement both a .transform()
and a .fit()
method.
.transform()
typically accepts a dataframe type structure and returns a modified version. Typical operations include adding a new derived column, selecting columns, selecting rows, and so on.
From a transformalone point of view the steps compose like functions. For list [s, t]
transform(x)
is defined to as:
transform([s, t], x) :=
t.transform(s.transform(x))
(the glyph “:=
” denoting “defined as”).
The fitperspective is where things get interesting. obj.fit(x)
changes the internal state of obj based on the value x
and returns a reference to obj
. I.e. it learns from the data and stores this result as a sideeffect in the object itself. In sklearn
it is common to assume a composite method called .fit_transform()
often defined as:
obj.fit_transform(x) := obj.fit(x).transform(x)
(though for our own
vtreat
package, this is deliberately not the case).
Using .fit_transform()
we can explain that in a sklearn Pipeline
.fit()
is naturally thought of as:
fit([s, t], x]) :=
t.fit(s.fit_transform(x))
My point is: sklearn.pipeline.Pipeline
generalizes function composition to something more powerful: the ability to both fit and to transform. sklearn.pipeline.Pipeline
is a natural way to store a sequence of parameterized or datadependent data transform steps (such as centering, scaling, missing value imputation, and much more).
This gives as a concrete example of where rigid mindset where “function composition is the only form of composition” would not miss design opportunities.
Fist class citizens
We are going to try to design our tools to be “first class citizens” in the sense of Strachey:
First and second class objects. In
ALGOL
, a real number may appear in an expression or be assigned to a variable, and either of them may appear as an actual parameter in a procedure call. A procedure, on the other hand, may only appear in another procedure call either as the operator (the most common case) or as one of the actual parameters. There are no other expressions involving procedures or whose results are procedures. Thus in a sense procedures inALGOL
are second class citizens—they always have to appear in person and can never be represented by a variable or expression (except in the case of a formal parameter)…
What we will draw out: is if our data transform steps are “first class citizens” we should expect to be able to store them in variables, compose them, examine them, and many other steps. A function that we can only use or even reuse is not giving us as much as we expect from other types. Or alternately, if functions don't give us everything we want, we may not want to use them as our only type or abstraction of data processing steps.
Composability
Most people first encounter the mathematical concept of “composability” in terms of functions. This can give the false impression that to work with composable design principles, one must shoehorn the object of interest to be functions or some sort of augmented functions.
This Procrustean view loses a lot of design opportunities.
In mathematics composability is directly studied by the field called “Category Theory.” So it makes sense to see if category theory may have ideas, notations, tools, and results that may be of use.
Category Theory
A lot of the benefit of category theory is lost if every time we try to apply category theory (or even just use some of the notation conventions) we attempt to explain all of category theory as a first step. So I will try to resist that urge here. I will introduce the bits I am going to use here.
Category theory routinely studies what are called “arrows.” When treated abstractly an arrow has two associated objects called the “domain” and “codomain.” The names are meant to be evocative of the “domain” (space of inputs) and “codomains” (space of outputs) from the theory of functions.
Functions are commonly defined as having:
 A domain, which is a set of values the function can be applied to.
 A codomain, which is a set of values the function evaluations are contained in (or range).

A fixed composition rule that if
domain(g)
is contained incodomain(f)
then:g . f
is defined as the function such that(g . f)(x) = g(f(x))
for all values in the domain of f. Mathematical functions are usually thought of as a specialization of binary relation, and considered to be uniquely determined by their evaluations (by the axiom of extensionality).
Packages that use function composition typically collect functions in lists and define operator composition either through lambdaabstraction or through list concatenation.
Category theory differs from function theory in that category theory talks about arrows instead of functions. The theory is careful to keep separate the following two concepts: what arrows are and how arrows are composed.
When using arrows to model a system we expect to be able to specify, with some extra degrees of freedom in specifying:
 How to choose domains and codomains, for categories these do not have to be sets.

How arrows compose. For arrows
a
andb
withcodomain(b) = domain(a)
then:a . b
denotes the composition in the category, and is itself a new arrow in the same category. Composition is not allowed (or defined) whencodomain(b) != domain(a)
.  How arrows act on items. Arrows can have multiple actions, and arrows can act on items that are not elements of their domains.
An action is a mapping from arrows and items to items. I.e. action(arrow, item) = new_item
. For categories the items may or may not be related to the domain and codomain. Not all categories have actions, but when they do have actions the action must be compatible with arrow composition.
Good general references on category theory, include:
 Steve Awodey, Category Theory, 2nd Edition, Oxford University Press; 2010.
 Emily Riehl, Category Theory in Context, Dover, 2016.
 Saunders Mac Lane, Categories for the Working Mathematician, 2nd Edition, Springer, 1978.
Functions have a very ready category theory interpretation as arrows. Given a function f
with domain A
and codomain B
, we can think of any triple (f, A', B')
as an arrow in a category of functions if A' contained in A
and B contained in B'
. In this formulation we define the arrow composition of
and (f, A', B')
(g, C', D')
as (f . g, C', B')
where f . g
is defined to be the function such that for all x
in domain x
we have:
(f . g)(x) := f(g(x))
We will call the application of a function to a value as an example of an “action.” A function f()
“acts on its domain” and f(x)
is the action of f
on x
. For functions we can define the action “apply
” as:
apply(f, x) := f(x)
The extra generalization power we get from moving away from functions to arbitrary arrows (that might not correspond to functions) comes from the following:
 Arrow composition does not have to be function composition.
 Arrows can have multiple actions, and act on items that are not elements of their domain.
 Arrows have a notion of equality, but this notion can differ from having identical actions.
To be a category a few conditions must be met, including: the composition must be associative and we must have some identity arrows. By “associative composition” we mean, it must be the case that for arrows a
, b
, and c
(with appropriate domains and codomains):
(a . b) . c = a . (b . c)
Our action must also associate with arrow composition. That is we must have for values x
we must have for covariant actions:
apply(a . b, x) = apply(a, apply(b, x))
Or for contravariant actions:
apply(a . b, x) = apply(b, apply(a, x))
The idea is: the arrow a . b
must have an action equal to the actions of a and b composed as functions. That is: arrow composition and actions can differ from function composition and function application, but they must be at least somewhat similar in that they remain associative.
We now have the background to see that category theory arrows differ from functions in that arrows are more general (we can pick more of their properties) and require a bit more explicit bookkeeping.
Back to sklearn.pipeline.Pipeline
We now have enough notation to attempt a crude category theory description of sklearn.pipeline.Pipeline
.
Define our sklearn.pipeline.Pipeline
category P
as follows:

We have only one object called
0
. All arrows will have domain and codomain equal to0
, i.e.: we are not doing any interesting precondition checking in this category. This sort of category is called a “monoid.” 
The arrows of our category are lists of steps.
Steps are againPython
objects that define.transform()
,.fit()
, and.fit_transform()
methods. 
Composition
a1 . a2
is defined as the list concatenation:a2 + a1
. “+
” beingPython
's list concatenate in this case, and the order set to matchsklearn.pipeline.Pipeline
list order convention. 
We define an action called “
transform_action
” defined as:
transform_action([step1, step2, ..., stepk], x) := stepk.transform(... step2.transform(step1.transform(x)) )
To see this is a category (and a category compatible action) we must check associativity of the composition (which in this case is list concatenation) and associativity of the action with respect to list concatenation.
We can also try to model the .fit_transform()
methods. We will not try to model the sideeffect that .fit_transform()
changes state of the arrows (to have the fit information in each step). But we can at least define an action (with side effects) as follows:

We define an action called “
fit_transform
” defined as:fit_transform_action([step1, step2, ..., stepk], x) := stepk.fit_transform(... step2.fit_transform(step1.fit_transform(x)) )
To confirm this is an action (ignoring the sideeffects), we would want check is if the following equality holds or not:
fit_transform_action(a . b, x) =
fit_transform_action(b, fit_transform_action(a, x))
The above should follow by brute pushing notation around (assuming we have defined fit_transform_action
correctly, and sufficiently characterized .fit_transform()
).
Notice we didn't directly define a “fit_action
” action, as it isn't obvious that has a not obvious that has a nice associative realization. This an opportunity for theory to drive design; notation considerations hint that fit_transform()
may be more fundamental than, and thus preferred over, fit()
.
The category theory concepts didn't somuch design sklearn.pipeline.Pipeline
, but give us a set of criteria to evaluate sklearn.pipeline.Pipeline
design. We trust the category theory point of view is useful as it emphasizes associativity (which is a great propriety to have), and is routinely found to be a set of choices that work in complicated systems. The feeling being: the design points category theory seems to suggest, turn out to be the ones you want down the round.
The data_algebra
Now that we have some terminology, lets get back to the data_algebra
What is the data_algebra
?

data_algebra
is a package for building up complex data manipulation queriesdata_algebra
queries are first class citizens in the Strachey sense (can be: passed as an argument, returned from a function, modified, assigned to a variable, printed, inspected, and traversed as a data structure).  The operators are essentially those of the Codd relational algebra (select rows/columns, join, unionall, extend, project, and window functions).
 Composition is left to right using method chaining.

Queries can be realized by transforming to
SQL
(targetingPostgeSQL
,Spark
, and other implementations), or as acting onPandas
data (we are hoping to extend this tomodin
,RAPIDS
, and others).  The
data_algebra
has anR
sibling package group
(rquery
/rqdatatable
) similar todplyr
.
An introduction to the data_algebra
can be found here.
We now have the terminology to concisely state a data_algebra
design principle: use general concepts (such as category theory notation) to try and ensure data_algebra
transforms are first class citizens (i.e. we can do a lot with them and to them).
The naive functional view
If we were to again take a mere functional view of the data_algebra
we would say the data_algebra
is a set of functions that operate on data. They translate data frames to new data frames using Coddinspired operations. We could think of the data_algebra
as acting on data on the right, and acting on data_algebra
operators on the left.
However, this is not the right abstraction. data_algebra
methods primarily map data transforms to data transforms. However even this is a “too functional view”. It makes sense to think of data_algebra
operators as arrows, and the whole point of arrows is composition.
The categorical view
The data_algebra
can be mapped to a nice category. The idea being something that can be easily mapped to an orderly system, is it self likely an somewhat orderly system.
Good references on the application of category theory to concrete systems (including databases) include:
 David I. Spivak, Category Theory for the Sciences; The MIT Press, 2014.
 Brendan Fong, David I. Spivak, An Invitation to Applied Category Theory: Seven Sketches in Compositionality; Cambridge University Press, 2019.
Our data_algebra
category D
is defined as follows.
 The objects of our category are single table schemas. By “single table schema” mean mean only the list of column names (and optionally column types) for a set of named tables. We are not modeling invariants, or crosstable relations.

The arrows of our category are
data_algebra
operator chains.  Composition of arrows in our category is a very general query composition. We will demonstrate query composition in a bit, but as a hint it is not function composition or list concatination.
Some notes on the category theory interpretation of the data_algebra
package can be found here.
Let's demonstrate the above with Python
code. The data_algebra
allows for the specification of data transforms as first class objects.
First we import some modules and create some example data.
from data_algebra.data_ops import *
import pandas
d = pandas.DataFrame({
'x': [1, 2, 3],
'y': [3, 4, 4],
})
d
x  y  

0  1  3 
1  2  4 
2  3  4 
To specify adding a new derived column z
we would write code such as the following.
td = describe_table(d)
a = td.extend(
{ 'z': 'x.mean()' },
partition_by=['y']
)
a
TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y'])
We can let this transform act on data.
a.transform(d)
x  y  z  

0  1  3  1.0 
1  2  4  2.5 
2  3  4  2.5 
We can compose this transform with more operations to create a composite transform.
b = a.extend({
'ratio': 'y / x'
})
b
TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'})
As a bonus we can also map the above transform to a SQL
query representing the same action in databases.
from data_algebra.SQLite import SQLiteModel
print(
b.to_sql(db_model=SQLiteModel(), pretty=True)
)
SELECT "x",
"y",
"z",
"y" / "x" AS "ratio"
FROM
(SELECT "x",
"y",
avg("x") OVER (PARTITION BY "y") AS "z"
FROM ("data_frame") "SQ_0") "SQ_1"
All of this is the convenient interface we expect users will want. However, if we asked that all operators specified their expected input schema (or their domain) we have the category D
. We don't expect users to do such, but we have code supporting this style of notation to show that the data_algebra
is in fact related to a nice category over schemas.
Lets rewrite the above queries as formal category arrows.
from data_algebra.arrow import *
a1 = DataOpArrow(a)
print(str(a1))
[ 'data_frame': [ x: <class 'numpy.int64'>, y: <class 'numpy.int64'> ] > [ x, y, z ] ]
The above is rendering the arrow as just its domain and codomain. The domain and codomains are just singletable schemas: lists of column names (possibly with column types).
We can get a more detailed representation of the arrow as follows.
print(a1.__repr__())
DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']), free_table_key='data_frame')
Or we can examine the domain and codomain directly. Here we are using a common category theory trick: associating the object with the identity arrow of the object. So what we are showing as domain and codomains are actually identity arrows instead of objects.
a1.dom()
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']), free_table_key='')
a1.cod()
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']), free_table_key='')
Now we can write our second transform step as an arrow as follows.
a2 = DataOpArrow(a1.cod_as_table().extend({
'ratio': 'y / x'
}))
a2
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y', 'z']) .
extend({ 'ratio': 'y / x'}), free_table_key='')
We took extra steps, that most users will not want to take, to wrap the secondstage (a2
) operations as an arrow. Being an arrow means that we have a domain and codomain that can be used to check if operations are composable.
A typical user would not work with arrow, but instead work with the data algebra which itself is a shorthand for the arrows. That is: the users may want the power of a category, but they don't want to be the one handling the extra bookkeeping. To add an extra operation a user would work directly with a
and just write the following.
a.extend({
'ratio': 'y / x'
})
TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'})
The above has substantial precondition checking and optimizations (as it is merely user facing shorthand for the arrows).
The more cumbersome arrow notation (that requires the specification of preconditions) has a payoff: managed arrow composition. That is: complex operator pipelines can be directly combined. We are not limited to extending one operation at a time.
If the codomain of arrow matches the domain of another arrow we can compose them left to right as follows.
a1.cod() == a2.dom()
True
composite = a1 >> a2
composite
DataOpArrow( TableDescription( table_name='data_frame', column_names=[ 'x', 'y']) .
extend({ 'z': 'x.mean()'}, partition_by=['y']) .
extend({ 'ratio': 'y / x'}), free_table_key='data_frame')
And when this isn't the case, composition is not allowed. This is exactly what we want as this means the preconditions (exactly which columns are present) for the second arrow are not supplied by the first arrow.
a2.cod() == a1.dom()
False
try:
a2 >> a1
except ValueError as e:
print("Caught: " + str(e))
Caught: extra incoming columns: {'ratio', 'z'}
An important point is: for this arrow notation composition is not mere list concatenation or function composition. Here is an example that makes this clear.
b1 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None).
extend({
'x': 'x + 1',
'y': 7
}))
b1
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '7'}), free_table_key='')
b2 = DataOpArrow(TableDescription(column_names=['x', 'y'], table_name=None).
extend({
'y': 9
}))
Now watch what happens when we use “>>
” to compose the arrow b1
and b2
.
b1 >> b2
DataOpArrow( TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '9'}), free_table_key='')
Notice in this special case the composition of b1
and b2
is a single extend node combining the operations and eliminating the deadvalue 7
. The idea is: the package has some freedom to define composition as long as it is associative. In this case we have an optimization at the compose step so the composition is not list concatenation or function composition.
As we have said, a typical user will not take the time to establish preconditions on steps. So they are not so much working with arrows but with operators that can be specialized to arrows. An actual user might build up the above pipeline as follows.
TableDescription(column_names=['x', 'y'], table_name=None).
extend({
'x': 'x + 1',
'y': 7
}).
extend({
'y': 9
})
TableDescription( table_name='', column_names=[ 'x', 'y']) .
extend({ 'x': 'x + 1', 'y': '9'})
We recently demonstrated this sort of optimization in the R
rquery
package.
In the above example the user still benefits from the category theory design. As they composed left to right the system was able to add in the preconditions for them. The user only needs to set preconditions for nontrivial righthand side pipelines.
Conclusion
The advantage the data_algebra
package gets from category theory is: it lets us design the package action (how the package works on data) somewhat independently from operator composition. This gives us a lot more design room and power than a strict function composition or list concatenation theory would give us.
Categories: data science Tutorials
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
I have some more notes on treating transform pipelines as arrows (including those that involve multiple table, by marking one table as the distinguished interface or a “pointed” strategy) here and here.