I would like to use such an example in Python as yet another introduction to the data algebra (an alternative to direct Pandas or direct SQL notation).
First let’s import our packages and set up our example data.
from typing import List
import pandas
from data_algebra.data_ops import *
import data_algebra.SQLite
# some example data
d1 = pandas.DataFrame({
'ID': [2, 3, 7, 7],
'OP': ['A', 'B', 'B', 'D'],
})
d2 = pandas.DataFrame({
'ID': [1, 1, 2, 3, 4, 2, 4, 4, 5, 5, 6],
'OP': ['A', 'B', 'A', 'D', 'C', 'A', 'D', 'B', 'A', 'B', 'B'],
})
keys = ['ID']
- Count how many rows in a given table have a given set of key values. This operation is called project.
- Join counts from one table to counts to another table. This operation is called natural_join.
- Replace missing counts, coming from keys present in one table and not another, with zero. This operation is called coalesce, and we perform this operation in an extend node.
First, we write code that implements the join step.
def merge_in_counts(
pipeline: ViewRepresentation,
id_cols: List[str],
new_table_descr: TableDescription):
return pipeline.natural_join(
b=new_table_descr
.project(
{f'count_{new_table_descr.table_name}': '(1).sum()'},
group_by=id_cols),
by=id_cols,
jointype='full')
ops = (
data(d1=d1)
.project({'count_d1': '(1).sum()'}, group_by=['ID'])
.use(merge_in_counts, ['ID'], data(d2=d2))
)
data(d1=d1)
step starts the pipeline with our d1
data frame. The .use()
step treats the first argument as if it was in fact a method with the argument that follow. This allows us to easily treat user code such as merge_in_counts()
as if they were class method extensions.
Now we insert the commands to clean up any count columns that may have picked up missing values. This is made easy as the operator pipeline tracks used and produced columns for us.
count_cols = [c for c in ops.column_names if c.startswith('count_')]
ops = (
ops
.extend({f'{c}': f'{c}.coalesce_0()' for c in count_cols})
.order_rows(['ID'])
)
print(ops)
( TableDescription(table_name="d1", column_names=["ID", "OP"]) .project({"count_d1": "(1).sum()"}, group_by=["ID"]) .natural_join( b=TableDescription(table_name="d2", column_names=["ID", "OP"]).project( {"count_d2": "(1).sum()"}, group_by=["ID"] ), by=["ID"], jointype="FULL", ) .extend({"count_d1": "count_d1.coalesce(0)", "count_d2": "count_d2.coalesce(0)"}) .order_rows(["ID"]) )
To execute the pipeline we simply call .ex()
. This executes all the operations on the data captured by the data()
statemetns.
ops.ex()
ID | count_d1 | count_d2 | |
---|---|---|---|
0 | 1 | 0.0 | 2.0 |
1 | 2 | 1.0 | 2.0 |
2 | 3 | 1.0 | 1.0 |
3 | 4 | 0.0 | 3.0 |
4 | 5 | 0.0 | 2.0 |
5 | 6 | 0.0 | 1.0 |
6 | 7 | 2.0 | 0.0 |
descr()
instead of data()
.
In all cases, to execute on new data we would call .eval()
, like so.
ops.eval({'d1': d1, 'd2': d2})
ID | count_d1 | count_d2 | |
---|---|---|---|
0 | 1 | 0.0 | 2.0 |
1 | 2 | 1.0 | 2.0 |
2 | 3 | 1.0 | 1.0 |
3 | 4 | 0.0 | 3.0 |
4 | 5 | 0.0 | 2.0 |
5 | 6 | 0.0 | 1.0 |
6 | 7 | 2.0 | 0.0 |
Data algebra pipelines also have built in SQL translators for many common SQL dialects (currently PostgreSQL, Google Big Query, MySQL, SQLite, and SparkSQL; but extension is easy).
db_handle = data_algebra.SQLite.example_handle()
db_handle.insert_table(d1, table_name='d1')
_ = db_handle.insert_table(d2, table_name='d2')
For example, we can use the data algebra generated query to create a new table int the database, with no additional round tripping.
db_handle.execute(f'CREATE TABLE result AS {db_handle.to_sql(ops)}')
db_handle.read_query('SELECT * FROM result ORDER BY ID LIMIT 3')
db_handle.to_sql(ops)
(actually we don’t need a live handle, the data algebra also includes explicit user accessible database models). In databases that have a full join, that is directly used. Adapting to divergent data dialects is one of the purposes of the data algebra. The Pandas interface itself is also parameterized to support replacement.
The design idea is: what if we took Codd’s relational algebra and left it as described in his 1970 article (instead of aggressively combining unrelated operations as SQL does). R users will see this as being related to dplyr, which is one of our reference systems and likely also owes ideas to Codd.
And that is the data algebra. It is a series of Codd relational data operators operating on themselves, and then appliable to data sources. The primary focus of the data algebra is composition of operations, with the intent of making decomposing problems easier. The data algebra is particularly useful in long term query maintenance, as modifying queries is easy given the emphasis on explicit meta-data and composition.
db_handle.close()
Categories: Coding Exciting Techniques Opinion Pragmatic Data Science Tutorials