An example of this is wanting to know how many reservations for a San Francisco Symphony concert scheduled for December 4th, 2022 are known to have been made by October 22nd, 2022. This could be used as part of an attendance demand model that is evaluated on October 22nd, 2022. The “fifty-cent word” for this is “bitemporal” modeling or data.
Let’s start off with an example in Python (this entire article can be found as a Jupyter notebook here).
# import our packages
import pandas as pd
import sqlite3
import data_algebra.SQLite
from data_algebra.data_ops import *
# define our example log data
d_actions = pd.DataFrame({
    "target_date": [11, 10, 11, 11],
    "action_date": [7, 9, 9, 10],
    "reservation_count": [3, 1, 2, 5],
})
d_actions
| | target_date | action_date | reservation_count |
|---|---|---|---|
| 0 | 11 | 7 | 3 |
| 1 | 10 | 9 | 1 |
| 2 | 11 | 9 | 2 |
| 3 | 11 | 10 | 5 |
Our second table is a specification of what pairs of dates we want observations for. For example, we might want to know how many reservations were made for dates 10 and 11 as of one or two days before these target dates. We will keep the date the reservation is for as “target_date”, and we will call the date we have information up to the “as_of_date”. Bitemporal data can become hard to reason about if one is not very explicit in naming and defining the multiple date keys. There is something uniquely confusing about reasoning about multiple dates or times all at once.
Our example desired observation points can be built up as follows using the data algebra (GitHub).
# build our observation plan from the specification
target_dates = range(10, 12)
deltas = range(-2, 0)
d_observations = (
    data(target_dates=pd.DataFrame({"target_date": target_dates}))
    .natural_join(
        data(date_lags=pd.DataFrame({"delta": deltas})),
        on=[],
        jointype="cross"
    )
    .extend({"as_of_date": "target_date + delta"})
    .drop_columns(["delta"])
    .order_rows(["target_date", "as_of_date"])
    .ex()
)
d_observations
| | target_date | as_of_date |
|---|---|---|
| 0 | 10 | 8 |
| 1 | 10 | 9 |
| 2 | 11 | 9 |
| 3 | 11 | 10 |
A great way to assemble our desired result is to use SQL.
sql_query = """
SELECT
o.target_date,
o.as_of_date,
SUM(COALESCE(d.reservation_count, 0)) AS reservation_count
FROM
d_observations o
LEFT JOIN
d_actions d
ON
o.target_date = d.target_date
AND d.action_date <= o.as_of_date
GROUP BY
o.as_of_date,
o.target_date
ORDER BY
o.as_of_date,
o.target_date
"""
# run the SQL query against copies of our tables in an in-memory SQLite database
with sqlite3.connect(":memory:") as conn:
    d_observations.to_sql("d_observations", conn, index=False)
    d_actions.to_sql("d_actions", conn, index=False)
    res_sql = data_algebra.default_data_model.pd.read_sql_query(sql_query, conn)
res_sql
| | target_date | as_of_date | reservation_count |
|---|---|---|---|
| 0 | 10 | 8 | 0 |
| 1 | 10 | 9 | 1 |
| 2 | 11 | 9 | 5 |
| 3 | 11 | 10 | 10 |
However, let’s return to assembling the above result. The SQL succinctly combined a number of features: replacing missing observations with zero, joining observation keys to action keys, and aggregation. At first glance this might seem difficult to translate into a data processing system based more closely on Codd’s Relational Algebra, such as Pandas, dplyr, or the data algebra. The issue is that these systems have operations corresponding to Codd’s basic relational steps: projection, selection, and joining; while SQL has a very powerful composite SELECT notation that can combine many of these steps into one specification.
Of course we have Codd’s theorem on the equivalence of expressive power of the various data calculi. However, one may still be left with worries about efficiency. One may also worry over the size of intermediate materialized tables, as a naive translation of the SQL query into relational steps may involve an “under-keyed join,” which can lead to an explicit large intermediate result prior to filtering and aggregation.
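To make that worry concrete, here is one such naive translation, sketched in Pandas (this block is my illustration, not part of the article’s pipeline; the names joined and res_naive are invented for the example). The intermediate joined table is keyed only by target_date, so for each target date its row count is the product of the observation and action row counts, before the masking and aggregation shrink it back down.

# naive translation sketch: an under-keyed join on target_date alone
joined = d_observations.merge(d_actions, on="target_date", how="left")
# zero-out action counts recorded after the as_of_date (the COALESCE-style default)
joined["reservation_count"] = joined["reservation_count"].where(
    joined["action_date"] <= joined["as_of_date"], 0
)
# aggregate back down to one row per observation key
res_naive = (
    joined
    .groupby(["target_date", "as_of_date"], as_index=False)
    .agg({"reservation_count": "sum"})
)
# res_naive agrees with res_sql computed above

On this tiny example the blow-up is harmless, but on real data (many observation dates crossed with many actions per target) the under-keyed intermediate can dominate memory. That is the motivation for the more careful relational plan below.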
Let’s study a non-trivial translation of the problem into relational steps. In some sense we are forced to over-specify the solution as we move away from SQL’s declarative nature to specifying the exact relational operators in an exact order (which is usually the job of a query planner). Or we may benefit from choosing a translation well suited to our data. In either case, my proposed data algebra relational solution is as follows.
ops = (
    # 1) concatenate observation points in as count-0 actions, so every
    #    observation key is present in the new actions table
    descr(d_observations=d_observations)
    .project(
        {},
        group_by=["target_date", "as_of_date"])
    .extend({"reservation_count": 0})
    .map_columns({"as_of_date": "action_date"})
    .concat_rows(descr(d_actions=d_actions))
    # 2) aggregate actions to eliminate any duplicate keys we may have introduced
    .project(
        {"reservation_count": "reservation_count.sum()"},
        group_by=["target_date", "action_date"])
    # 3) use a cumulative sum to get reservations known up to a given date,
    #    instead of exactly on a given date
    .extend(
        {"reservation_count": "reservation_count.cumsum()"},
        partition_by=["target_date"],
        order_by=["action_date"]
    )
    # 4) equi-join to the observation specifications to pull out the desired data
    .map_columns({"action_date": "as_of_date"})
    .natural_join(
        descr(d_observations=d_observations),
        on=["target_date", "as_of_date"],
        jointype="inner"
    )
    .order_rows(["target_date", "as_of_date"])
)
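As an aside, step 3 is the key trick: after the count-0 observation rows are mixed in, a cumulative sum per target_date converts “reservations made on a date” into “reservations known by a date.” Here is a small stand-alone Pandas sketch of that idiom (my illustration; per_date is an invented name holding the step-2 style rows, which can be worked out by hand from our example data):

# rows as they would look after step 2: per-(target_date, action_date) totals,
# including the injected count-0 observation rows
per_date = pd.DataFrame({
    "target_date": [10, 10, 11, 11, 11],
    "action_date": [8, 9, 7, 9, 10],
    "reservation_count": [0, 1, 3, 2, 5],
})
# running total within each target_date, ordered by action_date: the Pandas
# analogue of SQL's SUM() OVER (PARTITION BY ... ORDER BY ...)
per_date["reservation_count"] = (
    per_date
    .sort_values(["target_date", "action_date"])
    .groupby("target_date")["reservation_count"]
    .cumsum()
)
# per_date now holds (10, 8, 0), (10, 9, 1), (11, 7, 3), (11, 9, 5), (11, 10, 10)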
Of course the proof is in the pudding: let’s see the result of these methods. To get a result using Pandas as the execution engine, it is enough to call eval().
res_Pandas = ops.eval({"d_actions": d_actions, "d_observations": d_observations})
assert res_Pandas.equals(res_sql)
res_Pandas
| | target_date | as_of_date | reservation_count |
|---|---|---|---|
| 0 | 10 | 8 | 0 |
| 1 | 10 | 9 | 1 |
| 2 | 11 | 9 | 5 |
| 3 | 11 | 10 | 10 |
We can also run the operator chain remotely in many target databases (GoogleBigQuery, PostgreSQL, Redshift, SparkSQL, and others). For example:
# translate the operator chain to SQLite's dialect and run it in the database
with sqlite3.connect(":memory:") as conn:
    d_observations.to_sql("d_observations", conn, index=False)
    d_actions.to_sql("d_actions", conn, index=False)
    res_sql2 = data_algebra.default_data_model.pd.read_sql_query(
        ops.to_sql(data_algebra.SQLite.SQLiteModel()), conn)
assert res_sql2.equals(res_sql)
res_sql2
| | target_date | as_of_date | reservation_count |
|---|---|---|---|
| 0 | 10 | 8 | 0 |
| 1 | 10 | 9 | 1 |
| 2 | 11 | 9 | 5 |
| 3 | 11 | 10 | 10 |
print(ops.to_sql())
-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: SQLiteModel 1.4.3
--  string quote: '
--  identifier quote: "
WITH
 "project_1" AS (
  SELECT  -- .project({ }, group_by=['target_date', 'as_of_date'])
   "target_date" ,
   "as_of_date"
  FROM
   "d_observations"
  GROUP BY
   "target_date" ,
   "as_of_date"
 ) ,
 "extend_2" AS (
  SELECT  -- .extend({ 'reservation_count': '0'})
   "target_date" ,
   "as_of_date" ,
   0 AS "reservation_count"
  FROM
   "project_1"
 ) ,
 "map_columns_3" AS (
  SELECT  -- .map_columns({'as_of_date': 'action_date'})
   "target_date" ,
   "as_of_date" AS "action_date" ,
   "reservation_count"
  FROM
   "extend_2"
 ) ,
 "extend_4" AS (
  SELECT  -- .extend({ 'source_name': "'a'"})
   "target_date" ,
   "action_date" ,
   "reservation_count" ,
   'a' AS "source_name"
  FROM
   "map_columns_3"
 ) ,
 "extend_5" AS (
  SELECT  -- .extend({ 'source_name': "'b'"})
   "target_date" ,
   "action_date" ,
   "reservation_count" ,
   'b' AS "source_name"
  FROM
   "d_actions"
 ) ,
 "concat_rows_6" AS (
  SELECT  -- _0..concat_rows(b= _1, id_column='source_name', a_name='a', b_name='b')
   "target_date" ,
   "reservation_count" ,
   "action_date"
  FROM
   ( SELECT * FROM "extend_4" UNION ALL SELECT * FROM "extend_5" ) "concat_rows_6"
 ) ,
 "project_7" AS (
  SELECT  -- .project({ 'reservation_count': 'reservation_count.sum()'}, group_by=['target_date', 'action_date'])
   "target_date" ,
   "action_date" ,
   SUM("reservation_count") AS "reservation_count"
  FROM
   "concat_rows_6"
  GROUP BY
   "target_date" ,
   "action_date"
 ) ,
 "extend_8" AS (
  SELECT  -- .extend({ 'reservation_count': 'reservation_count.cumsum()'}, partition_by=['target_date'], order_by=['action_date'])
   "target_date" ,
   "action_date" ,
   SUM("reservation_count") OVER ( PARTITION BY "target_date" ORDER BY "action_date" ) AS "reservation_count"
  FROM
   "project_7"
 ) ,
 "map_columns_9" AS (
  SELECT  -- .map_columns({'action_date': 'as_of_date'})
   "target_date" ,
   "action_date" AS "as_of_date" ,
   "reservation_count"
  FROM
   "extend_8"
 ) ,
 "natural_join_0" AS (
  SELECT  -- _0..natural_join(b= _1, on=['target_date', 'as_of_date'], jointype='INNER')
   COALESCE("join_source_left_0"."target_date", "join_source_right_0"."target_date") AS "target_date" ,
   COALESCE("join_source_left_0"."as_of_date", "join_source_right_0"."as_of_date") AS "as_of_date" ,
   "reservation_count"
  FROM
  (
   "map_columns_9" "join_source_left_0"
   INNER JOIN
   "d_observations" "join_source_right_0"
   ON
    "join_source_left_0"."target_date" = "join_source_right_0"."target_date"
    AND "join_source_left_0"."as_of_date" = "join_source_right_0"."as_of_date"
  )
 )
SELECT  -- .order_rows(['target_date', 'as_of_date'])
 *
FROM
 "natural_join_0"
ORDER BY
 "target_date" ,
 "as_of_date"
And that is our exercise in bitemporal modeling using the data algebra.