
Method Warnings

Introduction

The data algebra is a Python system for designing data transformations that can be used in Pandas or SQL. The new 1.3.0 version introduces a lot of early checking and warnings to make designing data transforms more convenient and safer.

An Example

I’d like to demonstrate some of these features with an example.

Let’s import our packages and build some simple example data.

from typing import Dict, Iterable
import numpy as np
import pandas as pd
from data_algebra.data_ops import descr, ViewRepresentation
import data_algebra.test_util
import data_algebra.BigQuery
d = pd.DataFrame({
    'id': [0, 1, 2, 3, 4],
    'x': [4, 50, 1, 3, 2.2],
    'g': ['a', 'b', 'a', 'a', 'b'],
})

d
id x g
0 0 4.0 a
1 1 50.0 b
2 2 1.0 a
3 3 3.0 a
4 4 2.2 b
The data algebra is “Python first”, in that we choose method names close to what Pandas and Numpy users expect.

An Example Problem

Our problem is to compute the median of values in column x for each group of rows identified by the column g.
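
For reference, here is the same calculation in plain Pandas (a quick sketch for comparison; the name xm_expected is ours, not part of the package):

# per-group median of x, broadcast back to each row of its group
xm_expected = d.groupby('g')['x'].transform('median')

xm_expected
0     3.0
1    26.1
2     3.0
3     3.0
4    26.1
Name: x, dtype: float64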

To do this with the data algebra, we arrange steps that create new columns into a transformation pipeline. In this case a “descr()” step describes the incoming data structure, an “extend()” node adds the new column, and we add a sorting (“order_rows()“) node for presentation. Node documentation can be found here. Methods, such as “.median()“, operate on column names, values, and even on intermediate results.

The Solution

With some experience, one can write the data algebra solution as follows.


ops = (
    descr(d=d)
        .extend(
            {'xm': 'x.median()'},
            partition_by=['g'])
        .order_rows(['id'])
)
The “extend()” and “order_rows()” steps are operators, which have an introduction here. The methods available in these nodes mostly follow Pandas and Numpy, and are listed in a table here.
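
Pipelines are ordinary Python objects, so we can print one to inspect its canonical form (the exact rendering may vary slightly by package version):

print(ops)
(
    TableDescription(table_name="d", column_names=["id", "x", "g"])
    .extend({"xm": "x.median()"}, partition_by=["g"])
    .order_rows(["id"])
)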

Applying the Solution

Now let’s apply our specified transform to our example data. The new column “xm” has the correct group medians assigned to each original row.

pandas_res = ops.transform(d)

pandas_res
id x g xm
0 0 4.0 a 3.0
1 1 50.0 b 26.1
2 2 1.0 a 3.0
3 3 3.0 a 3.0
4 4 2.2 b 26.1
We can check this matches expectations.
for group in set(d['g']):
    assert np.all(
        pandas_res.loc[pandas_res['g'] == group, 'xm']
        == np.median(d.loc[d['g'] == group, 'x']))

In Database

Part of the power of the data algebra is that the transform can be translated into SQL for execution on different databases. For example, we could try to execute this query on Google BigQuery as follows.

We build a database connection and insert our example data. In real applications the data would likely be large, and already in the database.

# connect to remote Google BigQuery database
bigquery_handle = data_algebra.BigQuery.example_handle()
# insert example table into database
# in actual application data is usually already there
_ = bigquery_handle.insert_table(
    d,
    table_name='d',
    allow_overwrite=True)
We can now run the translated query. In a large-scale application we would avoid moving data to or from Python by landing the result directly in the database with a CREATE TABLE statement.
bigquery_sql = bigquery_handle.to_sql(ops)

print(bigquery_sql)
-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `extend_0` AS (
  SELECT  -- .extend({ 'xm': 'x.median()'}, partition_by=['g'])
   `id` ,
   `x` ,
   `g` ,
   PERCENTILE_CONT(`x`, 0.5) OVER ( PARTITION BY `g`  )  AS `xm`
  FROM
   `data-algebra-test.test_1.d`
 )
SELECT  -- .order_rows(['id'])
 *
FROM
 `extend_0`
ORDER BY
 `id`
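
As an aside: to land the result directly in the database, we could wrap the generated query in a CREATE TABLE statement. A minimal sketch, assuming the handle exposes an execute() method for running arbitrary SQL (check your version’s DBHandle API), and using a result table name of our own choosing:

# materialize the result in the database, avoiding data motion
bigquery_handle.execute(
    f'CREATE TABLE `data-algebra-test.test_1.res` AS {bigquery_sql}')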

And let’s see that work in BigQuery.
# works
db_res = bigquery_handle.read_query(bigquery_sql)

db_res
id x g xm
0 0 4.0 a 3.0
1 1 50.0 b 26.1
2 2 1.0 a 3.0
3 3 3.0 a 3.0
4 4 2.2 b 26.1
assert data_algebra.test_util.equivalent_frames(
    pandas_res,
    db_res)

A Variation

If we wanted only one result row per group from our median calculation, we would use the following pipeline, replacing the “extend()” with a “project()” (named to stay close to Codd’s relational terminology).

ops_p = (
    descr(d=d)
        .project(
            {'xm': 'x.median()'},
            group_by=['g'],
        )
)
This pipeline works in Pandas as follows.
pandas_res_p = ops_p.transform(d)

pandas_res_p
g xm
0 a 3.0
1 b 26.1
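For comparison, the plain-Pandas analogue of this one-row-per-group aggregation (a sketch using pandas named aggregation):

# one row per group: named aggregation in plain Pandas
d.groupby('g', as_index=False).agg(xm=('x', 'median'))
   g    xm
0  a   3.0
1  b  26.1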
And we again see expected results.
for group in set(d['g']):
    assert (
        pandas_res_p.loc[pandas_res_p['g'] == group, 'xm'].values[0]
        == np.median(d.loc[d['g'] == group, 'x']))
But we get a warning if we attempt to convert this to BigQuery SQL.
# warns!
sql_p = bigquery_handle.to_sql(ops_p)
/Users/johnmount/Documents/work/data_algebra/data_algebra/db_model.py:1694: UserWarning: BigQueryModel translation doesn't fully support method context: [MethodUse(op_name='median', is_project=True, is_windowed=False, is_ordered=False)]
  warnings.warn(f"{self} translation doesn't fully support method context: {non_rec}", UserWarning)
It turns out we can’t ignore the warning: attempting to execute the SQL fails.
# indeed, fails
# Notes: https://stackoverflow.com/a/57718190/6901725
try:
    bigquery_handle.read_query(sql_p)
except Exception as ex:
    print(f'caught: {ex}')
caught: 400 percentile_cont aggregate function is not supported.

(job ID: dc82bb2d-34b7-409c-bb17-9fd7112d8476)

                 -----Query Job SQL Follows-----                  

    |    .    |    .    |    .    |    .    |    .    |    .    |
   1:-- data_algebra SQL https://github.com/WinVector/data_algebra
   2:--  dialect: BigQueryModel
   3:--       string quote: "
   4:--   identifier quote: `
   5:WITH
   6: `table_reference_0` AS (
   7:  SELECT
   8:   `g` ,
   9:   `x`
  10:  FROM
  11:   `data-algebra-test.test_1.d`
  12: )
  13:SELECT  -- .project({ 'xm': 'x.median()'}, group_by=['g'])
  14: PERCENTILE_CONT(`x`, 0.5) AS `xm` ,
  15: `g`
  16:FROM
  17: `table_reference_0`
  18:GROUP BY
  19: `g`
    |    .    |    .    |    .    |    .    |    .    |    .    |
Anyone familiar with Google BigQuery will recognize the issue: the “PERCENTILE_CONT()” function is only available in windowed contexts (where the number of rows returned equals the number of input rows), and not in project/grouping contexts (one row returned per group).

Given how similar the two SQL queries are, the above failure can come as a surprise. But a new feature of the data algebra is that the “translate to SQL” step warns us we have a potential problem. This doesn’t even require a full database handle: the check uses data incorporated into the database model during package assembly.
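
If we want this check to fail fast, say in automated tests, we can promote the warning to an exception with Python’s standard warnings module (a sketch; the data algebra itself only warns):

import warnings

# escalate translation warnings to hard errors, useful in CI
with warnings.catch_warnings():
    warnings.simplefilter('error', category=UserWarning)
    try:
        bigquery_handle.to_sql(ops_p)
    except UserWarning as ex:
        print(f'caught translation problem: {ex}')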

Patching The Solution

We can work around the BigQuery limitation by simulating the project-median with a windowed extend-median, followed by a project-mean step. However, we feel automating such a conversion would hide too many details from the user.

Let’s try that solution by hand.

ops_p_2 = (
    ops  # start with our extend median solution
        .project(
            {'xm': 'xm.mean()'},  # pseudo-aggregation, as xm is constant per group
            group_by=['g'],
        )
)
Notice we can build a solution by continuing or composing onto an existing pipeline. Re-use and composability are major features of the data algebra.
db_res_p = bigquery_handle.read_query(ops_p_2)

db_res_p
xm g
0 3.0 a
1 26.1 b
assert data_algebra.test_util.equivalent_frames(
    pandas_res_p,
    db_res_p)
As we see, that worked.
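
We can also re-check the translation itself: regenerating the SQL for the patched pipeline should now complete without a warning.

# no warning expected this time
sql_p_2 = bigquery_handle.to_sql(ops_p_2)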

Conclusion

And that is the newest feature of the 1.3.0 data algebra: per-database SQL translation warnings. I feel the data algebra has a good breadth of correct translations, and it is now also good at telling you whether your pipeline is in fact inside that correct region. The checking system helps in building and vetting complex statistical queries (such as our t-test example or our xicor example).

Appendix

We built up ops_p_2 by adding a step to ops. The data algebra has minor optimizers in both the pipeline and SQL-generation steps. For example, we can see that in the combined pipeline the intermediate order_rows() node has been eliminated.

ops_p_2
(
    TableDescription(table_name="d", column_names=["id", "x", "g"])
    .extend({"xm": "x.median()"}, partition_by=["g"])
    .project({"xm": "xm.mean()"}, group_by=["g"])
)
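
As an extra check (ours, mirroring the earlier asserts), the combined pipeline also runs in Pandas and matches the project-based result:

assert data_algebra.test_util.equivalent_frames(
    pandas_res_p,
    ops_p_2.transform(d))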

Automating the Adaptation

We can automate the “extend / project” adaptation for re-use with the use API as follows.

# define a new pipeline macro: project_by_extend()
# this is a function we would define in a project for re-use
def project_by_extend(
        pipeline: ViewRepresentation,  # incoming pipeline
        methods: Dict[str, str],  # map: new column name -> defining expression
        *,
        group_by: Iterable[str],  # grouping columns
) -> ViewRepresentation:
    """project by first computing an extend, and then
    using pseudo-aggregators in a project.
    Uses max as the pseudo-aggregator to allow use with non-numeric types.
    """
    group_by = list(group_by)
    return (
        pipeline
            .extend(
                # do what we wanted to do in project
                # instead in windowed extend
                methods,
                partition_by=group_by)
            .project(
                # collect the results into project,
                # pseudo aggregation as values constant on groups
                {k: f'{k}.max()' for k in methods.keys()},
                group_by=group_by)
    )


# use the project_by_extend() macro while building the pipeline
ops_3 = (
    descr(d=d)
        .use(project_by_extend,
             {'xm': 'x.median()'},
             group_by=['g'])
)


# print the resulting pipeline
ops_3
(
    TableDescription(table_name="d", column_names=["id", "x", "g"])
    .extend({"xm": "x.median()"}, partition_by=["g"])
    .project({"xm": "xm.max()"}, group_by=["g"])
)
# confirm the same as constructed pipeline
assert data_algebra.test_util.equivalent_frames(
    pandas_res_p,
    ops_3.transform(d))
assert data_algebra.test_util.equivalent_frames(
    pandas_res_p,
    bigquery_handle.read_query(ops_3))

Clean Up

# clean up
bigquery_handle.drop_table('d')
bigquery_handle.close()
