
How to Re-Map Many Columns in a Database

Introduction

A surprisingly tricky problem in doing data science or analytics in the database is the situation where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system, where a large number of the variable encodings reduce to table lookups or re-mappings.

For imperative systems and in-memory data frames (such as Pandas) this presents no great problem: for example, pandas.Series.map can be applied to each column in sequence.
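A small sketch of that approach (the example data and mapping dictionaries here are hypothetical):

import pandas as pd

d = pd.DataFrame({
    'va': ['a', 'b', 'a', 'c'],
    'vb': ['e', 'e', 'g', 'f'],
})
# one mapping dictionary per column (hypothetical values)
mappings = {
    'va': {'a': 1.0, 'b': 2.0},
    'vb': {'e': 3.0, 'f': 4.0},
}
# re-map each column in sequence, sending unmapped levels to 0.0
for col, mapping in mappings.items():
    d[col + '_mapped'] = d[col].map(mapping).fillna(0.0)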

In relational databases the common solutions include:

  • Large CASE/WHEN statements (one case per variable level or value).
  • Deeply nested JOINs (one per variable).
  • Sequenced UPDATE JOINs (one per variable).

The tricky part is: data science applications easily have hundreds of string-valued variables, each having hundreds of thousands of tracked values. The possibility of a large number of variable values or levels renders the CASE/WHEN solution undesirable, as the query size is proportional to the number of variables times the number of values. The JOIN solutions build a query whose size is proportional to the number of variables (again undesirable, but tolerable). However, super deeply nested queries are just not what relational databases expect. And a sequence of updates isn’t easy to support as a single query or view.
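To make the scaling concrete, here is a small sketch (reusing the hypothetical mappings dictionary from the sketch above, and producing illustrative SQL text only, not a query from this article) of the CASE/WHEN form; the query text grows with the number of variables times the number of tracked values per variable:

def case_when_clause(col, levels):
    # one WHEN branch per tracked value of the column
    whens = " ".join(f"WHEN '{k}' THEN {v}" for k, v in levels.items())
    return f"CASE {col} {whens} ELSE 0.0 END AS {col}_mapped"

sql = (
    "SELECT id, "
    + ", ".join(case_when_clause(c, lv) for c, lv in mappings.items())
    + " FROM d"
)
print(sql)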

The impedance mis-match is: re-mapping a large number of columns is a reasonable ask, but doesn’t always result in what is considered a polite query in the database.

Thankfully there is at least one more avenue of solution: the relational database’s ability to efficiently dispatch operations over a very large number of rows. If we exploit that, we can get the effect of a large number of mappings in a limited, though by no means small, query.

Let’s work this as an example.

Our Example

First we import our packages, and make our notional example data.

import pandas as pd
from data_algebra.data_ops import descr
from data_algebra.solutions import def_multi_column_map
import data_algebra.cdata
import data_algebra.test_util
import data_algebra.BigQuery
d = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'va': ['a', 'b', 'a', 'c'],
    'vb': ['e', 'e', 'g', 'f'],
})

d
id va vb
0 1 a e
1 2 b e
2 3 a g
3 4 c f
For our problem, let’s say we want to re-map the values (or levels) seen for the columns va and vb to numbers. In practice, we may have hundreds of variables and hundreds of thousands of levels. (We could generalize to sets of columns mapping to sets of columns, but the one-to-one example is clearer and more common in practice.)

Mappings as Data

In all cases, this sort of re-mapping can itself be written down as a table.

m = pd.DataFrame({
    'column_name': ['va', 'va', 'vb', 'vb'],
    'column_value': ['a', 'b', 'e', 'f'],
    'mapped_value': [1., 2., 3., 4.],
})

m
column_name column_value mapped_value
0 va a 1.0
1 va b 2.0
2 vb e 3.0
3 vb f 4.0
The above mapping table m is keyed by column_name and column_value. A row with a given pair of keys has a payload of mapped_value saying what number this combination is mapped to. We have co-mingled mapping rules for va with mapping rules for vb into a single table by making sure we have sufficient keys to separate the cases.
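For instance, the mapped value for level 'b' of column va can be read off by filtering on both keys (a small illustration in pandas):

# look up one mapping using the composite key (column_name, column_value)
m.loc[(m['column_name'] == 'va') & (m['column_value'] == 'b'), 'mapped_value']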

Many Mappings in a Single Join

If we re-structure the data we can re-write many mappings as a single shared join.

What we want is a copy of the data d where each row in d is represented by multiple rows. This is exactly what databases call an un-pivot or melt, and what the data algebra cdata system was designed to manage.
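For intuition, the same un-pivot can be written in plain pandas with DataFrame.melt (just an illustration; the data algebra version below is what we will actually use):

d_long = d.melt(
    id_vars=['id'],
    value_vars=['va', 'vb'],
    var_name='column_name',
    value_name='column_value',
)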

Such a transform is specified as follows.

unpivot = data_algebra.cdata.unpivot_specification(
    row_keys=['id'],
    col_name_key='column_name',
    col_value_key='column_value',
    value_cols=['va', 'vb'],
)
All we have done is say which columns identify records (id), which columns we want to take values from (va and vb), and how we want those values laid out after the unpivot (by column_name and column_value). The effect of the transform is that it re-encodes rows of the data frame as identifiable groups of rows.
unpivot.transform(d)
id column_name column_value
0 1 va a
1 1 vb e
2 2 va b
3 2 vb e
4 3 va a
5 3 vb g
6 4 va c
7 4 vb f
Notice each row id now occurs twice. The important observation is: re-mapping all of our variables is now just a single join against m, using column_name and column_value as composite join keys. The entire process is a transform, a join, and then an inverse transform (back into the all-entries-in-a-single-row format: a fundamental form we call a “row record”).
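For reference, here is a rough plain-pandas sketch of that plan, continuing from the d_long melt sketch above (an illustration only, not what the data algebra generates):

# one join handles the re-mapping of all columns at once
joined = d_long.merge(m, on=['column_name', 'column_value'], how='left')
joined['mapped_value'] = joined['mapped_value'].fillna(0.0)
# pivot back to one row per id (the "row record" form)
wide = joined.pivot(index='id', columns='column_name', values='mapped_value')
wide = wide.rename(columns={'va': 'va_mapped', 'vb': 'vb_mapped'}).reset_index()

The data algebra convenience function def_multi_column_map builds this whole plan for us as a single re-usable pipeline.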
ops = def_multi_column_map(
    descr(d=d),
    mapping_table=descr(m=m),
    row_keys=['id'],
    cols_to_map=['va', 'vb'],
    coalesce_value=0.0,
    cols_to_map_back=['va_mapped', 'vb_mapped'],
)
We expect ops to re-process d into the following.
expect = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'va_mapped': [1.0, 2.0, 1.0, 0.0],
    'vb_mapped': [3.0, 3.0, 0.0, 4.0],
})

expect
id va_mapped vb_mapped
0 1 1.0 3.0
1 2 2.0 3.0
2 3 1.0 0.0
3 4 0.0 4.0
Let’s see if we get that result.
res = ops.eval({'d': d, 'm': m})

res
id va_mapped vb_mapped
0 1 1.0 3.0
1 2 2.0 3.0
2 3 1.0 0.0
3 4 0.0 4.0
And we can confirm the results do indeed match.
assert data_algebra.test_util.equivalent_frames(res, expect)

In Database

ops is a data algebra pipeline (actually a directed acyclic graph, or DAG). Being such, it can also be run on an adapted database by automatic translation to SQL.

Let’s see this run in the Google BigQuery database.

First we build our connection, and insert our example data. Of course, the point of using a database is that the data is usually already there (not something we insert).

# try it in database
db_handle = data_algebra.BigQuery.example_handle()
db_handle.insert_table(d, table_name='d', allow_overwrite=True)
db_handle.insert_table(m, table_name='m', allow_overwrite=True)
db_handle.drop_table("merged")
We can then create a table of results, without any additional data motion to or from the database.
db_handle.execute(
    f"CREATE TABLE {db_handle.db_model.table_prefix}.merged AS {db_handle.to_sql(ops)}")
To look at the result, we bring it back to Python/Jupyter.
db_res = db_handle.read_query(
    f"SELECT * FROM {db_handle.db_model.table_prefix}.merged ORDER BY id")

db_res
va_mapped vb_mapped id
0 1.0 3.0 1
1 2.0 3.0 2
2 1.0 0.0 3
3 0.0 4.0 4
assert data_algebra.test_util.equivalent_frames(db_res, expect)
And we have confirmed we get the same result. This sort of methodology is what allows the vtreat data preparation system to be run in a database at data warehouse scale.

Conclusion

We showed how to efficiently perform a large number of variable re-mappings in a SQL database. Variable re-mapping or lookup is a fundamental step for data analytics and data science. We can translate the process to SQL’s primary operation: the JOIN. Some care must be taken to translate into a small number of operations over a large number of rows, as this is where relational databases shine.

Appendix

The Data Algebra Pipeline

We can take a look at the actual data algebra pipeline. It is, as promised, converting the records, doing one big join, and then converting the records back. It is nice to have a convenience function to write out such a pipeline for us (a huge advantage of programming over composable, inspectable, and value-oriented APIs).

ops
(
    TableDescription(table_name="d", column_names=["id", "va", "vb"])
    .select_columns(["id", "va", "vb"])
    .convert_records(
        data_algebra.cdata.RecordMap(
            blocks_in=None,
            blocks_out=data_algebra.cdata.RecordSpecification(
                record_keys=["id"],
                control_table=pd.DataFrame(
                    {"column_name": ["va", "vb"], "column_value": ["va", "vb"],}
                ),
                control_table_keys=["column_name"],
            ),
        )
    )
    .natural_join(
        b=TableDescription(
            table_name="m", column_names=["column_name", "column_value", "mapped_value"]
        ).select_columns(["column_name", "column_value", "mapped_value"]),
        by=["column_name", "column_value"],
        jointype="LEFT",
    )
    .extend({"mapped_value": "mapped_value.coalesce(0.0)"})
    .convert_records(
        data_algebra.cdata.RecordMap(
            blocks_in=data_algebra.cdata.RecordSpecification(
                record_keys=["id"],
                control_table=pd.DataFrame(
                    {"column_name": ["va", "vb"], "mapped_value": ["va", "vb"],}
                ),
                control_table_keys=["column_name"],
            ),
            blocks_out=None,
        )
    )
    .rename_columns({"va_mapped": "va", "vb_mapped": "vb"})
)

The SQL

We can also take a peek at the SQL realizing this pipeline over Google BigQuery. It is large, because the record transformation steps are themselves realized in terms of SQL primitives (unpivots are joins, and pivots are aggregations). The main thing we can say about this query is, we didn’t have to write it! And yes, it is machine generated SQL deliberately targeting a simple sub-grammar of the language. So hand-rolled SQL would be smaller.

print(db_handle.to_sql(ops))
-- data_algebra SQL https://github.com/WinVector/data_algebra
--  dialect: BigQueryModel
--       string quote: "
--   identifier quote: `
WITH
 `convert_records_blocks_out_0` AS (
  -- convert records blocks out
  SELECT
     a.`id` AS `id`,
     b.`column_name` AS `column_name`,
     CASE   WHEN CAST(b.`column_value` AS STRING) = "va" THEN a.`va`   WHEN CAST(b.`column_value` AS STRING) = "vb" THEN a.`vb`  ELSE NULL END AS `column_value`
   FROM ( SELECT * FROM
   `data-algebra-test.test_1.d`
    ) a
   CROSS JOIN (
    SELECT
     *
    FROM (
        (SELECT "va" AS `column_name`, "va" AS `column_value`)
        UNION ALL (SELECT "vb" AS `column_name`, "vb" AS `column_value`)
    ) `table_values`
    ) b
    ORDER BY
    a.`id`,
    b.`column_name`
 ) ,
 `natural_join_1` AS (
  SELECT  -- _0..natural_join(b= _1, by=['column_name', 'column_value'], jointype='LEFT')
   COALESCE(`convert_records_blocks_out_0`.`column_name`, `data-algebra-test.test_1.m`.`column_name`) AS `column_name` ,
   COALESCE(`convert_records_blocks_out_0`.`column_value`, `data-algebra-test.test_1.m`.`column_value`) AS `column_value` ,
   `id` ,
   `mapped_value`
  FROM
  (
   `convert_records_blocks_out_0`
  LEFT JOIN
   `data-algebra-test.test_1.m`
  ON (
   `convert_records_blocks_out_0`.`column_name` = `data-algebra-test.test_1.m`.`column_name`  AND
   `convert_records_blocks_out_0`.`column_value` = `data-algebra-test.test_1.m`.`column_value`
  )
  )
 ) ,
 `extend_2` AS (
  SELECT  -- .extend({ 'mapped_value': 'mapped_value.coalesce(0.0)'})
   `id` ,
   `column_name` ,
   `column_value` ,
   COALESCE(`mapped_value`, 0.0) AS `mapped_value`
  FROM
   `natural_join_1`
 ) ,
 `convert_records_blocks_in_3` AS (
  -- convert records blocks in
  SELECT
     `id` AS `id`,
     MAX(CASE WHEN  ( CAST(`column_name` AS STRING) = "va" )  THEN `mapped_value` ELSE NULL END) AS `va`,
     MAX(CASE WHEN  ( CAST(`column_name` AS STRING) = "vb" )  THEN `mapped_value` ELSE NULL END) AS `vb`
   FROM ( SELECT * FROM
   `extend_2`
    ) a
   GROUP BY
    `id`
   ORDER BY
    `id`
 )
SELECT  -- .rename_columns({'va_mapped': 'va', 'vb_mapped': 'vb'})
 `va` AS `va_mapped` ,
 `vb` AS `vb_mapped` ,
 `id`
FROM
 `convert_records_blocks_in_3`

Clean Up

# clean up
db_handle.drop_table("df")
db_handle.drop_table("m")
db_handle.drop_table("xicor")
db_handle.close()
# show we made it to here, and did not assert earlier
print('done')
done

source code for article

Categories: Opinion Pragmatic Data Science Tutorials

Tagged as:

John Mount
