Introduction
A surprisingly tricky problem in doing data science or analytics in the database is the situation where one has to re-map a large number of columns. This occurs, for example, in the vtreat data preparation system, where a large number of the variable encodings reduce to table-lookup or re-mapping.
For imperative systems and in-memory data frames (such as Pandas), this presents no great problem (for example, pandas.Series.map can be used on each column in sequence).
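For instance, here is a minimal sketch of that in-memory approach (the data frame and mapping dictionaries below are purely illustrative, not the example worked later in this note):

import pandas as pd

# illustrative data frame and per-column mapping dictionaries
d0 = pd.DataFrame({'va': ['a', 'b', 'a'], 'vb': ['e', 'g', 'e']})
maps = {
    'va': {'a': 1.0, 'b': 2.0},
    'vb': {'e': 3.0, 'g': 4.0},
}
# re-map each column in sequence; unmatched values become NaN, here filled with 0.0
for col, mapping in maps.items():
    d0[col + '_mapped'] = d0[col].map(mapping).fillna(0.0)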
In relational databases the common solutions include:
- Large CASE/WHEN statements (one case per variable level or value).
- Deeply nested JOINs (one per variable).
- Sequenced UPDATE JOINs (one per variable).
The tricky part is: data science applications easily have hundreds of string-valued variables, each having hundreds of thousands of tracked values. The possibility of a large number of variable values or levels renders the CASE/WHEN solution undesirable, as the query size is proportional to the number of variables times the number of values per variable. The JOIN solutions build a query size proportional to the number of variables (again undesirable, but tolerable). However, super deeply nested queries are just not what relational databases expect. And a sequence of updates isn't easy to support as a single query or view.
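To make the scaling concrete, here is a rough sketch (illustrative only, and not the approach taken in this note) of pasting together such a CASE/WHEN query in Python; the generated query text grows with the number of variables times the number of tracked values per variable:

# illustrative sketch: build a CASE/WHEN re-mapping query by string pasting
maps = {
    'va': {'a': 1.0, 'b': 2.0},
    'vb': {'e': 3.0, 'f': 4.0},
}
case_exprs = []
for col, mapping in maps.items():
    whens = ' '.join(f"WHEN '{v}' THEN {x}" for v, x in mapping.items())
    case_exprs.append(f"CASE {col} {whens} ELSE 0.0 END AS {col}_mapped")
sql = 'SELECT id, ' + ', '.join(case_exprs) + ' FROM d'
print(sql)  # query text size proportional to (number of variables) x (values per variable)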
The impedance mismatch is: re-mapping a large number of columns is a reasonable ask, but doesn't always result in what is considered a polite query in the database.
Thankfully there is at least one more avenue of solution: the relational database's ability to efficiently dispatch operations over a very large number of rows. If we exploit that, we can get the effect of a large number of mappings in a limited, though by no means small, query.
Let’s work this as an example.
Our Example
First we import our packages, and make our notional example data.
import pandas as pd
from data_algebra.data_ops import descr
from data_algebra.solutions import def_multi_column_map
import data_algebra.cdata
import data_algebra.test_util
import data_algebra.BigQuery
d = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'va': ['a', 'b', 'a', 'c'],
    'vb': ['e', 'e', 'g', 'f'],
})
d
|   | id | va | vb |
|---|----|----|----|
| 0 | 1  | a  | e  |
| 1 | 2  | b  | e  |
| 2 | 3  | a  | g  |
| 3 | 4  | c  | f  |
Our goal is to re-map the values in the columns va and vb to numbers. In practice, we may have hundreds of variables, and hundreds of thousands of levels. (We could generalize to sets of columns mapping to sets of columns, but the one-to-one example is clearer and more common in practice.)
Mappings as Data
In all cases this sort of re-mapping can itself be written down as a table.
m = pd.DataFrame({
    'column_name': ['va', 'va', 'vb', 'vb'],
    'column_value': ['a', 'b', 'e', 'f'],
    'mapped_value': [1., 2., 3., 4.],
})
m
|   | column_name | column_value | mapped_value |
|---|-------------|--------------|--------------|
| 0 | va          | a            | 1.0          |
| 1 | va          | b            | 2.0          |
| 2 | vb          | e            | 3.0          |
| 3 | vb          | f            | 4.0          |
m is keyed by column_name and column_value. A row with a given pair of keys has a payload of mapped_value saying what number this combination is mapped to. We have co-mingled the mapping rules for va with the mapping rules for vb into a single table by making sure we have sufficient keys to separate the cases.
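As a quick check of the keying (a minimal sketch, using only the m defined above), we can turn m into a Python dictionary keyed by the (column_name, column_value) pairs and look a few values up:

# sketch: the mapping table as a dictionary keyed by (column_name, column_value)
lookup = {
    (row['column_name'], row['column_value']): row['mapped_value']
    for _, row in m.iterrows()
}
assert lookup[('va', 'a')] == 1.0
assert lookup[('vb', 'f')] == 4.0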
Many Mappings in a Single Join
If we re-structure the data we can re-write many mappings as a single shared join.
What we want is a copy of the data d where each row of d is represented by multiple rows. This is exactly what databases call an un-pivot or melt, and what the data algebra cdata system was designed to manage.
Such a transform is specified as follows.
unpivot = data_algebra.cdata.unpivot_specification(
    row_keys=['id'],
    col_name_key='column_name',
    col_value_key='column_value',
    value_cols=['va', 'vb'],
)
The specification says which columns identify records (id), which columns we want to take values from (va and vb), and how we want those values laid out after the unpivot (by column_name and column_value). The effect of the transform is that it re-encodes rows of the data frame as identifiable groups of rows.
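For comparison, in Pandas essentially the same reshape could be written directly with pandas.melt (a sketch for cross-checking only; the data algebra version shown next is what translates to SQL):

# sketch: the same un-pivot written directly in Pandas
d_long = d.melt(
    id_vars=['id'],
    value_vars=['va', 'vb'],
    var_name='column_name',
    value_name='column_value',
).sort_values(['id', 'column_name']).reset_index(drop=True)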
unpivot.transform(d)
|   | id | column_name | column_value |
|---|----|-------------|--------------|
| 0 | 1  | va          | a            |
| 1 | 1  | vb          | e            |
| 2 | 2  | va          | b            |
| 3 | 2  | vb          | e            |
| 4 | 3  | va          | a            |
| 5 | 3  | vb          | g            |
| 6 | 4  | va          | c            |
| 7 | 4  | vb          | f            |
Notice each id now occurs twice, once for va and once for vb. The important observation is: re-mapping all of our variables is now just a single join against m, using column_name and column_value as composite join keys. The entire solution is then the transform, the join, and an inverse transform (back into the all-entries-in-a-single-row format: a fundamental form we call a "row record"). The convenience function def_multi_column_map() builds this whole pipeline for us.
ops = def_multi_column_map(
    descr(d=d),
    mapping_table=descr(m=m),
    row_keys=['id'],
    cols_to_map=['va', 'vb'],
    coalesce_value=0.0,
    cols_to_map_back=['va_mapped', 'vb_mapped'],
)
We expect ops to re-process d into the following.
expect = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'va_mapped': [1.0, 2.0, 1.0, 0.0],
    'vb_mapped': [3.0, 3.0, 0.0, 4.0],
})
expect
|   | id | va_mapped | vb_mapped |
|---|----|-----------|-----------|
| 0 | 1  | 1.0       | 3.0       |
| 1 | 2  | 2.0       | 3.0       |
| 2 | 3  | 1.0       | 0.0       |
| 3 | 4  | 0.0       | 4.0       |
res = ops.eval({'d': d, 'm': m})
res
|   | id | va_mapped | vb_mapped |
|---|----|-----------|-----------|
| 0 | 1  | 1.0       | 3.0       |
| 1 | 2  | 2.0       | 3.0       |
| 2 | 3  | 1.0       | 0.0       |
| 3 | 4  | 0.0       | 4.0       |
assert data_algebra.test_util.equivalent_frames(res, expect)
In Database
ops is a data algebra pipeline (actually a directed acyclic graph, or DAG). Being such, it can also be run against an adapted database by automatic translation to SQL.
Let’s see this run in the Google BigQuery database.
First we build our connection, and insert our example data. Of course, the point of using a database is that the data is usually already there (not something we insert).
# try it in database
db_handle = data_algebra.BigQuery.example_handle()
db_handle.insert_table(d, table_name='d', allow_overwrite=True)
db_handle.insert_table(m, table_name='m', allow_overwrite=True)
db_handle.drop_table("merged")
db_handle.execute(f"CREATE TABLE {db_handle.db_model.table_prefix}.merged AS {db_handle.to_sql(ops)}")
db_res = db_handle.read_query(
    f"SELECT * FROM {db_handle.db_model.table_prefix}.merged ORDER BY id")
db_res
|   | va_mapped | vb_mapped | id |
|---|-----------|-----------|----|
| 0 | 1.0       | 3.0       | 1  |
| 1 | 2.0       | 3.0       | 2  |
| 2 | 1.0       | 0.0       | 3  |
| 3 | 0.0       | 4.0       | 4  |
assert data_algebra.test_util.equivalent_frames(db_res, expect)
Conclusion
We showed how to efficiently perform a large number of variable re-mappings in a SQL database. Variable re-mapping or lookup is a fundamental step for data analytics and data science. We can translate the process to SQL’s primary operation: the JOIN. Some care must be taken to translate into a small number of operations over a large number of rows, as this is where relational databases shine.
Appendix
The Data Algebra Pipeline
We can take a look at the actual data algebra pipeline. It is, as promised, converting the records, doing one big join, and then converting the records back. It is nice to have a convenience function write out such a pipeline for us (this is a huge advantage of programming over composable, inspectable, and value-oriented APIs).
ops
(
TableDescription(table_name="d", column_names=["id", "va", "vb"])
.select_columns(["id", "va", "vb"])
.convert_records(
data_algebra.cdata.RecordMap(
blocks_in=None,
blocks_out=data_algebra.cdata.RecordSpecification(
record_keys=["id"],
control_table=pd.DataFrame(
{"column_name": ["va", "vb"], "column_value": ["va", "vb"],}
),
control_table_keys=["column_name"],
),
)
)
.natural_join(
b=TableDescription(
table_name="m", column_names=["column_name", "column_value", "mapped_value"]
).select_columns(["column_name", "column_value", "mapped_value"]),
by=["column_name", "column_value"],
jointype="LEFT",
)
.extend({"mapped_value": "mapped_value.coalesce(0.0)"})
.convert_records(
data_algebra.cdata.RecordMap(
blocks_in=data_algebra.cdata.RecordSpecification(
record_keys=["id"],
control_table=pd.DataFrame(
{"column_name": ["va", "vb"], "mapped_value": ["va", "vb"],}
),
control_table_keys=["column_name"],
),
blocks_out=None,
)
)
.rename_columns({"va_mapped": "va", "vb_mapped": "vb"})
)
The SQL
We can also take a peek at the SQL realizing this pipeline over Google BigQuery. It is large, because the record transformation steps are themselves realized in terms of SQL primitives (unpivots are joins, and pivots are aggregations). The main thing we can say about this query is, we didn’t have to write it! And yes, it is machine generated SQL deliberately targeting a simple sub-grammar of the language. So hand-rolled SQL would be smaller.
print(db_handle.to_sql(ops))
-- data_algebra SQL https://github.com/WinVector/data_algebra
-- dialect: BigQueryModel
-- string quote: "
-- identifier quote: `
WITH
`convert_records_blocks_out_0` AS (
-- convert records blocks out
SELECT
a.`id` AS `id`,
b.`column_name` AS `column_name`,
CASE WHEN CAST(b.`column_value` AS STRING) = "va" THEN a.`va` WHEN CAST(b.`column_value` AS STRING) = "vb" THEN a.`vb` ELSE NULL END AS `column_value`
FROM ( SELECT * FROM
`data-algebra-test.test_1.d`
) a
CROSS JOIN (
SELECT
*
FROM (
(SELECT "va" AS `column_name`, "va" AS `column_value`)
UNION ALL (SELECT "vb" AS `column_name`, "vb" AS `column_value`)
) `table_values`
) b
ORDER BY
a.`id`,
b.`column_name`
) ,
`natural_join_1` AS (
SELECT -- _0..natural_join(b= _1, by=['column_name', 'column_value'], jointype='LEFT')
COALESCE(`convert_records_blocks_out_0`.`column_name`, `data-algebra-test.test_1.m`.`column_name`) AS `column_name` ,
COALESCE(`convert_records_blocks_out_0`.`column_value`, `data-algebra-test.test_1.m`.`column_value`) AS `column_value` ,
`id` ,
`mapped_value`
FROM
(
`convert_records_blocks_out_0`
LEFT JOIN
`data-algebra-test.test_1.m`
ON (
`convert_records_blocks_out_0`.`column_name` = `data-algebra-test.test_1.m`.`column_name` AND
`convert_records_blocks_out_0`.`column_value` = `data-algebra-test.test_1.m`.`column_value`
)
)
) ,
`extend_2` AS (
SELECT -- .extend({ 'mapped_value': 'mapped_value.coalesce(0.0)'})
`id` ,
`column_name` ,
`column_value` ,
COALESCE(`mapped_value`, 0.0) AS `mapped_value`
FROM
`natural_join_1`
) ,
`convert_records_blocks_in_3` AS (
-- convert records blocks in
SELECT
`id` AS `id`,
MAX(CASE WHEN ( CAST(`column_name` AS STRING) = "va" ) THEN `mapped_value` ELSE NULL END) AS `va`,
MAX(CASE WHEN ( CAST(`column_name` AS STRING) = "vb" ) THEN `mapped_value` ELSE NULL END) AS `vb`
FROM ( SELECT * FROM
`extend_2`
) a
GROUP BY
`id`
ORDER BY
`id`
)
SELECT -- .rename_columns({'va_mapped': 'va', 'vb_mapped': 'vb'})
`va` AS `va_mapped` ,
`vb` AS `vb_mapped` ,
`id`
FROM
`convert_records_blocks_in_3`
Clean Up
# clean up
db_handle.drop_table("d")
db_handle.drop_table("m")
db_handle.drop_table("merged")
db_handle.close()

# show we made it to here, and did not assert earlier
print('done')
done