Machine learning “in the database” (including in systems such as Spark) is an increasingly popular topic. And where there is machine learning, there is a need for data preparation. Many machine learning algorithms expect all data to be numeric without missing values. vtreat is a package (available for Python or for R) that reliably converts fairly wild data into such a format. To support machine learning in the database we are adding the ability to export vtreat data preparations both as data (so they can later be used by stored procedures) and as data algebra pipelines (so they can be immediately translated to executable SQL).
This note is a demonstration of converting a Python vtreat data preparation into a data algebra pipeline, which can then in turn be converted to SQL queries.
R vtreat already has similar functionality with as_rquery_plan().
Let’s work a simple problem.
First we import our modules.
In [1]:
import pandas as pd
from data_algebra.data_ops import *
import data_algebra.SQLite
import data_algebra.test_util
import vtreat
from vtreat.vtreat_db_adapter import as_data_algebra_pipeline
Now let’s bring in and arrange our data.
In [2]:
# Data from:
# https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008
data_all = pd.read_csv("diabetes_head.csv")
n = data_all.shape[0]
data_all['orig_index'] = range(n)
d_train = data_all.loc[range(n-5), :].reset_index(inplace=False, drop=True)
d_app = data_all.loc[range(n-5, n)].reset_index(inplace=False, drop=True)
We define our problem by declaring which column is the dependent variable, which columns are potential explanatory variables, and any other columns we wish to carry along.
In [3]:
outcome_name = "readmitted"
cols_to_copy = ["orig_index", "encounter_id", "patient_nbr"] + [outcome_name]
vars = ['time_in_hospital', 'weight']
columns = vars + cols_to_copy
d_train.loc[:, columns]
Out[3]:
| | time_in_hospital | weight | orig_index | encounter_id | patient_nbr | readmitted |
|---|---|---|---|---|---|---|
| 0 | 1 | NaN | 0 | 2278392 | 8222157 | False |
| 1 | 2 | NaN | 1 | 64410 | 86047875 | False |
| 2 | 2 | NaN | 2 | 500364 | 82442376 | False |
| 3 | 3 | NaN | 3 | 35754 | 82637451 | False |
| 4 | 4 | NaN | 4 | 55842 | 84259809 | False |
| … | … | … | … | … | … | … |
| 990 | 7 | NaN | 990 | 8834994 | 66138012 | False |
| 991 | 6 | NaN | 991 | 8851308 | 71896626 | False |
| 992 | 11 | NaN | 992 | 8856594 | 79469091 | False |
| 993 | 12 | NaN | 993 | 8857092 | 92251359 | True |
| 994 | 6 | NaN | 994 | 8857164 | 30360042 | False |
995 rows × 6 columns
Now we specify our vtreat data preparation scheme. Documentation and tutorials on these concepts can be found here.
In [4]:
treatment = vtreat.BinomialOutcomeTreatment(
cols_to_copy=cols_to_copy,
outcome_name=outcome_name,
outcome_target=True,
params=vtreat.vtreat_parameters(
{"sparse_indicators": False, "filter_to_recommended": False,}
),
)
d_train_treated = treatment.fit_transform(d_train.loc[:, columns])
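As an aside (this step is optional, and not shown in the original session), the fitted treatment also records how each derived variable was produced and evaluated in its score frame. A quick way to peek at it, assuming the standard pyvtreat score frame columns, is:

# per-derived-variable summary: name, treatment type, and whether vtreat recommends it
treatment.score_frame_.loc[:, ["variable", "treatment", "recommended"]]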
We can apply this data treatment to new data.
In [5]:
d_app_treated = treatment.transform(d_app.loc[:, columns])
d_app_treated
Out[5]:
| | orig_index | encounter_id | patient_nbr | readmitted | weight_is_bad | time_in_hospital | weight_logit_code | weight_prevalence_code | weight_lev__NA_ |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 995 | 8860284 | 94419315 | False | 1.0 | 3.0 | 0.005825 | 0.99397 | 1.0 |
| 1 | 996 | 8860944 | 338247 | False | 1.0 | 4.0 | 0.005825 | 0.99397 | 1.0 |
| 2 | 997 | 8864718 | 695439 | False | 1.0 | 1.0 | 0.005825 | 0.99397 | 1.0 |
| 3 | 998 | 8866632 | 103586670 | False | 0.0 | 6.0 | 0.000000 | 0.00000 | 0.0 |
| 4 | 999 | 8867106 | 4988970 | False | 1.0 | 9.0 | 0.005825 | 0.99397 | 1.0 |
Now for the feature that is new in vtreat version 1.0.1 (not yet released to PyPI): we can export the entire fit data preparation plan as a single table.
In [6]:
transform_as_data = treatment.description_matrix()
transform_as_data
Out[6]:
| | treatment_class | treatment | orig_var | variable | value | replacement |
|---|---|---|---|---|---|---|
| 0 | IndicateMissingTransform | missing_indicator | weight | weight_is_bad | _NA_ | 1.000000e+00 |
| 1 | CleanNumericTransform | clean_copy | time_in_hospital | time_in_hospital | _NA_ | 4.803015e+00 |
| 2 | MappedCodeTransform | logit_code | weight | weight_logit_code | [0-25) | -1.110223e-16 |
| 3 | MappedCodeTransform | logit_code | weight | weight_logit_code | [50-75) | -1.110223e-16 |
| 4 | MappedCodeTransform | logit_code | weight | weight_logit_code | [75-100) | -2.253714e+00 |
| 5 | MappedCodeTransform | logit_code | weight | weight_logit_code | _NA_ | 5.825087e-03 |
| 6 | MappedCodeTransform | prevalence_code | weight | weight_prevalence_code | [0-25) | 1.005025e-03 |
| 7 | MappedCodeTransform | prevalence_code | weight | weight_prevalence_code | [50-75) | 1.005025e-03 |
| 8 | MappedCodeTransform | prevalence_code | weight | weight_prevalence_code | [75-100) | 4.020101e-03 |
| 9 | MappedCodeTransform | prevalence_code | weight | weight_prevalence_code | _NA_ | 9.939698e-01 |
| 10 | IndicatorCodeTransform | indicator_code | weight | weight_lev__NA_ | _NA_ | 1.000000e+00 |
It is a simple matter to write a procedure (or, in the case of databases, a stored procedure) that reproduces the vtreat data preparation from this table. For example, vtreat itself now (in version 1.0.1) supplies a function that translates the table into a data algebra pipeline. This means we can run the data preparation in any database for which we have a data algebra SQL adapter!
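To see why this is simple, here is a minimal sketch (ours, not vtreat's actual implementation; apply_mapped_code is a made-up name) of how a single mapped-code column could be rebuilt from the transform table with ordinary pandas operations:

def apply_mapped_code(d, transform_as_data, variable):
    # select the rows of the transform table describing this derived variable
    spec = transform_as_data.loc[
        (transform_as_data["treatment_class"] == "MappedCodeTransform")
        & (transform_as_data["variable"] == variable),
        ["orig_var", "value", "replacement"],
    ]
    orig_var = spec["orig_var"].iloc[0]
    # recode missing values to the _NA_ sentinel used in the table
    keys = d[orig_var].fillna("_NA_").astype(str)
    # map each observed level to its stored replacement; novel levels get 0.0 here
    mapping = dict(zip(spec["value"], spec["replacement"]))
    return keys.map(mapping).fillna(0.0)

# example: reproduce the weight_logit_code column for the application data
apply_mapped_code(d_app, transform_as_data, "weight_logit_code")

In a database the same idea becomes a join of the data against the transform table, which is exactly the kind of operation the data algebra pipeline generates.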
Let’s see this translation in action.
In [7]:
ops = as_data_algebra_pipeline(
source=descr(d_app=d_app.loc[:, columns]),
vtreat_descr=transform_as_data,
treatment_table_name='transform_as_data',
)
# print(ops) # could print this, but it tends to be large!
In [8]:
transformed = ops.eval({
'd_app': d_app.loc[:, columns],
'transform_as_data': transform_as_data})
transformed
Out[8]:
| | time_in_hospital | orig_index | encounter_id | patient_nbr | readmitted | weight_is_bad | weight_lev__NA_ | weight_logit_code | weight_prevalence_code |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 995 | 8860284 | 94419315 | False | 1.0 | 1.0 | 0.005825 | 0.99397 |
| 1 | 4 | 996 | 8860944 | 338247 | False | 1.0 | 1.0 | 0.005825 | 0.99397 |
| 2 | 1 | 997 | 8864718 | 695439 | False | 1.0 | 1.0 | 0.005825 | 0.99397 |
| 3 | 6 | 998 | 8866632 | 103586670 | False | 0.0 | 0.0 | 0.000000 | 0.00000 |
| 4 | 9 | 999 | 8867106 | 4988970 | False | 1.0 | 1.0 | 0.005825 | 0.99397 |
In [9]:
assert data_algebra.test_util.equivalent_frames(transformed, d_app_treated)
We can then run the same operations in any SQL database we have an adapter for. Currently, we have good adapters for Google BigQuery, Spark, PostgreSQL, MySQL, and SQLite. The data algebra has extension classes designed to make producing new database adapters easy.
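Targeting another dialect is just a matter of rendering the same ops with that dialect's database model. As a hypothetical sketch (the adapter module and class names here are assumptions; check your installed data_algebra version for the exact names):

# assumption: data_algebra ships a BigQuery adapter module whose model
# exposes to_sql(); the pattern, not the exact names, is the point
import data_algebra.BigQuery

bigquery_sql = data_algebra.BigQuery.BigQueryModel().to_sql(ops)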
Let’s simply use SQLite as a convenient example.
In [10]:
db_handle = data_algebra.SQLite.example_handle()
sql = db_handle.to_sql(ops)
# print(sql) # could print this, but it tends to be large!
In [11]:
db_handle.insert_table(d_app.loc[:, columns], table_name='d_app')
db_handle.insert_table(transform_as_data, table_name='transform_as_data')
db_handle.execute('CREATE TABLE res AS ' + sql)
res_db = db_handle.read_query('SELECT * FROM res ORDER BY orig_index LIMIT 10')
res_db
Out[11]:
| | patient_nbr | weight_is_bad | encounter_id | readmitted | orig_index | weight_lev__NA_ | weight_logit_code | weight_prevalence_code | time_in_hospital |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 94419315 | 1.0 | 8860284 | 0 | 995 | 1.0 | 0.005825 | 0.99397 | 3 |
| 1 | 338247 | 1.0 | 8860944 | 0 | 996 | 1.0 | 0.005825 | 0.99397 | 4 |
| 2 | 695439 | 1.0 | 8864718 | 0 | 997 | 1.0 | 0.005825 | 0.99397 | 1 |
| 3 | 103586670 | 0.0 | 8866632 | 0 | 998 | 0.0 | 0.000000 | 0.00000 | 6 |
| 4 | 4988970 | 1.0 | 8867106 | 0 | 999 | 1.0 | 0.005825 | 0.99397 | 9 |
In [12]:
assert data_algebra.test_util.equivalent_frames(res_db, d_app_treated)
In [13]:
db_handle.close()
And that is it: advanced data preparation directly in the database. We train the vtreat data preparation in-memory, but it can now be exported and used in many more places, at much greater scale.
I discuss larger-scale exports, using SQL UPDATE, here: https://github.com/WinVector/pyvtreat/blob/main/Examples/Database/update_joins.ipynb