Introduction
In many data science projects we have the data, but it “is in the wrong format.” Fortunately re-formatting or reshaping data is a solved problem, with many different available tools.
For this note, I would like to show how to reshape data using the data algebra's cdata data reshaping tool. This should give you familiarity with a tool to use on your own data.
Example
Let’s set up our Python worksheet and start with an example.
# import our modules and define utility functions
from typing import List
import pandas as pd
from data_algebra import RecordMap, RecordSpecification
from data_algebra.test_util import equivalent_frames
import data_algebra.cdata
from IPython.display import display, HTML
# define formatted Pandas display
def display_formatted(
    d: pd.DataFrame,
    record_id_cols: List[str],
    structure_id_cols: List[str],
) -> None:
    """
    Display a Pandas table in a Jupyter notebook.

    :param d: data frame to display
    :param record_id_cols: column names keying record instances
    :param structure_id_cols: column names keying parts of record structure
    """
    display(HTML(data_algebra.cdata._format_table(
        d,
        record_id_cols=record_id_cols,
        control_id_cols=structure_id_cols,
        add_style=True,
    )))
# Example data (values changed for legibility) from:
# https://github.com/WinVector/Examples/blob/main/calling_R_from_Python/sig_pow.ipynb
d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})
d_row_form
|   | x | control | treatment | control_tail | treatment_tail |
|---|---|---|---|---|---|
| 0 | -0.14 | 4.96 | 1.06900 | 2 | 19 |
| 1 | -0.20 | 5.21 | 1.16196 | 3 | 11 |
In many projects one isn’t producing the data, so one doesn’t have direct control of the initial record format.
# pick one row out as a simpler example
example_row = d_row_form.iloc[[0], :]
example_row
|   | x | control | treatment | control_tail | treatment_tail |
|---|---|---|---|---|---|
| 0 | -0.14 | 4.96 | 1.069 | 2 | 19 |
# specify, after hard work, how we wish the example row
# was structured
d_want = pd.DataFrame({
    'x': -0.14,
    'group': ['treatment', 'control'],
    'y': [1.069, 4.96],
    'tail': [19, 2],
})
d_want
|   | x | group | y | tail |
|---|---|---|---|---|
| 0 | -0.14 | treatment | 1.069 | 19 |
| 1 | -0.14 | control | 4.960 | 2 |
Solution
We solve our problem by converting d_want into a data record specification by:

- Restricting to the “record data content portion” of the data. We treat the x column as a record key (not content) and exclude it.
- Replacing the example values with value names.
In our case the specification looks like the following.
# convert the example result into a data specification
d_specification = pd.DataFrame({
    "group": d_want["group"],
    "y": d_want["group"],
    "tail": [k + "_tail" for k in d_want["group"]],
})
d_specification
|   | group | y | tail |
|---|---|---|---|
| 0 | treatment | treatment | treatment_tail |
| 1 | control | control | control_tail |
Notice this is d_want with names instead of values. The names in the interior of the table are taken from the column names of our original row data.
Now we take this symbolic data frame and turn it into a complete data record specification by using the RecordSpecification class. To do this we specify:

- What the data record content or values block looks like.
- Which key columns uniquely identify records (record_keys = ['x']).
- Which key columns identify structure within a record (control_table_keys = ['group']).
# upgrade the data specification into a record specification
rs = RecordSpecification(
    d_specification,
    record_keys=['x'],
    control_table_keys=['group'],
)
rs
- record_keys: ['x']
- control_table_keys: ['group']
- control_table:

|   | group | y | tail |
|---|---|---|---|
| 0 | treatment | treatment | treatment_tail |
| 1 | control | control | control_tail |
The RecordSpecification class organizes all of the above concerns together. The record_keys indicate what combination of columns identifies instances (or rows). The control_table_keys identify what combination of columns indicates internal row structure of a record (in neither case do we depend on row order).
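To build intuition for what such a specification drives, here is a minimal plain-pandas sketch of a row-to-block transform for this specific example. This is not the data_algebra implementation; rows_to_blocks is a hypothetical helper, hard-coded to this control table's column names.

```python
import pandas as pd

# Same example data as above.
d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})

# The control table: names (not values) describing one record block.
d_specification = pd.DataFrame({
    "group": ["treatment", "control"],
    "y": ["treatment", "control"],
    "tail": ["treatment_tail", "control_tail"],
})

def rows_to_blocks(d, specification, record_keys):
    """Sketch: expand each input row into one multi-row record block.
    Each control-table row contributes one output row per input row:
    record keys are copied, the structure key becomes a constant, and
    value cells are looked up by the column names stored in the
    control table."""
    pieces = []
    for _, ctl in specification.iterrows():
        piece = d[record_keys].copy()
        piece['group'] = ctl['group']       # structure key: constant per control row
        piece['y'] = d[ctl['y']].values     # value: input column named by control table
        piece['tail'] = d[ctl['tail']].values
        pieces.append(piece)
    return pd.concat(pieces, ignore_index=True).sort_values(record_keys).reset_index(drop=True)

blocks = rows_to_blocks(d_row_form, d_specification, ['x'])
```

The point of the sketch is only that the control table is data, not code: changing it changes the transform without rewriting any loops.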
A core interface to the data_algebra cdata data shaping system is the RecordMap class. A RecordMap takes two primary arguments:

- blocks_in: the specification of the incoming records (None specifies single-row records).
- blocks_out: the specification of the outgoing records (None specifies single-row records).
# show RecordMap takes input and output RecordSpecification(s)
help(RecordMap.__init__)
Help on function __init__ in module data_algebra.cdata:

__init__(self, *, blocks_in: Optional[data_algebra.cdata.RecordSpecification] = None, blocks_out: Optional[data_algebra.cdata.RecordSpecification] = None, strict: bool = True)
    Build the transform specification. At least one of blocks_in or blocks_out must not be None.
    :param blocks_in: incoming record specification, None for row-records.
    :param blocks_out: outgoing record specification, None for row-records.
    :param strict: if True insist block be strict, and in and out blocks agree on row-form columns.
Instead of calling RecordMap directly, we can ask a RecordSpecification to generate a complete RecordMap for a transform to/from common useful formats. In our case .map_from_rows() specifies the transform we want.
# ask the record specification to design a map from
# rows to records of our specified form
map_from_rows = rs.map_from_rows()
map_from_rows
Transform row records of the form:

|   | record id | value | | | |
|---|---|---|---|---|---|
|   | x | treatment | control | treatment_tail | control_tail |
| 0 | x record key | treatment value | control value | treatment_tail value | control_tail value |
to records of the form:
|   | record id | record structure | value | |
|---|---|---|---|---|
|   | x | group | y | tail |
| 0 | x record key | control | control value | control_tail value |
| 1 | x record key | treatment | treatment value | treatment_tail value |
The cell annotations (record id and record structure) mark key columns; the remaining value stand-ins indicate where values are copied to and from.
Let’s see the transform in action.
# apply the mapping to our original example data
d_records = map_from_rows(d_row_form)
# format the results
display_formatted(
    d_records,
    record_id_cols=rs.record_keys,
    structure_id_cols=rs.control_table_keys)
|   | record id | record structure | value | |
|---|---|---|---|---|
|   | x | group | y | tail |
| 0 | -0.200000 | control | 5.210000 | 3 |
| 1 | -0.200000 | treatment | 1.161960 | 11 |
| 2 | -0.140000 | control | 4.960000 | 2 |
| 3 | -0.140000 | treatment | 1.069000 | 19 |
Playing a bit more with the system and data
The cdata RecordSpecification supplies four convenience methods:

- .map_from_rows()
- .map_to_rows()
- .map_from_keyed_column()
- .map_to_keyed_column()

.map_from_rows() and .map_to_rows() build a map between a general record structure and rows. This is the core of the cdata data “pivoting” system.
.map_from_keyed_column() and .map_to_keyed_column() build a map between a general record structure and, essentially, RDF triples. This has a number of direct applications. It also directly supports concepts such as melt() and cast().
About 90% of data reshaping tasks are actually simple maps between “row records” (records where all data is in a single row) and “keyed columns” (or triples, where all but one column are keys). Our example above was a bit more general. For fully general transforms one directly instantiates a RecordMap, specifying the desired input and output record specifications.
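For readers without the data_algebra package installed, this common row-records-to-keyed-columns special case can be sketched in plain pandas with melt() and pivot(). This is an illustration of the same idea, not the cdata implementation, and it only handles this simple case (no block records):

```python
import pandas as pd

# Example row-record data (same shape as d_row_form above).
d_row_form = pd.DataFrame({
    'x': [-0.14, -0.2],
    'control': [4.96, 5.21],
    'treatment': [1.069, 1.16196],
    'control_tail': [2, 3],
    'treatment_tail': [19, 11],
})

# Row records -> keyed column ("triples"): one value column, all others keys.
triples = d_row_form.melt(id_vars=['x'], var_name='measure', value_name='value')

# Keyed column -> row records: pivot back on the record key.
rows_again = triples.pivot(index='x', columns='measure', values='value').reset_index()
rows_again.columns.name = None  # cosmetic: drop the 'measure' axis label
```

The round trip loses nothing here, which is the sense in which these two formats carry the same information in different layouts.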
Let’s get more familiar with the cdata
system by trying a few more transformations on our example data.
# define the inverse map from block records to rows
inv_map = map_from_rows.inverse()
d_recovered = inv_map(d_records)
assert equivalent_frames(d_recovered, d_row_form)
# display our recovered row form
display_formatted(
    d_recovered,
    record_id_cols=rs.row_record_form().record_keys,
    structure_id_cols=rs.row_record_form().control_table_keys)
|   | record id | value | | | |
|---|---|---|---|---|---|
|   | x | control | control_tail | treatment | treatment_tail |
| 0 | -0.200000 | 5.210000 | 3 | 1.161960 | 11 |
| 1 | -0.140000 | 4.960000 | 2 | 1.069000 | 19 |
# define map from block records to a keyed column format
kc_map = rs.map_to_keyed_column()
kc_map
Transform records of the form:

|   | record id | record structure | value | |
|---|---|---|---|---|
|   | x | group | y | tail |
| 0 | x record key | treatment | treatment value | treatment_tail value |
| 1 | x record key | control | control value | control_tail value |

to records of the form:

|   | record id | record structure | value |
|---|---|---|---|
|   | x | measure | value |
| 0 | x record key | control | control value |
| 1 | x record key | control_tail | control_tail value |
| 2 | x record key | treatment | treatment value |
| 3 | x record key | treatment_tail | treatment_tail value |
# apply the map and display the results
triple_form = kc_map(d_records)
display_formatted(
    triple_form,
    record_id_cols=kc_map.blocks_out.record_keys,
    structure_id_cols=kc_map.blocks_out.control_table_keys)
|   | record id | record structure | value |
|---|---|---|---|
|   | x | measure | value |
| 0 | -0.200000 | control | 5.210000 |
| 1 | -0.200000 | control_tail | 3.000000 |
| 2 | -0.200000 | treatment | 1.161960 |
| 3 | -0.200000 | treatment_tail | 11.000000 |
| 4 | -0.140000 | control | 4.960000 |
| 5 | -0.140000 | control_tail | 2.000000 |
| 6 | -0.140000 | treatment | 1.069000 |
| 7 | -0.140000 | treatment_tail | 19.000000 |
# define the mapping
cols = ["treatment", "control", "treatment_tail", "control_tail"]
keyed_column_to_rows = RecordSpecification(
    pd.DataFrame({
        "measure": cols,
        "value": cols,
    }),
    record_keys=['x'],
    control_table_keys=['measure'],
).map_to_rows()
keyed_column_to_rows
Transform records of the form:

|   | record id | record structure | value |
|---|---|---|---|
|   | x | measure | value |
| 0 | x record key | treatment | treatment value |
| 1 | x record key | control | control value |
| 2 | x record key | treatment_tail | treatment_tail value |
| 3 | x record key | control_tail | control_tail value |

to records of the form:

|   | record id | value | | | |
|---|---|---|---|---|---|
|   | x | control | control_tail | treatment | treatment_tail |
| 0 | x record key | control value | control_tail value | treatment value | treatment_tail value |
# apply the mapping
rows_again = keyed_column_to_rows(triple_form)
assert equivalent_frames(rows_again, d_row_form)
display_formatted(
    rows_again,
    record_id_cols=['x'],
    structure_id_cols=[])
|   | record id | value | | | |
|---|---|---|---|---|---|
|   | x | control | control_tail | treatment | treatment_tail |
| 0 | -0.200000 | 5.210000 | 3.000000 | 1.161960 | 11.000000 |
| 1 | -0.140000 | 4.960000 | 2.000000 | 1.069000 | 19.000000 |
Some theory and terminology
Now that we have worked some examples together, we have some shared experience we can use to support the following commentary.
Our theory of record transformation is centered on the following potted history.
- In statistics, data is usually organized into rows (instances) and columns (measurements). This is called a “design matrix” or “data frame”, and has a long history. This promotes a view that records are table rows for many tasks.
- In computer science, data was historically organized into general records or structured blocks that may have 1 or 2 dimensional or even nested structure. General records and structured record readers were a common extension feature in programming languages such as Fortran, COBOL, Basic, and Pascal.
- Relational databases, spreadsheets, and CSV file formats again promoted the view of records having a regular 1 dimensional structure. This obscured a lot of the earlier contributions.
- An alternate view of records having many rows, and only one value carrying column is the core of semantic triples or the entity–attribute–value model.
- Converting between record formats is sometimes called “pivoting”, “folding”, “stacking”, “melting”, or “casting” (though some of these operations combine aggregation).
Instead of fixating on any one of the above formulations, we endorse a fluid view of data from Codd’s “guaranteed access rule”:
Each and every datum (atomic value) in a relational data base is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
(I.e. record structure or row structure is just a convenience framework around locating and naming values by instance id and structure id.)
Given this context, the cdata
or coordinatized data concepts include:
- Records may have multiple rows and columns. Those with multiple rows are called “block records.”
- Records have two families of keys: those that say which individual we are referring to (record id keys) and those that say which fact in a record we are referring to (record structure keys).
- Records that have only one row are called “row records”, and are a fundamental representation.
- Semantic triples or the entity–attribute–value model are records where only one column is not keying information. We have taken to calling these “keyed columns” or “indexed columns”.
- Fluid conversion between record structures is critical. What is hard in one format, may be trivial in another (one example here). Some conversions may exchange record id keys and record structure keys, creating fundamentally new records.
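To make the guaranteed access rule concrete for our example, the block-record data can be re-indexed purely by coordinates: every datum is reachable by (record id, structure key, column name), with no dependence on layout or row order. This is a plain-Python illustration, not part of the cdata API:

```python
import pandas as pd

# Block-record form of our example (as produced by map_from_rows above).
d_records = pd.DataFrame({
    'x': [-0.2, -0.2, -0.14, -0.14],
    'group': ['control', 'treatment', 'control', 'treatment'],
    'y': [5.21, 1.16196, 4.96, 1.069],
    'tail': [3, 11, 2, 19],
})

# Index every datum by (record id key, structure key, column name).
coordinates = {
    (row['x'], row['group'], col): row[col]
    for _, row in d_records.iterrows()
    for col in ['y', 'tail']
}
```

Any record layout that preserves this coordinate-to-value mapping carries the same information; record transforms are just moves between such layouts.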
Conclusion
I feel the specification of data transforms as example records to example records is a strong formulation for data reshaping or pivoting. I also feel that transforms between arbitrary records and rows (and not to keyed columns) are fundamental. One of these transforms is in fact a join, and the other an aggregation, giving these operations a foundation based on the usual relational operators.
I think the above system has achieved a “data transforms specified as data” standard, which should turn out to be more flexible than transforms specified as code.
The core of cdata
is:
- Treat data as records
- Show how one record is transformed, and the software can repeat the operation on many records.
The above is all joint work with Dr. Nina Zumel.
The above ideas have implementations in Python as the data_algebra
cdata
methods and in R
as cdata
(available on PyPi and CRAN respectively). The data_algebra
(including cdata
transforms) works over Pandas, Polars, or in remote databases (without moving data out of the database, the transform itself can be translated into SQL).
Categories: Pragmatic Data Science Tutorials