I would like to share what I have found to be a very effective personal Jupyter workflow for data science development. DALL-E “An Effective Personal Jupyter Data Science Workflow” Jupyter (nee IPython) workbooks are JSON documents that allow a data scientist to mix: code, markdown, results, images, and graphs. They […]

Estimated reading time: 10 minutes

I am pleased to announce the 0.9.0 release of the data algebra. The data algebra is realization of the Codd relational algebra for data in written in terms of Python method chaining. It allows the concise clear specification of useful data transforms. Some examples can be found here. Benefits include […]

Estimated reading time: 1 minute

I have a new intermediate introduction on the data algebra up here: Using the data algebra for Statistics and Data Science. The data algebra is a tool for data processing in Python which is implemented on top of any of Pandas, Google BigQuery, PostgreSQL, MySQL, Spark, and SQLite. It allows […]

Estimated reading time: 37 seconds

I’ve thought of Pandas as in-memory column oriented data structure with reasonable performance. If I need high performance or scale, I can move to a database. I like Pandas, and thank the authors and maintainers for their efforts. Now I kind of wonder what Pandas is, or what it wants […]

Estimated reading time: 4 minutes

I’ve been tinkering a lot recently with the data_algebra, and just released version 0.7.0 to PyPi. In this note I’ll touch on what the data algebra is, what the new features are, and my plans going forward.

Estimated reading time: 10 minutes

I have up what I think is a really neat tutorial on how to plot multiple curves on a graph in Python, using seaborn and data_algebra. It is great way to show some data shaping theory convenience functions we have developed. Please check it out.

Estimated reading time: 23 seconds

There’s a common, yet easy to fix, mistake that I often see in machine learning and data science projects and teaching: using classification rules for classification problems. This statement is a bit of word-play which I will need to unroll a bit. However, the concrete advice is that you often […]

Estimated reading time: 6 minutes

We have a new improved version of the “how to design a cdata/data_algebra data transform” up! The original article, the Python example, and the R example have all been updated to use the new video. Please check it out!

Estimated reading time: 25 seconds