
Advanced Data Reshaping in Python and R

This note is a simple data wrangling example worked using both the Python data_algebra package and the R cdata package. Both of these packages make data wrangling easy through the use of coordinatized data concepts (relying heavily on Codd’s “rule of access”). The advantages of data_algebra and cdata are: The […]

Introducing data_algebra

This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and […]

Big News: Porting vtreat to Python

We at Win-Vector LLC have some big news. We are finally porting a streamlined version of our R vtreat variable preparation package to Python. vtreat is a great system for preparing messy data for supervised machine learning. The new implementation is based on Pandas, and we are experimenting with pushing […]

Piping is Method Chaining

What R users now call piping, popularized by Stefan Milton Bache and Hadley Wickham, is inline function application (this is notationally similar to, but distinct from, the powerful interprocess communication and concurrency tool introduced to Unix by Douglas McIlroy in 1973). In object-oriented languages this sort of notation for […]

Is 10,000 Cells Big?

Trick question: is a 10,000 cell numeric data.frame big or small? In the era of “big data” 10,000 cells is minuscule. Such data could be fit on fewer than 1,000 punched cards (or less than half a box). The joking answer is: it is small when they are selling you […]