We have our latest note on the theory of data wrangling up here. It discusses the roles of “block records” and “row records” in the cdata data transform tool. With that and the theory of how to design transforms, we think we have a pretty complete description of the system.
Estimated reading time: 25 seconds
We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.
Estimated reading time: 4 minutes
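To make the “column as an index” question concrete, here is a minimal base R sketch. It is not taken from the linked post; the data frame `d`, its columns `x`, `y`, and `choice`, and the result column `derived` are all made-up names for illustration. The idea is to build a two-column (row, column) index matrix and let base R matrix indexing pick one cell per row.

```r
# Hypothetical example data: `choice` names the column to read from in each row.
d <- data.frame(
  x = c(1, 2, 3),
  y = c(10, 20, 30),
  choice = c("x", "y", "x"),
  stringsAsFactors = FALSE
)

# Columns that `choice` is allowed to refer to (an assumption of this sketch).
value_cols <- c("x", "y")

# One (row number, column number) pair per row of d.
idx <- cbind(seq_len(nrow(d)), match(d$choice, value_cols))

# Indexing a matrix by a two-column matrix returns one cell per index row.
d$derived <- as.matrix(d[, value_cols])[idx]

print(d)
```

Note that `as.matrix()` coerces the selected columns to a common type, so this sketch is best suited to the case where all the candidate columns hold the same kind of value.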
I need a few volunteers to “test pilot” the development version of the R package cdata, please. Jacqueline Cochran: at the time of her death, no other pilot held more speed, distance, or altitude records in aviation history than Cochran.
Estimated reading time: 3 minutes
Win-Vector LLC recently announced the rquery R package, an operator-based query generator. In this note I want to share some exciting and favorable initial rquery benchmark timings.
Estimated reading time: 5 minutes
A big “thank you!!!” to Microsoft for hosting our new introduction to seplyr. If you are working with R and big data I think the seplyr package can be a valuable tool.
Estimated reading time: 1 minute
Authors: John Mount and Nina Zumel. Introduction: In teaching thinking in terms of coordinatized data we find the hardest operations to teach are joins and pivots. One thing we commented on is that moving data values into columns, or into a “thin” or entity/attribute/value form (often called “un-pivoting”, “stacking”, “melting” […]
Estimated reading time: 11 minutes
Authors: John Mount and Nina Zumel. Introduction: It has been our experience when teaching the data wrangling part of data science that students often have difficulty understanding the conversion between row-oriented and column-oriented data formats (what is commonly called pivoting and un-pivoting). (Boris Artzybasheff illustration.) Real trust and […]
Estimated reading time: 30 minutes