A kind reader recently shared the following comment on the Practical Data Science with R 2nd Edition live-site: "Thanks for the chapter on data frames and data.tables. It has helped me overcome an obstacle freeing me from a lot of warnings telling me my data table was not a real […]
I’d like to share some new timings on a grouped in-place aggregation task. A client of mine was seeing some slow performance, so I decided to time a very simple abstraction of one of the steps of their workflow.
This article introduces the data_algebra project: a data processing tool family available in R and Python. These tools are designed to transform data either in-memory or on remote databases. In particular we will discuss the Python implementation (also called data_algebra) and its relation to the mature R implementations (rquery and […]
In this note we share a quick study timing how long it takes to perform some simple data manipulation tasks with R data.frames. We are interested in the time needed to select a column, alter a column, or select a row. Knowing what is fast and what is slow is […]
Let’s try some "ugly corner cases" for data manipulation in R. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong. Let’s see what happens when we try to […]
This note is a comment on some of the timings shared in the dplyr-0.8.0 pre-release announcement. The original published timings were as follows: With performance metrics: measurements are marketing. So let’s dig into the above a bit.
Our interference-from-the-environment issue was a bit subtle. But there are variations that can be a bit more insidious. Please consider the following.
It is no great secret: I like value-oriented interfaces that preserve referential transparency. It is the side of the public debate I take in R programming. "One of the most useful properties of expressions is that called by Quine referential transparency. In essence this means that if we wish […]
I’ve ended up (almost accidentally) collecting a number of different solutions to the “use a column to choose values from other columns in R” problem. Please read on for a brief benchmark comparing these methods/solutions.
We recently saw a great recurring R question: “how do you use one column to choose a different value for each row?” That is: how do you use a column as an index? Please read on for some idiomatic base R, data.table, and dplyr solutions.