It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns measurement and process_that_produced_measurement. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, […]

Estimated reading time: 6 minutes

Imagine that in the course of your analysis, you regularly require summaries of numerical values. For some applications you want the mean of that quantity, plus/minus a standard deviation; for other applications you want the median, and perhaps an interval around the median based on the interquartile range (IQR). In […]

Estimated reading time: 4 minutes

I am working on some practical articles on variable selection, especially in the context of step-wise linear regression and logistic regression. One thing I noticed while preparing some examples is that summaries such as model quality (especially out of sample quality) and variable significances are not quite as simple as […]

Estimated reading time: 6 minutes

Monads are a formal theory of composition where programmers get to invoke some very abstract mathematics (category theory) to argue the minutia of annotating, scheduling, sequencing operations, and side effects. On the positive side the monad axioms are a guarantee that related ways of writing code are in fact substitutable […]

Estimated reading time: 11 minutes

Let’s take a break from statistics and data science to think a bit about programming language theory, and how the theory relates to the programming language used in the R analysis platform (the language is technically called “S”, but we are going to just call the whole analysis system “R”). […]

Estimated reading time: 19 minutes

We’ve just finished off a series of articles on some recent research results applying differential privacy to improve machine learning. Some of these results are pretty technical, so we thought it was worth working through concrete examples. And some of the original results are locked behind academic journal paywalls, so […]

Estimated reading time: 1 minute

One of the things I like about R is: because it is not used for systems programming you can expect to install your own current version of R without interference from some system version of R that is deliberately being held back at some older version (for reasons of script […]

Estimated reading time: 5 minutes

Modern text encoding is a convoluted mess where costs can easily exceed benefits. I admit we are in a world that has moved beyond ASCII (which at best served only English, and even then without full punctuation). But modern text encoding standards (utf-x, Unicode) have metastasized to the point you […]

Estimated reading time: 15 minutes

Here is a neat example of famous mathematician Pál Erdős (often rendered in English as Paul Erdős) writing like a programmer in 1961. He goes to some trouble to introduce notation that allows him to index everything from zero.

Estimated reading time: 1 minute

For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book Hands-On Programming with R: make a function that returns a list of functions. This turns out to be a classic functional programming techique: use closures to implement objects […]

Estimated reading time: 19 minutes