Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that […]

Estimated reading time: 14 minutes

Programmers should definitely know how to use R. I don’t mean they should switch from their current language to R, but they should think of R as a handy tool during development.

Estimated reading time: 11 minutes

One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have […]

Estimated reading time: 18 minutes

With the well-deserved popularity of A/B testing, computer scientists are finally becoming practicing statisticians. One part of experiment design that has always been particularly hard to teach is how to pick the size of your sample. The two points that are hard to communicate are that: The required sample […]

Estimated reading time: 18 minutes

This is a tutorial on how to try out a new package in R. The summary is: expect errors, search out errors, and don’t start with the built-in examples or real data. Suppose you want to try out a novel statistical technique. A good fraction of the time R […]

Estimated reading time: 14 minutes

Note February 11, 2020: this article is out of date; we suggest using the methods of “Using PostgreSQL in R: A quick how-to” instead. We discuss a “medium scale data” technique that we call “SQL Screwdriver.” Previously we discussed some of the issues of large scale data analytics. A lot […]

Estimated reading time: 11 minutes

One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that.

Estimated reading time: 24 minutes

We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up the calculation of gradients.

Estimated reading time: 1 minute

This article is a worked-out exercise in applying the Scala type system to solve a small scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion.

Estimated reading time: 36 minutes

Having worked with Unix (BSD, HPUX, IRIX, Linux and OSX) and Windows (NT4, 2000, XP, Vista and 7) for quite a while, I have seen a lot of different software tools. I would like to quickly exhibit my “must have” list. These are the packages that I find to be the […]

Estimated reading time: 4 minutes