## Data Science, Machine Learning, and Statistics: what is in a name?

A fair complaint when seeing yet another “data science” article is to say: “this is just medical statistics” or “this is already part of bioinformatics.” We certainly label many articles as “data science” on this blog. Probably the complaint is slightly cleaner if phrased as “this is already known statistics.” […]

## Don’t use correlation to track prediction performance

Using correlation to track model performance is “a mistake that nobody would ever make” combined with a vague “what would be wrong if I did do that” feeling. I hope after reading this feel a least a small urge to double check your work and presentations to make sure you […]

## Level fit summaries can be tricky in R

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them.

## How robust is logistic regression?

Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for […]

## Modeling Trick: Impact Coding of Categorical Variables with Many Levels

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some […]

## Correlation and R-Squared

What is R2? In the context of predictive models (usually linear regression), where y is the true outcome, and f is the model’s prediction, the definition that I see most often is: In words, R2 is a measure of how much of the variance in y is explained by the […]

## The Simpler Derivation of Logistic Regression

Logistic regression is one of the most popular ways to fit models for categorical data, especially for binary response data. It is the most important (and probably most used) member of a class of models called generalized linear models. Unlike linear regression, logistic regression can directly predict probabilities (values that […]

## Your Data is Never the Right Shape

One of the recurring frustrations in data analytics is that your data is never in the right shape. Worst case: you are not aware of this and every step you attempt is more expensive, less reliable and less informative than you would want. Best case: you notice this and have […]

## Learn Logistic Regression (and beyond)

One of the current best tools in the machine learning toolbox is the 1930s statistical technique called logistic regression. We explain how to add professional quality logistic regression to your analytic repertoire and describe a bit beyond that.

## Living in A Lognormal World

Recently, we had a client come to us with (among other things) the following question: Who is more valuable, Customer Type A, or Customer Type B? This client already tracked the net profit and loss generated by every customer who used his services, and had begun to analyze his customers […]