I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) […]

Estimated reading time: 17 minutes

Given the range of wants, diverse data sources, required innovation and methods it often feels like data science projects are immune to planning, scoping and tracking. Without a system to break a data science project into smaller observable components you greatly increase your risk of failure. As a followup to […]

Estimated reading time: 10 minutes

XCOM: Enemy Unknown is a turn based video game where the player choses among actions (for example shooting an alien) that are labeled with a declared probability of success. Image copyright Firaxis Games A lot of gamers, after missing a 80% chance of success shot, start asking if the game’s […]

Estimated reading time: 33 minutes

I know “officially” data scientists all always work in “big data” environments with data in a remote database, streaming store or key-value system. But in day to day work Excel files and Excel export files get used a lot and cause a disproportionate amount of pain. I would like to […]

Estimated reading time: 11 minutes

We have added a worked example to the README of our experimental logistic regression code. The Logistic codebase is designed to support experimentation on variations of logistic regression including: A pure Java implementation (thus directly usable in Java server environments). A simple multinomial implementation (that allows more than two possible […]

Estimated reading time: 11 minutes

Model level fit summaries can be tricky in R. A quick read of model fit summary data for factor levels can be misleading. We describe the issue and demonstrate techniques for dealing with them.

Estimated reading time: 12 minutes

When people ask me what it means to be a data scientist, I used to answer, “it means you don’t have to hold my hand.” By which I meant that as a data scientist (a consulting data scientist), I can handle the data collection, the data cleaning and wrangling, the […]

Estimated reading time: 6 minutes

Logistic Regression is a popular and effective technique for modeling categorical outcomes as a function of both continuous and categorical variables. The question is: how robust is it? Or: how robust are the common implementations? (note: we are using robust in a more standard English sense of performs well for […]

Estimated reading time: 13 minutes

What does a generalized linear model do? R supplies a modeling function called glm() that fits generalized linear models (abbreviated as GLMs). A natural question is what does it do and what problem is it solving for you? We work some examples and place generalized linear models in context with […]

Estimated reading time: 12 minutes

One of the shortcomings of regression (both linear and logistic) is that it doesn’t handle categorical variables with a very large number of possible values (for example, postal codes). You can get around this, of course, by going to another modeling technique, such as Naive Bayes; however, you lose some […]

Estimated reading time: 13 minutes