One of the great conveniences of performing a data science style analysis using Jupyter is that Jupyter notebooks are literate containers that combine code, text, results, and graphs. This is also one of the pain points in working with Jupyter notebooks with partners or with source control. That is: Jupyter […]

Estimated reading time: 5 minutes

Data preparation and cleaning are some of the most important steps of predictive analytic and data science tasks. They are laborious, where most of the errors are made, your last line of defense against a wild data, and hold the biggest opportunities for outcome improvement. No matter how much time […]

Estimated reading time: 3 minutes

Here is a video I made showing how R should not be considered “scarier” than Excel to analysts. One of the takeaway points: it is easier to email R procedures than Excel procedures. Win-Vector’s John Mount shows a simple analysis both in Excel and in R. A save of the […]

Estimated reading time: 1 minute

“Data Science” is obviously a trendy term making it way through the hype cycle. Either nobody is good enough to be a data scientist (unicorns) or everybody is too good to be a data scientist (or the truth is somewhere in the middle). Gartner hype cycle (Wikipedia). And there is […]

Estimated reading time: 1 minute

It has been popular to complain that the current terms “data science” and “big data” are so vague as to be meaningless. While these terms are quite high on the hype-cycle, even the American Statistical Association was forced to admit that data science is actually a real thing and exists. […]

Estimated reading time: 3 minutes

There’s a new post up at the ninazumel.com blog that looks at the statistics of “verification by multiplicity” — the statistical technique that is behind NASA’s announcement of 715 new planets that have been validated in the data from the Kepler Space Telescope. We normally don’t write about science here […]

Estimated reading time: 2 minutes