Let’s please stop saying somebody isn’t a data scientist if they haven’t memorized the innards of one obscure machine learning algorithm, or can’t blow the right smoke during an interoo (“Kangaroo interview”, thanks Jim Ruppert for this term!). Let us, instead, think of the data scientist as the bus driver. It […]
I am sharing a new free video where I work through a great common argument that bounds expected excess generalization error as a ratio of model complexity (in rows) over training set size (again in rows), independent of problem dimension. (link) For more of my notes on support vector machines […]
I have a new short video lecture to share: “Classification as Censored Regression.”
I recently shared a bit of the history of The Science of Data Analysis. I thought I would follow that up with a quick chalk talk titled “What is Statistics?” (link)
I am re-reading the great statistician John W. Tukey’s paper: Tukey, John W. “The Future of Data Analysis.” Ann. Math. Statist. 33 (1962), no. 1, pp. 1–67. doi:10.1214/aoms/1177704711. https://projecteuclid.org/euclid.aoms/1177704711 I’ve taken the liberty of pulling out some quotes that are very relevant to the usual “data science is not […]
I am working on a promising new series of notes: common data science fallacies and pitfalls. (Probably still looking for a good name for the series!) I thought I would share a few thoughts on it, and hopefully not jinx it too badly.
From the frontmatter: We recommend this book! Deep Learning for Coders with fastai and PyTorch uses advanced frameworks to move quickly through concrete, real-world artificial intelligence or automation tasks. This leaves time to cover usually neglected topics, like safely taking models to production and a much-needed chapter on data ethics. […]
For all our remote learners, we are sharing a free coupon code for our R video course Introduction to Data Science. The code is ITDS2020, and can be used at this URL https://www.udemy.com/course/introduction-to-data-science/?couponCode=ITDS2020 . Please check it out and share it!
Here is a small quote from Practical Data Science with R, Chapter 1: “It is often too much to ask for the data scientist to become a domain expert. However, in all cases the data scientist must develop strong domain empathy to help define and solve the right problems.” Interested? […]
Students have asked me if it is better to use the same cross-validation plan in each step of an analysis or to use different ones. Our answer is: unless you are coordinating the many plans in some way (such as 2-way independence or some sort of combinatorial design) it is […]
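To make the term concrete: a cross-validation plan can be thought of as a fixed assignment of rows to folds, stored so it can be reused. Below is a minimal Python sketch under that assumption. The helper `make_cv_plan`, the row count, the fold count, and the seeds are all hypothetical illustrations and are not taken from the post.

```python
import random


def make_cv_plan(n_rows, n_folds, seed):
    """Hypothetical helper: build a k-fold plan as (train_idx, test_idx) pairs."""
    rng = random.Random(seed)   # seeded, so the plan is reproducible
    idx = list(range(n_rows))
    rng.shuffle(idx)
    # Deal the shuffled row indices into n_folds disjoint folds.
    folds = [sorted(idx[k::n_folds]) for k in range(n_folds)]
    plan = []
    for k, test in enumerate(folds):
        # Training rows are all rows outside the held-out fold.
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        plan.append((train, test))
    return plan


# One plan that could be stored and reused by every step of an analysis...
shared_plan = make_cv_plan(n_rows=10, n_folds=3, seed=1)
# ...versus a separately drawn plan for a later step.
other_plan = make_cv_plan(n_rows=10, n_folds=3, seed=2)
```

Using "the same plan" then means passing `shared_plan` to each step, while using "different ones" means drawing a fresh plan per step, as with `other_plan`.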