2022
Data Science: Street Fighting Statistics
Department of Statistics at the University of Illinois, STAT 447: Data Science Programming Methods, November 9, 2022
Speaker: John Mount
This talk demonstrates two simple supervised modeling tasks in R.
A video of the lecture is here, the slides are here, and the code and data (and data attributions) are here.
2021
How and why to use probability models to outperform decision rules
USF Seminar Series in Data Science, April 30, 2021, 12:30-2pm PDT
Speaker: John Mount
In this talk we discuss how and why to work with probability models instead of hard classification rules, and demonstrate effective methods to evaluate and present probability models. We end with a system for choosing, after the model is fit, the decision thresholds that maximize business utility. Example code is in Python.
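The threshold-picking idea in this abstract can be illustrated with a minimal sketch. It is not the code from the talk: the function name `best_threshold`, the payoff parameters, and the toy scores and labels are all hypothetical, chosen only to show how a threshold can be selected after the probability model is built.

```python
def best_threshold(probs, labels, value_tp, cost_fp, cost_fn=0.0, value_tn=0.0):
    """Return (threshold, utility) for the rule 'act when prob >= threshold'.

    Hypothetical helper: the payoffs are assumed business values, not
    quantities taken from the talk.
    """
    # Every distinct score is a candidate cut; infinity means "never act".
    candidates = sorted(set(probs)) + [float("inf")]
    best = None
    for t in candidates:
        utility = 0.0
        for p, y in zip(probs, labels):
            act = p >= t
            if act and y:
                utility += value_tp   # acted on a true positive
            elif act and not y:
                utility -= cost_fp    # acted on a false positive
            elif not act and y:
                utility -= cost_fn    # missed a positive
            else:
                utility += value_tn   # correctly left alone
        # Strict comparison keeps the lowest threshold among ties.
        if best is None or utility > best[1]:
            best = (t, utility)
    return best


# Toy model scores and true outcomes (made up for illustration).
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]
threshold, utility = best_threshold(probs, labels, value_tp=10.0, cost_fp=4.0)
print(threshold, utility)
```

Because the model emits probabilities rather than hard classifications, the same scores can be re-thresholded whenever the assumed payoffs change, without refitting the model.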
2020
Advanced Data Preparation for Supervised Machine Learning
Why R Webinar, May 7, 2020
Speakers: Nina Zumel, John Mount
An introduction to the principles of the vtreat package for fitting machine learning models on messy real-world data, and to its R implementation.
2019
Preparing Messy Data for Supervised Learning (Python)
PyData 2019, Los Angeles, December 28, 2019
Speakers: Nina Zumel, John Mount
An introduction to the principles of the vtreat package for fitting machine learning models on messy real-world data, and to its Python implementation.
Practical Data Science with R
Bay Area R Users Group, September 3, 2019
Speakers: Nina Zumel, John Mount
A preview of our then about-to-be-released second edition of Practical Data Science with R. We discussed the direction the R community had taken since our first edition, and how this affected the second edition. Details
2018
rquery: a Query Generator for Working With SQL Data Sources From R
Bay Area R Users Group, May 8, 2018
Speakers: John Mount
rquery is an R package for data wrangling on SQL databases and Spark. John introduces the use of rquery to produce “piped SQL” for working with remote SQL data sources via R.
Details; rquery GitHub repository
Preparing Datasets – The Ugly Truth & Some Solutions
The East Bay R Language Beginners Group, May 1, 2018
Speakers: John Mount
An introduction to vtreat in the context of an evening of talks around issues encountered with real-world datasets. Details
cdata: Fluid Data Transformations for R
Lightning Talk, Bay Area R Users Group, January 16, 2018
Speakers: John Mount
A brief introduction to the R package cdata for reshaping data. Details; Slides
Our equivalent Python package is data_algebra, which combines the functionality from the R packages rquery, rqdatatable, and cdata.
2017
Myths of Data Science: Things You Should and Should Not Believe
ODSC West 2017, November 2, 2017
Speakers: Nina Zumel
In this talk, we go back to fundamentals and look closely at some usually unexamined assumptions about statistics and machine learning. We debunk “myths” that arise in common data science tasks, and offer potential fixes to issues that can arise.
Abstract; Slides
Modeling big data with R, Sparklyr, and Apache Spark
Workshop, ODSC West 2017
Also given at Strata & Hadoop World, March 14, 2017 (in partnership with RStudio)
Speakers: John Mount
John demonstrates how to use sparklyr to analyze big data in Spark, covering filtering and manipulating Spark data to import into R and using R to run machine-learning algorithms on data in Spark. John also explores the sparklyr integration built into the RStudio IDE.
Details; Slides and materials
Standard versus non-standard calling conventions in R: examples with dplyr and replyr
Lightning Talk, Bay Area R Users Group, February 7, 2017
Speakers: John Mount
John discusses the ease and utility of standard, or parametric, interfaces (variable names passed as strings or as values held in other variables) versus non-standard interfaces (variable names captured directly from the expressions used in the call) in R, and argues the case for standard interfaces.
2016
Improving Prediction using Nested Models and Simulated Out-of-Sample Data
Women Who Code Silicon Valley, October 27, 2016
Speakers: Nina Zumel
A discussion of nested predictive models and how to properly fit them.
Details; Slides
Validating Models in R
R Day, Strata + Hadoop World San Jose, March 29, 2016
Abbreviated version given at San Francisco Data Science ODSC Meetup, March 31, 2016
Speakers: Nina Zumel and John Mount
We demonstrate a number of techniques, R packages, and code for validating predictive models.
Slides and materials
Extracts from the talk (subscription to O’Reilly required for entire talk): Part 1, Part 2
2015
An Introduction to Differential Privacy as Applied to Machine Learning
Bay Area Women in Machine Learning and Data Science Meetup, San Francisco, December 02, 2015
Speakers: Nina Zumel
A brief introduction to the ideas behind differential privacy, and a review of how differential privacy can be used to enable safer re-use of holdout data in machine learning.
Details; Slides
Prepping Data for Analysis Using R
Workshop, ODSC West 2015, November 18, 2015
Speakers: John Mount and Nina Zumel
This workshop lays out the fundamentals of preparing data and provides interactive demonstrations in the open source R analysis environment.
Statistics in the age of data science, issues you can and can not ignore
Data Science Summit & DATO Conference, July 7, 2015
Speakers: John Mount
Statistical issues from a data science perspective. Slides and materials
2013
Teaching Data Science as an Interdisciplinary Activity (with R)
Bay Area R Users Group, August 21, 2013
Speakers: John Mount
Discussion of data science and R, in the context of our (then in-progress) book Practical Data Science with R.
Details; Slides