Talks and Presentations

2020

Advanced Data Preparation for Supervised Machine Learning

Why R Webinar, May 7, 2020
Speakers: Nina Zumel, John Mount

An introduction to the principles of the vtreat package for fitting machine learning models on messy real-world data, and to its R implementation.

2019

Preparing Messy Data for Supervised Learning (Python)

PyData 2019, Los Angeles, December 28, 2019
Speakers: Nina Zumel, John Mount

An introduction to the principles of the vtreat package for fitting machine learning models on messy real-world data, and to its Python implementation.

Practical Data Science with R

Bay Area R Users Group, September 3, 2019
Speakers: Nina Zumel, John Mount

A preview of our then about-to-be-released second edition of Practical Data Science with R. We discussed the direction the R community had taken since our first edition, and how this affected the second edition. Details

2018

rquery: a Query Generator for Working With SQL Data Sources From R

Bay Area R Users Group, May 8, 2018
Speakers: John Mount

rquery is an R package for data wrangling on SQL databases and Spark. John introduces the use of rquery to produce “piped SQL” for working with remote SQL data sources via R.
Details; rquery github repository

Preparing Datasets – The Ugly Truth & Some Solutions

The East Bay R Language Beginners Group, May 1, 2018
Speakers: John Mount

An introduction to vtreat, given as part of an evening of talks on issues encountered with real-world datasets. Details

cdata: Fluid Data Transformations for R

Lightning Talk, Bay Area R Users Group, January 16, 2018
Speakers: John Mount

A brief introduction to the R package cdata for reshaping data. Details; Slides

Our equivalent Python package is data_algebra, which combines the functionality from the R packages rquery, rqdatatable, and cdata.

2017

Myths of Data Science: Things You Should and Should Not Believe

ODSC West 2017, November 2, 2017
Speakers: Nina Zumel

In this talk, we go back to fundamentals and look closely at some usually unexamined assumptions about statistics and machine learning. We debunk “myths” that arise in common data science tasks, and offer potential fixes to issues that can arise.
Abstract; Slides

Modeling big data with R, Sparklyr, and Apache Spark

Workshop, ODSC West 2017
Also given at Strata & Hadoop World, March 14, 2017 (in partnership with RStudio)
Speakers: John Mount

John demonstrates how to use sparklyr to analyze big data in Spark, covering how to filter and manipulate Spark data for import into R, and how to use R to run machine-learning algorithms on data in Spark. John also explores the sparklyr integration built into the RStudio IDE.
Details; Slides and materials

Standard versus non-standard calling conventions in R: examples with dplyr and replyr

Lightning Talk, Bay Area R Users Group, February 7, 2017
Speakers: John Mount

John discusses the ease and utility of standard or parametric interfaces (variable names passed as strings or as values inside other variables) versus non-standard interfaces (variable names captured directly from use expressions) in R, and argues the case for standard interfaces.

2016

Improving Prediction using Nested Models and Simulated Out-of-Sample Data

Women Who Code Silicon Valley, October 27, 2016
Speakers: Nina Zumel

A discussion of nested predictive models and how to properly fit them.
Details; Slides

Validating Models in R

R Day, Strata + Hadoop World San Jose, March 29, 2016
Abbreviated version given at San Francisco Data Science ODSC Meetup, March 31, 2016
Speakers: Nina Zumel and John Mount

We demonstrate a number of techniques, R packages, and code for validating predictive models.
Slides and materials
Extracts from the talk (subscription to O’Reilly required for entire talk): Part 1, Part 2

2015

An Introduction to Differential Privacy as Applied to Machine Learning

Bay Area Women in Machine Learning and Data Science Meetup, San Francisco, December 2, 2015
Speakers: Nina Zumel

A brief introduction to the ideas behind differential privacy, and a review of how differential privacy can be used to enable safer re-use of holdout data in machine learning.
Details; Slides

Prepping Data for Analysis Using R

Workshop, ODSC West 2015, November 18, 2015
Speakers: John Mount and Nina Zumel

This workshop lays out the fundamentals of preparing data and provides interactive demonstrations in the open source R analysis environment.

Statistics in the age of data science, issues you can and can not ignore

Data Science Summit & DATO Conference, July 7, 2015
Speakers: John Mount

Statistical issues from a data science perspective. Slides and materials

2013

Teaching Data Science as an Interdisciplinary Activity (with R)

Bay Area R Users Group, August 21, 2013
Speakers: John Mount

Discussion of data science and R, in the context of our (then in-progress) book Practical Data Science with R.
Details; Slides