It’s a common situation to have data from multiple processes in a “long” data format, for example a table with columns `measurement`

and `process_that_produced_measurement`

. It’s also natural to split that data apart to analyze or transform it, per-process — and then to bring the results of that data processing together, for comparison. Such a work pattern is called “Split-Apply-Combine,” and we discuss several R implementations of this pattern here. In this article we show a simple example of one such implementation, `replyr::gapply`

, from our latest package, `replyr`

.

Illustration by Boris Artzybasheff. Image: James Vaughn, some rights reserved.

The example task is to evaluate how several different models perform on the same classification problem, in terms of deviance, accuracy, precision and recall. We will use the “default of credit card clients” data set from the UCI Machine Learning Repository.

To keep this post short, we will skip over the preliminary data processing and the modeling; if you are interested, the code for the full example is available here. We will fit a logistic regression model (GLM), a generalized additive model (GAM), and a random forest model (`ranger`

implementation) to a training set, and evaluate the models’ performance on a hold-out set.

# load the file of model fitting and prediction functions source("modelfitting.R") algolist = list(glm=glm_predictor, gam=gam_predictor, rangerRF=ranger_predictor) # define outcome column and variables outcome = "defaults" varlist = ... # Fit models for each algorithm and gather together the # predictions each model makes on a test set. predictors = fit_models(algolist, outcome, varlist, train) predframe = make_predictions(predictors, test, outcome) library(replyr) replyr_summary(predframe)[, c("column", "class", "nunique")] ## column class nunique ## 2 defaults logical 2 ## 3 model character 3 ## 1 pred numeric 17973 replyr_uniqueValues(predframe, "model") ## # A tibble: 3 × 2 ## model n ## ## 1 gam 5997 ## 2 glm 5997 ## 3 rangerRF 5997

The results of the evaluation are in a single data frame `predframe`

, with columns `defaults`

(the true outcome: whether or not this customer defaulted on their loan in the next month); `pred`

(the predicted probability of default); and `model`

(the model that made the prediction).

To evaluate each model’s performance, we write a function `metric_row`

that takes a frame of predictions and true outcomes, and returns a data frame of all the performance metrics (deviance explained, accuracy, precision, and recall; the implementations for each metric are not shown here). This is the function we wish to apply to each group of data.

metric_row = function(subframe, yvar, pred, label) { confmat = cmat(subframe[[yvar]], subframe[[pred]]) devExplained = sigr::formatChiSqTest(subframe, pred, yvar)$pseudoR2 tframe = data.frame(devExplained=devExplained, accuracy=accuracy(confmat), precision=precision(confmat), recall=recall(confmat)) tframe$model = subframe[[label]][1] # assuming there is only one label tframe } # example outcome of metric_row, for the glm model metric_row(subset(predframe, model=="glm"), outcome, "pred", "glm") ## devExplained accuracy precision recall ## 1 0.1125283 0.8094047 0.7238979 0.2335329

In this case our data processing returns a one-row data frame but you could return a multirow frame. For example, if the data we process includes predictions for both the training and test sets, we could return a data frame with one row each for test and training performance.

We would like to use split-apply-combine on all the data, to return a frame of performance metrics for all the models that we evaluated. We can do that explicitly, of course (additionally sorted by deviance explained, descending):

# # Compute performance metrics for all the model types # Order by deviance explained # split(predframe, predframe$model) %>% lapply(function(fi) {metric_row(fi, outcome, 'pred', 'model')}) %>% dplyr::bind_rows() %>% dplyr::arrange(desc(devExplained))

`replyr::gapply`

provides a convenient function to wrap most of the above pipe.

# # Compute performance metrics for all the model types # Order by deviance explained # replyr::gapply(predframe, 'model', function(fi) metric_row(fi,outcome, 'pred', 'model'), partitionMethod = 'split') %>% dplyr::arrange(desc(devExplained)) ## devExplained accuracy precision recall model ## 1 0.1810591 0.8174087 0.6704385 0.3547904 gam ## 2 0.1767817 0.8180757 0.6680384 0.3645210 rangerRF ## 3 0.1125283 0.8094047 0.7238979 0.2335329 glm

The `partitionMethod = 'split'`

argument tells `gapply`

to split the data using `base::split`

, rather than partitioning the data using `dplyr::group_by`

before applying the user-supplied function. `dplyr::group_by`

is the default partitioning method, but isn’t suitable for the function (`metric_row`

) that I want to apply.

**Conclusion**

`replyr::gapply`

implements the split-order-apply pattern in a convenient wrapper function. It supports `dplyr`

grouped operations and explicit data partitioning (as in `base::split`

), and can be used on any `dplyr`

-supported back-end. The `replyr`

package is on CRAN; the most recent development version is available on Github.

Categories: Coding Computer Science data science Practical Data Science Pragmatic Data Science Pragmatic Machine Learning Programming Statistics

### nzumel

Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.