
Nifty Upcoming Enhancements to unpack/to

We have some really nifty upcoming enhancements to wrapr unpack/to.

One of the new notations is the use of := as an alternate assignment operator for unpack/to.

This lets us write code like the following.

First let’s attach our package and set up some example data.

library(wrapr)  # attach package
packageVersion("wrapr")  # confirm we have at least version 2.0.0
#> [1] ‘2.0.0’

# example data
d <- data.frame(
  x = 1:9,
  group = c('train', 'calibrate', 'test'),
  stringsAsFactors = FALSE)

base::split() is a very handy function for splitting a data frame into smaller data frames by group. For example:

print(split(d, d$group))

#> $calibrate
#>   x     group
#> 2 2 calibrate
#> 5 5 calibrate
#> 8 8 calibrate
#> 
#> $test
#>   x group
#> 3 3  test
#> 6 6  test
#> 9 9  test
#> 
#> $train
#>   x group
#> 1 1 train
#> 4 4 train
#> 7 7 train

Often we want these split data frames to be in our working environment, instead of trapped in a list. The usual way to achieve this is to store the split list in a temporary variable and then assign elements of the list into our environment one at a time. This isn’t a problem, but it also isn’t as elegant as the following.

# assign split data into environment
unpack[
  traind = train, 
  testd = test, 
  cald = calibrate
  ] := split(d, d$group)

After this step our environment has the three split data frames, using names of our choosing. For example we have:

knitr::kable(traind)

    x  group
1   1  train
4   4  train
7   7  train

Notice we didn’t need to introduce a temporary variable to hold the list of splits. This is a small thing, but it more neatly documents intent, and being elegant in the small things can help us achieve elegance in large projects.
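For comparison, the temporary-variable approach described earlier might look like the following base R sketch. The name tmp is our own placeholder, not something from wrapr.

```r
# example data, same as above
d <- data.frame(
  x = 1:9,
  group = c('train', 'calibrate', 'test'),
  stringsAsFactors = FALSE)

# hold the list of splits in a temporary variable (tmp is a placeholder name)
tmp <- split(d, d$group)

# copy each element into the working environment by name
traind <- tmp$train
testd <- tmp$test
cald <- tmp$calibrate

# clean up the temporary
rm(tmp)
```

It works, but the reader has to trace tmp through several extra lines to see the intent.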

The := notation works much like the assignment version of unpack (already available on CRAN):

# assign split data into environment
unpack[
  traind = train, 
  testd = test, 
  cald = calibrate
  ] <- split(d, d$group)

unpack and to have been designed to have very regular and versatile notation. If we prefer, we can use arrows to specify the assignments.

# assign split data into environment
unpack[
  traind <- train, 
  testd <- test, 
  cald <- calibrate
  ] <- split(d, d$group)

Or we can use a pipe to assign to the right.

split(d, d$group) %.>% 
  unpack[
    traind <- train, 
    testd <- test, 
    cald <- calibrate
    ] 

And unpack can also be used in a more traditional non-operator notation as follows.

unpack(
  split(d, d$group),
  traind <- train, 
  testd <- test, 
  cald <- calibrate
)

An interesting side-note is how similar the above form is to the following.

with(
  split(d, d$group),
  {
    traind <<- train
    testd <<- test
    cald <<- calibrate
  }
)

Though we prefer not using <<-.
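One reason to be wary of <<- (our own illustration, not something spelled out above) is that it assigns into an enclosing environment that already has the name, falling back to the global environment, so code inside a function can write outside that function. A minimal base R sketch, where f, leaked, and was_local are hypothetical names:

```r
# f, leaked, and was_local are hypothetical names used only for this illustration
f <- function() {
  with(
    list(a = 1),
    {
      leaked <<- a   # no 'leaked' in f's frame, so the search continues outward
    }
  )
  # report whether 'leaked' became a local variable of f
  exists("leaked", envir = environment(), inherits = FALSE)
}

was_local <- f()
print(was_local)                              # FALSE: not local to f
print(exists("leaked", envir = globalenv()))  # TRUE: the value landed in the global environment
```

The unpack forms above instead assign into the environment where they are called.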

All of the above is covered in detail in the vignettes (here and here), and documentation (here and here). We also have some notes on managing workspaces with these methods plus here, and using unpack with functions that return named lists (such as those in vtreat) here.
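As a small sketch of the named-list use case, here is unpack applied to the return value of a function. The function fit_summary and its fields are hypothetical stand-ins (they are not from vtreat), and the example uses the <- outer assignment that is already on CRAN.

```r
library(wrapr)  # attach package

# fit_summary is a hypothetical function that returns a named list
fit_summary <- function(v) {
  list(center = mean(v), n = length(v))
}

# unpack the returned values by name, under names of our choosing
unpack[m = center, count = n] <- fit_summary(c(1, 2, 3, 4))

print(m)      # mean of the input values
print(count)  # number of input values
```

Because the mapping is by name, the order in which fit_summary lists its results does not matter.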

To try these notation variations out before they are pushed to the CRAN version of wrapr, please try installing the development version of the package from GitHub as follows. The CRAN version of wrapr already has most of the above features, but it doesn’t yet use := for the right-to-left outside assignment step (though := can already be used for specifying the interior mapping assignments).

remotes::install_github("WinVector/wrapr")
packageVersion("wrapr")
#> [1] ‘2.0.0’

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. Hi John, if you don’t know the zeallot package you might like it. It has a similar feature; you’d do:

    library(zeallot)
    c(traind, testd, cald) %<-% split(d, d$group)

    It does some fancier things that you can see in the examples.

    1. Thanks Antoine,

      I am a fan of the zeallot package; I refer to it in the unpack reference and the original feature announcement. My issue is that positional unpacking is more natural for Python, where it is a language feature, so Python functions tend to return tuples ready for such unpacking. For example, your line of code does not put the training data in traind, as split tends to use the order implied by the levels of the splitting factor (which in turn tend to be lexicographic, as group was of class character).

      For R the natural unpacking is by name, as functions evolved without a positional unpacker and tend to return named lists. The point being: we have stronger guarantees the right value gets unpacked to the right variable when using the names present in an R named list.

      I have some new notes on the differences here. The idea is: unpack declares intent instead of relying on chains of convention.
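The ordering issue can be checked in base R without zeallot. In this sketch, parts is a placeholder name:

```r
# example data, same as in the post
d <- data.frame(
  x = 1:9,
  group = c('train', 'calibrate', 'test'),
  stringsAsFactors = FALSE)

parts <- split(d, d$group)  # parts is a placeholder name

# split() orders its result by the sorted group labels, not by first appearance
print(names(parts))
#> [1] "calibrate" "test"      "train"

# so the first element is the calibration data, not the training data
print(unique(parts[[1]]$group))
#> [1] "calibrate"

# selecting by name does not depend on that ordering
print(unique(parts[["train"]]$group))
#> [1] "train"
```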

      1. I see…
        In this specific case we could still be quite readable with:
        c(traind, testd, cald) %<-% split(d, d$group)[c("train", "test", "calibrate")]
        But I get your larger point and agree: one-to-one relationships should be explicit as much as possible. That’s why many people like glue more than sprintf, for instance. I’ll now read your notes 🙂
