
Using closures as objects in R

For more and more clients we have been using a nice coding pattern taught to us by Garrett Grolemund in his book Hands-On Programming with R: make a function that returns a list of functions. This turns out to be a classic functional programming technique: use closures to implement objects (terminology we will explain).

It is a pattern we strongly recommend, but with one caveat: it can leak references in a manner similar to the one described here. Once you work out how to stomp out the reference leaks the “function that returns a list of functions” pattern is really strong.

We will discuss this programming pattern and how to use it effectively. 

Object oriented R

Simulating objects with the “function returning list of functions” pattern

In Hands-On Programming with R Garrett Grolemund recommends a programming pattern of building a function that returns a list of functions. This is a pretty powerful pattern that uses “closures” to make a convenient object oriented programming style available to the R user.

At first this might seem unnecessary: R claims to already have many object oriented systems: S3, S4, and RC. But none of these conveniently present object oriented behavior as a programmer might expect from more classic object oriented languages (C++, Java, Python, Smalltalk, Simula …).

What are “objects”?

Like it or not, object oriented programming is a programming style centered around sending messages to mutable objects. Roughly, in object oriented programming you expect the following: there are data items (called objects, best thought of as “nouns”) that carry type information, a number of values (fields, like a structure), and methods or functions (which are sometimes thought of as verbs or messages). We expect objects to implement the following:

  • polymorphism: The same method or function call may have different implementations depending on the runtime type of one or more of its arguments. This allows important separation of concerns and generic composition. Users of an object claiming to model a 2d region that has an area() method don’t need to know if they are dealing with a square or a circle, and their code can therefore be made to work over both types of shapes (see the small sketch after this list).
  • encapsulation: fields can be hidden from casual outside observers. This allows changes of implementation, as well-behaved outside code can restrict its interactions to working with only publicly exposed methods and fields.
  • mutability: It is expected that some functions/methods are “verbs” or “messages” that cause fields in the object to change value. Immutable values are very popular in functional programming, and there certainly are such things as immutable objects. But the orientation of object oriented programming has historically been objects that change state in response to messages (such as: “increment customer count.”)
  • inheritance: objects can easily delegate parts of their implementation and declared method interfaces to other objects.
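As a small illustration of the polymorphism point (our sketch, not from the book), here is the area() example written with R’s S3 dispatch; the calling code works over both kinds of shape without knowing which it has:

# a minimal sketch of area() polymorphism via S3 dispatch
area <- function(shape) { UseMethod("area") }
area.circle <- function(shape) { pi * shape$radius^2 }
area.square <- function(shape) { shape$side^2 }

shapes <- list(structure(list(radius=1), class='circle'),
               structure(list(side=2), class='square'))
# generic code works over both kinds of shape
sapply(shapes, area)
## [1] 3.141593 4.000000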

Standard R objects

None of the common object systems in R conveniently offer the majority of these behaviors; the issues are:

  • S3: polymorphism is a name-lookup hack associated more with methods than objects, there is no encapsulation, fields are immutable (as almost all R structures are), and while objects can declare more than one class there is no real inheritance.
  • S4: considered an unreliable and expensive attempt to model C++’s object system. Not recommended by many R experts and style guides. For example, from the Google R style guide: “avoid S4 objects and methods when possible; never mix S3 and S4”.
  • RC: a reference object system. It is so different from expectations in the rest of R that it should not be used unless you have a specific need for it.

Immutability

One thing that might surprise some readers (even those familiar with R) is that we said almost all R objects are immutable. At first glance this doesn’t seem to be the case; consider the following:

 

a <- list()
print(a)
## list()
a$b <- 1
print(a)
## $b
## [1] 1

 

The list “a” sure seemed to change. In fact it did not; this is an illusion foisted on you by R using some clever variable rebinding. Let’s look at that code more closely:

 

library('pryr')
a <- list()
print(address(a))
## [1] "0x1059c5dc0"
a$b <- 1
print(address(a))
## [1] "0x105230668"

 

R simulated a mutation or change on the object “a” by re-binding a new value (the list with the extra element) to the symbol “a” in the environment we were executing in. We see this in the address change: the name “a” is no longer referring to the same value. “Environment” is a computer science term meaning a structure that binds variable names to values. R is very unusual in that most R values are immutable and R environments are mutable (the value a variable refers to can get changed out from under you). At first glance R appears to be adding an item to our list “a”, but in fact what it is doing is changing the variable name “a” to refer to an entirely new list that has one more element.

This is why we say S3 objects are in fact immutable even when they appear to accept changes. The issue is that if you attempt to change an S3 object only the one reference in your current environment will see the change; any other references bound to the original value will keep their binding to the original value and not see any update. For the most part this is good. It prevents a whole slew of “oops I only wanted to update my copy during calculation but clobbered everybody else’s value” bugs. But it also means you can’t easily use S3 objects to share changing state among different parts of your program.
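A tiny illustration of this point (our example): a second reference to the “same” list does not see the change.

a <- list(count=1)
b <- a        # "b" is bound to the same (immutable) value as "a"
a$count <- 2  # re-binds "a" to a new list; "b" keeps the original value
print(b$count)
## [1] 1
print(a$count)
## [1] 2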

Closures: “poor man’s objects”

There are some cases where you do want shared changing state. Garrett uses a nice example of drawing cards; we will use a simple example of assigning sequential IDs. Consider the following code:

 

idSource <- function() {
  nextIdVal <- 1
  list(nextID=function() { 
    r <- nextIdVal
    nextIdVal <<- nextIdVal + 1
    r
  })
}

source <- idSource()
source$nextID()
## [1] 1
source$nextID()
## [1] 2

 

The idea is the following: in R a fresh environment (that is, the structure binding variable names to values) is created during function evaluation. Any function created while evaluating our outer function has access to all variables in this environment (this environment is what is called a closure). So any names that appear free in the inner function (that is, variable names that don’t have a definition in the inner function) end up referring to variables in this new environment (or one of its parents if there is no name match). Since environments are mutable, re-binding values in this secret environment gives us mutable slots. The first gotcha is the need to use <<- or assign() to effect changes in the secret environment.
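As an illustration of that gotcha (our sketch, a deliberately broken variation of idSource()): if we use plain <- in the inner function we merely create a new local binding and the captured counter never advances.

brokenIdSource <- function() {
  nextIdVal <- 1
  list(nextID=function() {
    r <- nextIdVal
    nextIdVal <- nextIdVal + 1  # local assignment; the captured nextIdVal never changes
    r
  })
}

broken <- brokenIdSource()
broken$nextID()
## [1] 1
broken$nextID()
## [1] 1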

This behaves a lot more like what a Java or Python programmer would expect from an object and is fully idiomatic R. So if you want object-like behavior this is a tempting way to get it.

Encapsulation and inheritance

So we have shared mutable state and polymorphism, what about encapsulation and inheritance?

Essentially we do have encapsulation: you can’t find the data fields unless you deliberately poke around in the functions’ environments. The data fields are not obvious list elements, so we can consider them private.
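For example (our sketch), deliberately poking around at the idSource() object defined above looks like this:

s <- idSource()
ls(envir=environment(s$nextID))
## [1] "nextIdVal"
get('nextIdVal', envir=environment(s$nextID))
## [1] 1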

Inheritance is a bit weaker. At best we could get what is called prototype inheritance: when we create a list of functions we start with a list of default functions, passing through all of those whose names are not overridden by our new functions.
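Here is a rough sketch (ours, not from the book) of that prototype idea: start from a list of default functions and let the derived object override only the names it needs.

# prototype: a list of default functions
protoShape <- function() {
  list(describe = function() { 'a shape' },
       area     = function() { stop('area() not implemented') })
}

circle <- function(radius) {
  obj <- protoShape()                      # start from the prototype's methods
  obj$area <- function() { pi * radius^2 } # override area(), keep describe()
  obj
}

c1 <- circle(2)
c1$describe()
## [1] "a shape"
c1$area()
## [1] 12.56637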

This is only “safety by convention” (so a different breed of object orientedness than Java, but similar to Python and Javascript where you can examine raw fields easily).

Problems with R closures

There is one lingering problem with using R environments as closures: they can leak references, causing unwanted memory bloat. The reason is that, as with so many things in R, the implementation of closures is explicitly exposed to the user. This means we can’t say “a closure is the binding of free variables at the time a function was defined” (the more common usage of static or lexical closure), but instead “R functions simulate a closure by keeping an explicit reference to the environment that was active when the function was defined.” This allows weird code like the following:

 

f <- function() { print(x) }
x <- 5
f()
## [1] 5

 

In many languages the inability to bind the name “x” to a value at the time of function definition would be a caught error. With R there is no error as long as some parent of the function’s definition environment eventually binds some value to the name “x”.
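For example (our sketch): if no environment ever supplies the binding, the error shows up only when the function is called, not when it is defined.

g <- function() { print(y) }  # "y" is free and never defined anywhere
g()
## Error in print(y) : object 'y' not found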

But the real problem is that R keeps the whole environment around, including bits the interior function is not using. Consider the following code snippet:

 

library('biglm')
d <- data.frame(x=runif(100000))
d$y <- d$x>=runif(nrow(d))
formula <- 'y~x'

fitter <- function(formula,d) {
  model <- bigglm(as.formula(formula),
               d,
               family=binomial(link='logit'))
  list(predict=function(newd) {
     predict(model,
             newdata=newd,
             type='response')[,1]
  })
}

model <- fitter(formula,d)
print(head(model$predict(d)))

 

What we have done is used biglm to build a logistic regression model. We are using the “function that returns a list of functions” pattern to build a new predict() method that remembers to set the all-important type='response' argument and uses the [,1] operator to convert biglm‘s matrix return type into the more standard numeric vector return type. I.e. we are using these function wrappers to hide many of the quirks of the particular fitter (needing a family argument during fit, needing a type argument during predict, and returning a matrix instead of a vector) without having to bring in a training control package (such as caret; caret is a good package, but you should know how to implement similar effects yourself).

The hidden problem is the following: the closure or environment of the model captures the training data causing this training data to be retained (possibly wasting a lot of memory). We can see that with the following code:

 

ls(envir=environment(model$predict))
## [1] "d"       "formula" "model"  

 

This can be a big problem. A generalized linear model such as this logistic regression should really only cost storage proportional to the number of variables (in this case 1!). There is no reason to hold on to the entire data set after fitting. The leaked storage may not be obvious in all cases as the standard R size functions don’t report space used in sub-environments and the “use serialization to guess size trick” (length(serialize(model, NULL))) doesn’t report the size of any objects in the global environment (so we won’t see the leak in this case where we ran fitter() in the global environment, but we would see it if we had run fitter in a function). As we see below the model object is large.

 

sizeTest1 <- function() {
  model <- fitter(formula,d)
  length(serialize(model, NULL))
}
sizeTest1()
## [1] 1227648

 

This is what we call a “reference leak.” R doesn’t tend to have memory leaks (it has a good garbage collector). But if you are holding a reference to an object you don’t need (and you may not even know you are holding the reference!) you have loss of memory that feels just like a leak.

Here is how to fix it: build a new restricted environment that has only what you need. Here is the code:

 

#' build a new function with a smaller environment
#' @param f input function
#' @param varList names we are allowing to be captured in the closure
#' @return new function with closure restricted to varList
#' @export
restrictEnvironment <- function(f,varList) {
  oldEnv <- environment(f)
  newEnv <- new.env(parent=parent.env(oldEnv))
  for(v in varList) {
    assign(v,get(v,envir=oldEnv),envir=newEnv)
  }
  environment(f) <- newEnv
  f
}

fitter <- function(formula,d) {
  model <- bigglm(as.formula(formula),
               d,
               family=binomial(link='logit'))
  model$family$variance <- c()
  model$family$dev.resids <- c()
  model$family$aic <- c()
  model$family$mu.eta <- c()
  model$family$initialize <- c()
  model$family$validmu <- c()
  model$family$valideta <- c()
  model$family$simulate <- c()
  environment(model$terms) <- new.env(parent=globalenv())
  list(predict=
         restrictEnvironment(function(newd) {
           predict(model,
                   newdata=newd,
                   type='response')[,1]
          },
          'model'))
}

 

The bulk of this code is us stripping large components out of the bigglm model. We have confirmed the model can still predict after this, though the summary functions are going to be broken. A lot of what we took out of the model are functions carrying environments that have a sneaky reference to our data. We are not carrying multiple copies of the data, but we are carrying multiple references which will keep the data alive longer than we want. The part we actually want to demonstrate is the following wrapper:

 

restrictEnvironment(function(newd) {
           predict(model,
                   newdata=newd,
                   type='response')[,1]
          },
          'model'))

 

What restrictEnvironment does is replace the function’s captured environment with a new one containing only the variables we listed. In this case we only listed “model” as this is the only variable we actually want to retain a reference to. For more than one function we would want a version of restrictEnvironment that uses a single shared environment for a list of functions.

The cleaning procedure is actually easy (except for when we have to clean items out of other people’s structures, as we had to here). Though there is the pain that, since R doesn’t give you a list of the structures you need to retain (i.e. the list of unbound variable names in the inner function), you have to maintain this list by hand (which can get difficult if there are a lot of items; if you list 10 you have probably forgotten one).
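One possible aid here (not used in the original code, so treat it as a suggestion to verify): codetools::findGlobals() can list the names an inner function uses freely, which is a reasonable starting point for the list you hand to restrictEnvironment().

library('codetools')
# the inner predict wrapper from fitter(), written standalone
f <- function(newd) {
  predict(model, newdata=newd, type='response')[,1]
}
findGlobals(f, merge=FALSE)$variables
## should report "model" as the free variable we need to retain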

Example R code here.

 

Trying to remember which objects to allow in the captured closure environment. (Steve Martin “The Jerk” 1979, copyright the producers.)

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

13 replies

  1. Fantastic post on objects in R. Thanks for the same. My query is regarding immutability of list objects. The behaviour changes if we run the following commands from the console, vis-a-vis from inside a function.

    f <- function() {
      a <- list(b = 1)
      pryr::address(a)
      a$b <- 2
      pryr::address(a)
    }
    

    In the above example, the address of a does not change after changing b.
    If we run the following in the console one command at a time:

    a <- list(b = 1)
    pryr::address(a)
    a$b <- 2
    pryr::address(a)
    

    address of a changes after changing b.
    I could not get the reason for this difference.
    You may be able to help me understand.
    Thanks.

    1. Suman, great question. Thanks for sharing it.

      Minor quibble: I see the address changing if I call your code (had to add print to see the effect):

      # trying to observe address
      f <- function() {
        a <- list(b = 1)
        print(pryr::address(a))
        a$b <- 2
        print(pryr::address(a))
        a$b <- 3
        print(pryr::address(a))
      }
      f()
      ## [1] "0x7fc96fead638"
      ## [1] "0x7fc96f7d74a8"
      ## [1] "0x7fc96f7d7ca8"
      

      So part of the problem might be a bug in your example. However there is also a bit of an “observer effect”: passing the object to pryr::address() (or even .Internal(inspect())) can cause a special R visibility optimization to change behavior.

      Here is an effect very much like what we are concerned about:

      It looks like in some situations where R knows an object isn’t visible (likely: it has a new unique reference in a single environment and we don’t expose it to a function) R can in fact mutate the object in place. The user can’t see this (except in how it affects performance) because R doesn’t seem to apply this optimization unless it is pretty sure the user can’t see the object. Check out this benchmark!

      # timing
      f1c <- function(n,verbose,shadow) {
        v <- 1:n
        vLast <- c()
        if(shadow) {
          vLast <- v
        }
        if(verbose) {
          print(pryr::address(v))
        }
        for(i in 1:n) {
          v[i] <- v[i]^2
          if(shadow) {
             vLast <- v
          }
          if(verbose) {
             print(pryr::address(v))
          }
        }
        c()
      }
      
      f1c(5,TRUE,FALSE)
      ## [1] "0x7ff60e748958"
      ## [1] "0x7ff60e376750"
      ## [1] "0x7ff60e3767b8"
      ## [1] "0x7ff60e376820"
      ## [1] "0x7ff60e376888"
      ## [1] "0x7ff60e35e408"
      ## NULL
      print(system.time(f1c(30000,FALSE,FALSE)))
      ##    user  system elapsed 
      ##   0.033   0.002   0.035
      print(system.time(f1c(30000,FALSE,TRUE)))
      ##    user  system elapsed 
      ##   2.157   1.027   3.187
      

      We see the address changes if we look. But the code gets almost 100 times faster if we don’t make shadow copies of the reference. This makes me think R is mutating the vector in place as long as we don’t make the shadow reference.

      The idea is: if R were rebuilding the entire vector each time we update an entry then the shadow copy version should only be about twice as slow as the non-shadow version, even if R copied on reference generation (which it does not). The point being, we would at worst be looking at the difference between two vector constructions (build a new one on update and copy on reference update) versus one (build a new vector on update, but no reference update). The plausible explanation of why causing a reference update slows the code down so much is that prior to the extra reference R was doing no vector re-allocations/constructions: it was altering in place. The 100:1 slowdown likely means we forced R to move from zero vector re-allocations/constructions to at least one vector allocation/construction per pass of the loop.

      In summary: we are trying to detect R’s object creation/copying strategy by writing code that would require O(n^2) time if R copies when we alter a single position and O(n) if it can work in place. The big difference in times (especially as we vary n) seems to show we have achieved this.

      1. Suman, sorry if I wasn’t clear. It took me a lot of experiments to figure this out, so probably a few experiments and you’ll know what you want.

  2. A cleaner approach than the environment surgery in restrictEnvironment is to write a closure-creating function at the top level of your package or script, which will then only capture what you want. Here is something like

        mkPredict <- function(model) {
            function(newd) {
                   predict(model,
                           newdata=newd,
                           type='response')[,1]
                  }
        }
    

    1. Luke, thanks for a really good point.

      I understand your discomfort with directly mucking around with environment references (as traditionally these are not directly exposed to the user). It definitely feels like we are coding “around the language” and an in-language construct would be more desirable. And having a function that happens to have the right sort of closure is a clever idea.

      However, I feel it is a bit awkward to attempt to define enough “mkPredictX()” functions at the global/library level, as such functions are possibly defined far away from their use and are “kind of magic” (they work due to where they are defined, a consequence of lexical scope, but something one would have to document to prevent things from breaking if somebody were tempted to clean up code by moving a function definition closer to its use).

      I do owe you an apology- I missed that you are also cutting the lexical inheritance chains shorter (which is a nice safe thing to do). So I overlooked a very good point on your part. It makes me wonder even in my directly mucking about code if it wouldn’t be better to use new.env(parent=globalenv()) instead of new.env(parent=parent.env(environment())) (fewer chances of capturing something you don’t want, but could break some code that requires weird deep lexical inheritance).

      And trying your code I found we need the slight modification below (which I am sure you would have found quickly):

          mkPredict <- function(model) {
              model # force promise to be instantiated so it doesn't keep another environment alive
              function(newd) {
                     predict(model,
                             newdata=newd,
                             type='response')[,1]
                    }
          }
      

      Frankly the big weaknesses I would like to see fixed are:

      The awkwardness and danger of using the <<- operator to attempt to reach into the closure (you can miss and hit your inner environment, or reach too far out. Garrett used assign() to explicitly hit the right environment, but that is pretty awkward).
      Having to explicitly list what you want to save. Not too bad when you only have one item (like “model”), but a pain when you have more than one (like “model”, “variable list”, “settings”, “bounds”, “errorHandler”, …).
      The pain of actually arranging the same environment to be shared among functions in a list when we return more than one. The code could be something like the following (though we could use lapply in place of some of the loops).

      #' @param flist list of input functions
      #' @param varList names we are allowing to be captured in the closure
      #' @return new functions with closures restricted to varList
      #' @export
      restrictEnvironment <- function(flist,varList) {
        oldEnv <- environment(flist[[1]])
        newEnv <- new.env(parent=parent.env(oldEnv))
        newList <- list()
        for(v in varList) {
          assign(v,get(v,envir=oldEnv),envir=newEnv)
        }
        for(fn in names(flist)) {
          f <- flist[[fn]]
          environment(f) <- newEnv
          newList[[fn]] <- f
        }
        newList
      }
      

  3. Excellent post! You provide some very interesting approaches and comments on the rather obtuse and broken OO implementations for R. This is one of the biggest drawbacks of using R to teach general programming concepts alongside data analysis, as it’s almost useless to try to introduce OO concepts. Using something similar to the closure approach it might be possible to get to the big ideas without getting mired in the muck of S3/S4/RC.

    1. Thanks Forrest, this is a bit of a spin-off of a new training module Nina and I are putting together. R’s functional programming features are an interesting topic. R has:

      Anonymous functions
      Immutable values
      Static/lexical closures
      Lazy evaluation of arguments

      Functional programming is in a bit of a resurgence now (Clojure, F#, Haskell, Javascript, Scheme, and many more) and functional programming is much more central to R. R’s functional heritage is slightly obscured by its imperative look (the for-loops) and object add-ons (S3/S4/RC), but the core language is functional (first class functions, immutable values). The “gotcha” is the lack of modern language features (R’s functions are essentially lisp fexprs, no hygienic macros, no tail recursion elimination, questionable homoiconicity, no static type system, essentially uni-typed runtime, …).

  4. Interesting post. For this type of OO programming, I have always been using the proto package. I’ll admit that I have never really looked under the hood, but only assumed that it was just a convenient wrapper around closures, plus some goodies. Clearly your analysis is way more rigorous than mine, so I am not trying to advertise for proto, just wondering if you have seen it before and your opinion (good or bad) about it. Here is code to play with:

    library(proto)
    ############################################
    idSource <- proto(nextIdVal = 1L)
    idSource$nextID <- function(.) {
       r <- .$nextIdVal
       .$nextIdVal <- .$nextIdVal + 1L
       r
    }
    
    source <- idSource$proto()
    source$nextID()
    source$nextID()
    ############################################
    Shape  <- proto(area = function(.) stop("area() is not implemented"))
    Circle <- Shape$proto(radius = NULL,
                          new    = function(., radius) .$proto(radius = radius),
                          area   = function(.) pi * .$radius ^ 2)
    # another style of implementation: 
    Square <- Shape$proto(side   = NULL)
    Square$new <- function(., side) .$proto(side = side)
    Square$area <- function(.) .$side ^ 2
    
    c1 <- Circle$new(radius = 10)
    c1$area()
    s1 <- Square$new(side = 10)
    s1$area()
    ############################################
    Fitter <- proto()
    Fitter$new <- function(., formula, d) {
       .$model <- bigglm(as.formula(formula), d,
                       family = binomial(link = 'logit'))
       .$model$family$variance   <- NULL
       .$model$family$dev.resids <- NULL
       .$model$family$aic        <- NULL
       .$model$family$mu.eta     <- NULL
       .$model$family$initialize <- NULL
       .$model$family$validmu    <- NULL
       .$model$family$valideta   <- NULL
       .$model$family$simulate   <- NULL
    
       .$Predict <- function(., newd) predict(.$model,
                                              newdata = newd,
                                              type = 'response')[,1]
       return(.)
    }
    
    library('biglm')
    d <- data.frame(x=runif(100000))
    d$y <- d$x >= runif(nrow(d))
    formula <- 'y~x'
    m <- Fitter$new(formula, d)
    print(head(m$Predict(d)))
    length(serialize(m, NULL))
    

    1. flodel, thanks for the great resource.

      My honest answer is I wasn’t aware of the proto package. However, I agree with the principles and the code you shared looks good. Mostly I have been using the list of functions pattern for adaption. I haven’t been writing simulations in R, and I find my analyses tend not to need mutable state.

      It looks like proto overrode some operators to give member access without using the dangerous <<- operator or the cumbersome assign() method. And that is very good; ad-hoc closures behave like objects only as long as you don’t mess up. Also, plain closures don’t supply inheritance, and proto seems to address that.

      R is powerful, so with the right boilerplate and best-practices you think you can do anything. At some point you want to encapsulate these practices into a library. The neat thing is R being so powerful means the library itself can be implemented in R (not needing to break out of the language to add these effects).

      One minor reminder: your returned fitter is still large, as you would need to wrap the defined functions with something like restrictEnvironment and shrink the environment of the model$terms closure. But I suspect you understand that and left it out to keep the code more readable.

  5. I’m using the package R6 to achieve something similar inside a package of mine. However, I am reluctant to expose objects based on environments to users, because their behavior when being copied is different from what most users expect:

    idSource <- function() {
      nextId <- 1
      list(nextID=function() { 
        r <- nextId
        nextId <<- nextId + 1
        r
      })
    }
    
    source <- idSource()
    source$nextID()
    ## [1] 1
    source$nextID()
    ## [1] 2
    
    source2 <- source
    
    source2$nextID()
    ## [1] 3
    source$nextID()
    ## [1] 4
    

    Because environments are copied by reference, all copies of source share a common counter.

    1. I also would not suggest exposing too many reference-style objects to users: the reference semantics are not what an R user expects (so they violate the R user’s expectations) but are what an object oriented programmer expects. The R6 code works the same way as you describe (but obviously has a lot more convenient functionality than roll-your-own objects):

      library('R6')
      
      IdSourceGenerator <- R6Class("IdSource",
          public = list(
            nextIDval=1,
            nextId = function() {
            r <- self$nextIDval
            self$nextIDval <- self$nextIDval+1
            r
          }
        )
      )
      source <- IdSourceGenerator$new()
      source$nextId()
      ## 1
      source$nextId()
      ## 2
      s2 <- source
      s2$nextId()
      ## 3
      

      I guess the question this blog-post is answering is: “how is something like R6 possible in R as a library without calling outside of the R language?” The answer is use of environments (though I think R6 is using environments that have empty parents, a very good idea, instead of just using the lexical closures that come from function execution).

      1. Yes, R6 seems to be a good way to have objects in R. But it looks like R6 cannot sub-assign by reference.

        library(R6)
        library(data.table)
        
        a <- R6Class(
            classname = "a",
            public = list(
                x = 1:10,
                initialize = function() self,
                assign_by_ref = function() self$x[5:6] <- NA_integer_
            )
        )
        aa <- a$new()
        ls.str(aa)
        # assign_by_ref : function ()  
        # initialize : function ()  
        # x :  int [1:10] 1 2 3 4 5 6 7 8 9 10
        address(aa$x)
        # [1] "0x4416758"
        aa$assign_by_ref()
        aa$x
        # [1]  1  2  3  4 NA NA  7  8  9 10
        address(aa$x)
        # [1] "0x48851a8"
        
        DT <- setDT(list(x = 1:10))
        address(DT$x)
        # [1] "0x4898730"
        DT[5:6, x := NA_integer_]
        DT$x
        # [1]  1  2  3  4 NA NA  7  8  9 10
        address(DT$x)
        # [1] "0x4898730"
        


        gist
