One of the attractive aspects of logistic regression models (and linear models in general) is their compactness: the size of the model grows in the number of coefficients, not in the size of the training data. With R, though, glm models are not so concise; we noticed this to our dismay when we tried to automate fitting a moderate number of models (about 500 models, with on the order of 50 coefficients) to data sets of moderate size (several tens of thousands of rows). A workspace save of the models alone was in the tens of gigabytes! How is this possible? We decided to find out.
As many R users know (but often forget), a glm model object carries a copy of its training data by default. You can use the settings y=FALSE and model=FALSE to turn this off.
set.seed(2325235)

# Set up a synthetic classification problem of a given size
# and two variables: one numeric, one categorical (two levels).
synthFrame = function(nrows) {
   d = data.frame(xN=rnorm(nrows),
                  xC=sample(c('a','b'), size=nrows, replace=TRUE))
   d$y = (d$xN + ifelse(d$xC=='a', 0.2, -0.2) + rnorm(nrows)) > 0.5
   d
}

# first show that model=F and y=F help reduce model size
dTrain = synthFrame(1000)
model1 = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'))
model2 = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'),
             y=FALSE)
model3 = glm(y~xN+xC, data=dTrain, family=binomial(link='logit'),
             y=FALSE, model=FALSE)

#
# Estimate the object's size as the size of its serialization
#
length(serialize(model1, NULL))
# [1] 225251
length(serialize(model2, NULL))
# [1] 206341
length(serialize(model3, NULL))
# [1] 189562

dTest = synthFrame(100)
p1 = predict(model1, newdata=dTest, type='response')
p2 = predict(model2, newdata=dTest, type='response')
p3 = predict(model3, newdata=dTest, type='response')
sum(abs(p1-p2))
# [1] 0
sum(abs(p1-p3))
# [1] 0
So we see (as expected) that removing the training data from the model decreases the size of the model (as estimated by the size of its serialization), without affecting the model's predictions. What happens when you increase the training data size? The size of the model (with y=FALSE and model=FALSE) should not grow.
ndata = seq(from=0, to=100000, by=5000)[-1]

#
# A function to estimate the size of a model for
# our synthetic problem, with a training set of size n
#
getModelSize = function(n) {
   data = synthFrame(n)
   model = glm(y~xN+xC, data=data, family=binomial(link='logit'),
               y=FALSE, model=FALSE)
   length(serialize(model, NULL))
}

size1 = sapply(ndata, FUN=getModelSize)
library(ggplot2)
ggplot(data.frame(n=ndata, modelsize=size1),
       aes(x=n, y=modelsize)) +
   geom_point() + geom_line()
Lo and behold, we see that the model size still grows linearly in the size of the training data! The model objects are still holding something that is proportional to the size of the training data. Where?
We can use our serialization trick to find the size of the individual components of a model:
# Size (in serialized bytes) of each component of a model object
breakItDown = function(mod) {
  sapply(mod, FUN=function(x){length(serialize(x, NULL))}, simplify=TRUE)
}
Now let’s compare two models trained with datasets of different sizes (one ten times the size of the other).
mod1 = glm(y~xN+xC, data=synthFrame(1000),
           family=binomial(link='logit'),
           y=FALSE, model=FALSE)
c1 = breakItDown(mod1)

mod2 = glm(y~xN+xC, data=synthFrame(10000),
           family=binomial(link='logit'),
           y=FALSE, model=FALSE)
c2 = breakItDown(mod2)

# For pretty-printing a vector to a vertical blog-friendly format:
# return a string of the vector formatted as a column with names;
# use cat to echo the value.
vfmtN = function(v) {
  width = max(sapply(names(v), nchar))
  paste(
    sapply(1:length(v), function(i) {
      paste(format(names(v)[i], width=width), format(v[[i]]))
    }),
    collapse='\n')
}

cat(vfmtN(c1))
# coefficients      119
# residuals         18948
# fitted.values     18948
# effects           16071
# R                 261
# rank              26
# qr                35261
# family            25160
# linear.predictors 18948
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           18948
# prior.weights     18948
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              16278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91

cat(vfmtN(c2))
# coefficients      119
# residuals         198949
# fitted.values     198949
# effects           160071
# R                 261
# rank              26
# qr                359262
# family            25160
# linear.predictors 198949
# deviance          30
# aic               30
# null.deviance     30
# iter              26
# weights           198949
# prior.weights     198949
# df.residual       26
# df.null           26
# converged         26
# boundary          26
# call              373
# formula           193
# terms             836
# data              160278
# offset            18
# control           140
# method            37
# contrasts         96
# xlevels           91
Look carefully, and you will see that certain objects in the glm model are large, and growing with data size.
r = c2/c1
cat(vfmtN(r))
# coefficients      1
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# R                 1
# rank              1
# qr                10.18865
# family            1
# linear.predictors 10.49974
# deviance          1
# aic               1
# null.deviance     1
# iter              1
# weights           10.49974
# prior.weights     10.49974
# df.residual       1
# df.null           1
# converged         1
# boundary          1
# call              1
# formula           1
# terms             1
# data              9.846296
# offset            1
# control           1
# method            1
# contrasts         1
# xlevels           1

cat(vfmtN(r[r>1]))
# residuals         10.49974
# fitted.values     10.49974
# effects           9.960239
# qr                10.18865
# linear.predictors 10.49974
# weights           10.49974
# prior.weights     10.49974
# data              9.846296
Now strictly speaking, all you need to know to apply a glm model are the coefficients of the model and the appropriate link function. All the other things the glm model object carries around are for the purpose of characterizing the model. An example would be calculating coefficient significances (and really, for most purposes, one could just calculate the quantities one wants to know, save those, and throw the data away; but we're here to discuss R as it is, not as it should be). Once you've examined a model and decided that it's satisfactory, all you probably want to do is predict. So let's try trimming all those large objects away.
cleanModel1 = function(cm) {
  # just in case we forgot to set
  # y=FALSE and model=FALSE
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}

cm1 = cleanModel1(mod1)
cm2 = cleanModel1(mod2)
dTest = synthFrame(100)

p1 = predict(cm1, newdata=dTest, type='response')
# FAILS
# Error in qr.lm(object) : lm object does not have a proper 'qr' component.
# Rank zero or should not have used lm(.., qr=FALSE).
Ooops. We can’t null out the qr
member of the model object if we want to predict. Incidentally, this is related to the observation that if you try to call lm(...., y=FALSE, model=FALSE, qr=FALSE)
, the result is a model object that fails to either predict or summarize. Don’t ask me why qr=FALSE
is even an option. But back to the glm. What’s in the model’s qr
field?
breakItDown(mod1$qr)
#    qr  rank qraux pivot   tol
# 35042    26    46    34    30

breakItDown(mod2$qr)
#     qr   rank  qraux  pivot    tol
# 359043     26     46     34     30
It turns out that we don't actually need the model's qr$qr to predict, so let's trim just that away:
cleanModel2 = function(cm) {
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()
  cm
}

# More reduction in model size
length(serialize(mod2, NULL))
# [1] 1701600
cm2 = cleanModel2(mod2)
length(serialize(cm2, NULL))
# [1] 27584

# And prediction works, too
resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0
Are we done?
getModelSize = function(n) {
  data = synthFrame(n)
  model = cleanModel2(glm(y~xN+xC, data=data,
                          family=binomial(link='logit'),
                          y=FALSE, model=FALSE))
  length(serialize(model, NULL))
}

size2 = sapply(ndata, FUN=getModelSize)
ggplot(data.frame(n=ndata, modelsize=size2),
       aes(x=n, y=modelsize)) +
  geom_point() + geom_line()
The models are substantially smaller than when we started, but they still grow with training data size.
A rough explanation for this is that glm hides pointers to the environment, and things from the environment, deep in many places. We didn't notice this when we built models in the global environment, because all those pointers pointed to the same things; so even though the models are much bigger than they need to be, they are all "too big" by the same amount, and hence don't appear to grow as the training data grows. But when you build the models in a function (as we did in getModelSize()), you get more transient environments that are proportional to the size of the training data, and so model size grows with training data size. This may not seem clear, because it depends on a lot of complicated implementation details (for a taste of how complicated it can get, see here).
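To make the environment capture visible, here is a small check (not in the original post) that peeks at what the formula environment of a model built inside a function is holding on to; it assumes the synthFrame() helper defined above, and makeModel() is just an illustrative wrapper.

# Build a model inside a function, then inspect the environment
# that the terms object keeps a pointer to.
makeModel = function(n) {
  data = synthFrame(n)
  glm(y~xN+xC, data=data, family=binomial(link='logit'),
      y=FALSE, model=FALSE)
}

m = makeModel(10000)
env = attr(m$terms, ".Environment")
ls(env)
# expect to see 'data' (and 'n') listed: the whole training frame
# is being kept alive by this environment
object.size(env$data)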
After much trial and error, this is the set of fields and attributes of the model that we found were growing with data size, and that we could eliminate without breaking predict().
stripGlmLR = function(cm) {
  cm$y = c()
  cm$model = c()

  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()
  cm$linear.predictors = c()
  cm$weights = c()
  cm$prior.weights = c()
  cm$data = c()

  cm$family$variance = c()
  cm$family$dev.resids = c()
  cm$family$aic = c()
  cm$family$validmu = c()
  cm$family$simulate = c()
  attr(cm$terms, ".Environment") = c()
  attr(cm$formula, ".Environment") = c()

  cm
}

getModelSize = function(n) {
  data = synthFrame(n)
  model = stripGlmLR(glm(y~xN+xC, data=data,
                         family=binomial(link='logit'),
                         y=FALSE, model=FALSE))
  length(serialize(model, NULL))
}

size3 = sapply(ndata, FUN=getModelSize)
ggplot(data.frame(n=ndata, modelsize=size3),
       aes(x=n, y=modelsize)) +
  geom_point() + geom_line()
Yahoo! It worked! The models are constant size with respect to training data size. And prediction works.
cm2 = stripGlmLR(mod2)

resp.full = predict(mod2, newdata=dTest, type="response")
resp.cm = predict(cm2, newdata=dTest, type="response")
sum(abs(resp.full-resp.cm))
# [1] 0
Comparing the size of the final stripped-down models (in the variable size3 in the demonstration code) to the originals (size1), we find that the final model is about three-tenths of a percent of the size of the original model for small (n=5,000) training data sets, and 0.015% of the size of the original model for "large" (n=100,000) data sets. That's a heckuva savings. And we probably haven't gotten rid of all the unnecessary baggage, but at least we seem to have stopped the growth. This was enough trimming to accomplish our task for the client (producing working models that stored and loaded quickly), so we stopped.
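For the record, the percentages quoted above can be recomputed directly from the demonstration vectors; a small check, assuming size1 and size3 from the code above:

round(100 * size3/size1, 3)
# percent of the original model size, one entry per training size in ndata;
# the first (n=5000) and last (n=100000) entries give the figures quoted above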
One point and one caveat. You can null out model$family entirely; the predict function will still return its default value, the link value (that is, predict(model, newdata=data) will work). However, predict(model, newdata=data, type='response') will fail. You can still recover the response by passing the link value through the inverse link function: in the case of logistic regression, this is the sigmoid function, sigmoid(x) = 1/(1 + exp(-x)). I didn't test type='terms'.
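As an illustration of that last point, here is a minimal sketch (not from the original post) of recovering response-scale predictions after nulling out the family; it reuses mod2, dTest, and stripGlmLR() from the code above, and cmNoFam is just an illustrative name.

cmNoFam = stripGlmLR(mod2)
cmNoFam$family = c()        # throw the family away entirely

linkPred = predict(cmNoFam, newdata=dTest)  # default type='link' still works
# predict(cmNoFam, newdata=dTest, type='response') would fail here

sigmoid = function(x) { 1/(1 + exp(-x)) }
respPred = sigmoid(linkPred)                # apply the inverse link by hand
sum(abs(respPred - predict(mod2, newdata=dTest, type='response')))
# expect (approximately) 0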
The caveat: many of the other things besides predict that you might like to do with a glm model will fail on the stripped-down version: in particular summary(), anova() and step(). So any characterization that you want to do on a candidate model should be done before trimming down the fat. Once you have decided on a satisfactory model, you can strip it down and save it for use in future predictions.
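If you want to confirm for yourself which methods survive the trimming, a quick check along these lines (not from the original post) works, using mod2 and its stripped version cm2 from above:

summary(mod2)      # the untrimmed model still summarizes fine
try(summary(cm2))  # expect an error from the stripped-down model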
- You can trim lm and gam models in a similar way, too. The exact fields to trim are a bit different. We will leave this as an exercise for the reader (though a rough sketch for lm follows below).
- We are aware of the bigglm package, for fitting generalized linear models to big data. We didn't test it, but I would imagine that it doesn't have this problem. Note, though, that the problem here isn't the size of the training data per se (which is only of moderate size); it's the inordinate size of the resulting model.
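For readers who want a head start on the lm exercise above, here is a rough, untested sketch by analogy with stripGlmLR(); it is our guess rather than code from the original post, and the exact field list may need adjusting for your use case.

stripLmLR = function(cm) {
  cm$y = c()
  cm$model = c()
  cm$residuals = c()
  cm$fitted.values = c()
  cm$effects = c()
  cm$qr$qr = c()                        # keep rank/pivot, which predict needs
  attr(cm$terms, ".Environment") = c()  # drop the captured environment
  cm
}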
Nina Zumel
Data scientist with Win Vector LLC. I also dance, read ghost stories and folklore, and sometimes blog about it all.
Awesome. This is a good time to bring up “The six stages of debugging”:
1) That can’t happen.
2) That doesn’t happen on my machine.
3) That shouldn’t happen.
4) Why does that happen?
5) Oh, I see.
6) How did that ever work?
(Apologies to Kübler-Ross)
I’ve encountered a similar problem with lm() performance. I have a technique I use where I fit thousands of small lm() models on KNN selected local manifolds. It works really well on certain kinds of problems, and has the advantage of creating results which can be understood by the analyst.
I originally deployed it in Lisp, and as regression fitting is just a matrix inversion, it didn't add any noticeable overhead. When I tried the same trick on an R contract using lm(), it was beastly slow. Lots of the cruft that comes along with an lm is apparently computationally intensive. I'm assuming all such cruft is occasionally useful, but it is a bit ridiculous. Fortunately, lm.fit wasn't so bad, so I didn't have to resort to LAPACK calls, but it was one of those examples of R trying to do everything, and ending up awfully bloated.
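To get a feel for the overhead being described, here is a rough timing sketch (ours, not from the comment) that compares repeated small fits through lm() versus lm.fit(), reusing the synthFrame() helper from the post:

d = synthFrame(200)              # a small "local" data set
d$yNum = as.numeric(d$y)         # numeric response for lm()
X = model.matrix(~ xN + xC, data=d)

system.time(for(i in 1:1000) lm(yNum ~ xN + xC, data=d))
system.time(for(i in 1:1000) lm.fit(X, d$yNum))
# expect lm.fit() to be noticeably faster, since it skips the
# formula/model-frame machinery and the extra bookkeeping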
@Scott Locklin We’ve had various moments of “is R leaking memory?” in the past. Never ran it down, then Nina saw this one stick out like a sore thumb during a save. She tried
object.size()
, but it was hiding things.Absolutely wonderful post. Thank you for taking the time to thoroughly dissect the glm object in R and show people where the pain points are, and to find a way to strip a GLM object down to the minimal necessary for prediction. This is a really valuable contribution!
@Jared Knowles Thanks for your comment! I’m glad you appreciate the post; it was a bit of work to track all that hidden fat down, and I hope that other people find it helpful for their work, as well.
This begs the question of why the model object is so 'obese' in the first place. My understanding is that glm(), lm() and friends are really for interactive use when people are exploring things, and not really suitable for, e.g., use within functions or loops for other more automatic computations (of the kind in which SAS excels, of course, …). In such cases the R philosophy is to create a model object that is as complete and informative as possible, allowing maximum interrogation and use afterwards. R objects are vehicles for information.

When this obesity becomes a problem, my preference is not to pare down the fitted model object, but not to create it in the first place. Rather, I prefer to use low-level tools that just do the job in hand, such as model.matrix(), qr() and/or lm.fit() for linear models, glm.fit() for GLMs, and low-level matrix tools for standard operations. It is pretty easy to build up a special-purpose model-fitting function that returns a stripped-down object, and the tools are still there.

I should also acknowledge, though, that this is a very useful post in demonstrating the structure of the bog-standard fitted model objects and alerting people to what is going on, often without their full knowledge. With many R functions it is good to look under the bonnet (or 'hood' in the USA) sometimes – that's one reason you have the code – as well as at the help information (which practically no one understands without legal training, I know…). My take on what to do about it, though, is a bit different.
@Bill Venables Thanks for your comment!
I definitely think it R’s fault for building/leaving so much cruft on the standard call. I like your advice to try the
.fit()
forms, but I am not sure that is always convenient (it turns out you run into some complications).lm.fit()
andglm.fit()
are possibilities, but you lose the convenience of having a ready to gopredict()
function which might be inconvenient if you have data that includes factor valued columns (you need to pass the factor encoding from train to test somehow to make sure you are using the same level encodings). So you are going to want to use thefamily
and possibly some other slots from the returned object. And it turns out the serialization size of the list returned byglm.fit()
also grows proportional to the training data size (try the new code at the bottom of here). So you are again back at cherry picking which fields to save or which ones to excise.Also in packages like
mgcv
you see documentation such as: “gam.fit {mgcv} … is an internal function of package mgcv.” Which really makes you wary of calling it directly.Just two quick points in reply.
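To make the factor-encoding complication concrete, here is a minimal sketch (not part of the original comment) of fitting with glm.fit() and predicting by hand; it reuses synthFrame() from the post and assumes the test data contains no factor levels that were unseen in training.

dTr = synthFrame(1000)
dTe = synthFrame(100)

xTr = model.matrix(~ xN + xC, data=dTr)
fit = glm.fit(x=xTr, y=as.numeric(dTr$y), family=binomial(link='logit'))

# glm.fit() gives no predict() method, so we rebuild a compatible model
# matrix for the new data ourselves; this is where factor-level
# bookkeeping would bite if the levels differed between train and test.
xTe = model.matrix(~ xN + xC, data=dTe)
linkPred = xTe %*% fit$coefficients
respPred = 1/(1 + exp(-linkPred))   # inverse logit by hand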
Just two quick points in reply.

One person's "cruft" can be another's "craft". The focus of the discussion here has been on the predict() method. That's probably the most important, but I am not at all sure people will never write useful methods for fitted model objects that require some of the "cruft" for which we cannot yet see a use. There is a need to be balanced about this, and I concede that the objects, when created under the default settings, may have gone a bit overboard, but the philosophy of R favours erring on the side of completeness in this context.
Second point is a query to ponder: who is this "R" whose fault it is? It is in anyone's hands to write a package, say GLM, which fits a model and returns a stripped-down object suitable ONLY for prediction, and I think it would be a very useful contribution. We are all "R", when you think about it.
(Disclaimer: I didn’t write any of the lm or glm software, but I know and respect the people who did.)
Again: this has been a useful post and an even more useful discussion.
@Bill Venables We are probably on the same side on this one; my emphasis is just different from the tone I think I am reading in your comments.

At some point design trade-offs are so out of balance that we can actually call them wrong. It doesn't mean the developers aren't great. But just because they are great developers doesn't mean there isn't a flaw. Having structures proportional to data set size in a linear or generalized linear model (and no prominent documentation of this, or built-in way to turn them off) violates the principle of least astonishment. Remember, good statistical software is both good statistics (so it must follow good statistical practices) and good software (so it must follow good software engineering practices).
There are obvious functions that do currently use all the stored data (so I know it can't be fixed): predict(model) (which really is not an improvement over predict(model, newdata=data)). And in some sense Nina has already supplied code that strips things down to work for prediction.
Another thing R users may not know: step() takes time proportional to the data set size, even for lm(). The reason this is a surprise is that it is possible (for lm() at least) to perform all the calculations needed for a stepwise regression from the summaries X^T X and X^T y (where X is the so-called design matrix and y are the training outcomes). These summaries are of a size that is independent of the number of training rows (their size is a function only of the number of columns, which come from variables and levels). Probably the original data is being used to drive re-fits (which is probably necessary for a method general enough to work for both lm() and glm()). This means for lm() one could write a fast step package (though not everybody likes step(), though we could at least add L2 regularization at this point). The graph of the step timings is here. The timing code is here. Obviously this kind of thing is what the biglm package is for, but it would be nice if the base system lm package had done some of these things.
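As a small illustration of the claim about the summaries (ours, not part of the comment), the coefficients of an lm() fit can indeed be recovered from X^T X and X^T y alone, whose size does not depend on the number of rows; this reuses synthFrame() from the post:

d = synthFrame(10000)
d$yNum = as.numeric(d$y)
X = model.matrix(~ xN + xC, data=d)

coef(lm(yNum ~ xN + xC, data=d))      # the usual fit
solve(t(X) %*% X, t(X) %*% d$yNum)    # same coefficients, recovered from
                                      # the fixed-size summaries X^T X and X^T y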