
Survive R

New PDF slides version (presented at the Bay Area R Users Meetup October 13, 2009).

We at Win-Vector LLC appear to like R a bit more than some of our, perhaps wiser, colleagues (see: Choose your weapon: Matlab, R or something else? and R and data). While we do like R (see: Exciting Technique #1: The “R” language) we also understand the need to defend oneself against the abuse regularly dished out by R. Here we will quickly share a few fighting techniques.

If you are not already using R, the following will not mean much. If you are using R, this may scratch a few itches.

  • First: Write down everything; keep notes in a separate file.

    When you do figure out how to do something in R, the solution will be concise, powerful, completely un-mnemonic, and impossible to find again through the help system.

  • Second: Find some way to search for R answers.

    http://stackoverflow.com/questions/102056/how-to-search-for-r-materials

  • Third: Learn unclass().

    # Here is an example of fitting a generalized linear model (from the help(glm) documentation)
    ## Dobson (1990) Page 93: Randomized Controlled Trial :
    > counts <- c(18,17,15,20,10,20,25,13,12)
    > outcome <- gl(3,1,9)
    > treatment <- gl(3,3)
    > glm.D93 <- glm(counts ~ outcome + treatment, family=poisson())
    
    

    Want to get the model coefficients and don't feel like suffering through the documentation/help system? You can't usefully inspect the glm.D93 object directly, because the glm class supplies print() and summary() methods that present a tidy report and hide the details (in particular you can't see the member data). No problem, type this:

    > model <- unclass(glm.D93)
    

    The model is now a harmless list without a bunch of pesky methods hiding the information.
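
    With the class stripped, ordinary list inspection works (names() and $ are base R; coefficients is a standard component of a glm fit):

    > names(model)         # list all of the member data
    > model$coefficients   # the coefficients we were after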

  • Fourth: learn how to list class and methods.

    Often one of methods(), showMethods() or getS3method() can show you what methods are defined for a class or object. Be prepared to try them all, as they apply in different contexts.

    # let's make a tricky function
    > fe <- function(x) UseMethod("fe")
    > fe.formula <- function(x) { print('formula')}
    > fe.numeric <- function(x) { print('numeric')}
    

    How will anyone figure out what we have done?

    > class(fe)
    [1] "function"
    
    > methods(fe)
    # [1] fe.formula fe.numeric
    
    > getS3method('fe','numeric')
    # function(x) { print('numeric')}
    
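    We can also just call it and watch the dispatch (the outputs follow directly from the definitions above):

    > fe(1)
    [1] "numeric"
    > fe(y ~ x)
    [1] "formula"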

  • Fifth: Learn to stomp out attributes.

    Ever have this crud follow you around?

    > m <- summary(c(1,2))[4]
    > m
    Mean 
     1.5 
    

    Ah that’s cute: a little “Mean” tag is following the data around. But what if we try to use this value:

    > m*m
    Mean 
    2.25 
    

    Okay, now the “Mean” tag has outstayed its welcome. The fix:

    > attributes(m) <- c()
    > m*m
    [1] 2.25
    

    MUCH better.
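
    As several commenters point out below, gentler alternatives strip only the names and spare any other attributes a more complex object might carry:

    > m <- summary(c(1,2))[4]
    > names(m) <- NULL   # or: m <- as.vector(m)
    > m*m
    [1] 2.25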

  • Sixth: Swallow your pride.

    My example: does R have map structures? I have no idea and I am too ashamed to ask. However, I know I can fake it with environments (which may be “the R way to do this” or may be “a horrible abuse of the language”; I have no idea which).

    > map <- new.env(hash=TRUE,parent=emptyenv())
    > assign('dog',7,map)
    > ls(map)
    [1] "dog"
    > get('dog',envir=map)
    [1] 7
    

    That (nearly) gives you maps with string keys. For maps with numeric keys we can fake something else up with findInterval(). For maps keyed by generic comparable objects, I have no idea how you would trick R into helping. This is one reason we like to separate out all data preparation into a pre-processing step implemented in Java or SQL.
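
    A couple more environment operations round out the map analogy, and here is a sketch of the findInterval() trick for numeric keys (the keys/values names are made up):

    > exists('dog', envir=map)   # containsKey
    [1] TRUE
    > rm('dog', envir=map)       # remove
    > ls(map)
    character(0)

    > keys <- c(1.5, 2.5, 3.5)   # findInterval() needs sorted keys
    > values <- c('a','b','c')
    > values[findInterval(2.5, keys)]
    [1] "b"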

    Note important correction from Edward Ratzer: use “map <- new.env(hash=TRUE,parent=emptyenv())”; see comments.

  • Seventh: Find and rely on “the one-liners.”

    Reading in an entire comma-separated file in a single line (read.table()), re-aggregating data (table() or doBy's summaryBy()), or building an empirical cumulative distribution (ecdf()) in a single line of code is an experience not to be missed.
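
    For example (a sketch; the file name and column names are hypothetical):

    > d <- read.table('data.csv', header=TRUE, sep=',')   # whole file in one line
    > table(d$treatment)                                  # one-line re-aggregation
    > plot(ecdf(d$score))                                 # one-line empirical CDF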

The overall point is that while R has some (unnecessarily) sharp edges and pain points, it is a powerful tool worth using. I would much rather struggle through a minor R-language issue when preparing my data than do without the many special functions, distributions, fitters and plotters built into the R system.

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

22 replies

  1. One of my “aha” moments with R was actually while screwing around in Lush. I wrote a time series class, and of course wanted a plot method. Ever try plotting in Matlab? What a barking nightmare. Plotting in R is fairly nice. Plotting in Lush ended up being fairly nice too, once I realized how to do it: you need to pull a lot of cruft like “mean” around with you, and the plot method has to do a lot of work setting things up and making sure the limits are OK.
    I’m pretty sure a lot of R has junk like this; deely boppers which serve no obvious purpose, but which actually fit into a convenient slot somewhere. I kill my cruft with as.vector() most of the time.

    Did I mention there are at least 5 distinct time series classes in R, and they don’t work with each other? Most of them are no good, too: the type of thing which only understands month of year, but of course, most of the useful functions like arima() are written for them, and you can’t cast a fancy TS like XTS into a simple TS without invoking Lucifer.

    My trick for finding a function or package which does *blah* in R now is to type “CRAN *blah*” into the google machine. Same as the trick I used to use when confronted by some hairy Linux issue. After hours and hours of futilely trying to get the help tools to print out something helpful, I came to the conclusion that they are only useful if you already know the function name. And at that point, just typing the function name is often more edifying, since you might not know which namespace you're presently calling the function from.

    Fun R complaint of the month: there are no int32 or float32, and if you compile 64 bit, your ints magically become longs. So what? It’s a data language! OK, what happens if you have to dump a bunch of int32’s or float32’s to SQL? Erm, you have to call SQL kludges which do the cast, as far as I can tell.

  2. Hmm,
    2) RSeek.org is also good
    3) unclass is quite dangerous, as it removes all attribute information. I would just use names(glm.D93), which will print the names of all elements in the list. To get the coefficients from the model, coef(glm.D93) works well.
    5) attributes(x)<-c(), no, again dangerous for more complex objects. How about names(m)<-""
    for higher dimensional objects dimnames(x)<-NULL
    6) environments in R are essentially hash tables (hash=TRUE) probably as close to a map as you can get in R

    my2c
    Nicholas

  3. @John Johnson I had not tried str(), I had assumed it was similar to summary() (which is too high-level to help). I just tried str() and dput() and they are really great. Thanks!

  4. jmount :
    @John Johnson I had not tried str(), I had assumed it was similar to summary() (which is too high-level to help). I just tried str() and dput() and they are really great. Thanks!

    To successively slim down the str() output, you might try:

    str(glm.D93)

    str(glm.D93,give.attr=F)

    str(glm.D93,give.attr=F,max.level=1)

  5. To remove attributes from an atomic vector, the recommended approach is:

    > names(m) <- NULL

  6. Scott Locklin :

    Fun R complaint of the month: there are no int32 or float32, and if you compile 64 bit, your ints magically become longs. So what? It’s a data language! OK, what happens if you have to dump a bunch of int32’s or float32’s to SQL? Erm, you have to call SQL kludges which do the cast, as far as I can tell.

    What SQL Database are you interfacing with? Should be a simple fix at the C level, used with R’s .Call

  7. @dgerlanc
    I’d rather not mess with the C level (which is one more thing to learn and really breaks portability) just to fix this type problem. R’s db adapter has some issues (like not being able to control what is and is not a factor as you can with read.table()).

  8. Oh – I’m guessing that your blogging software doesn’t like the use of less than / greater than signs. :-( My comment should have recommended using

    map=new.env(hash=T,parent=emptyenv())

    You’ll see why this is needed by running

    get("map",envir=map)

    on your first construction.
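
    Without the empty parent, the lookup falls through the chain of enclosing environments to your workspace and finds the map itself. A sketch of what happens (map2 is a made-up name):

    map2=new.env(hash=T)     # parent defaults to the calling environment
    get("map2",envir=map2)   # falls through and returns the environment map2 itself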

  9. Minor correction for slide 5 with the “LearnR Example”. Not every figure from lattice is re-created in ggplot2. Most are, but there are numerous caveats along the lines of “ggplot does not support 3D graphics…”

    Another site I find helpful:
    http://addictedtor.free.fr/graphiques/

    print.default() is another way to print objects without using methods.

    Tip 8: Grow thicker skin to withstand the abuse you will receive when posting to r-help.

  10. @Edward Ratzer
    Edward, thanks for the correction. Sorry about the blogging software fighting you; once I saw your statement that you need an empty parent environment I approved your comment without checking further, because it is such an important point. But it is even nicer that you have included the code in addition to the warning.

  11. @Scott Locklin

    zoo and xts are now fairly standard for complex time series. True, many core functions still operate on ts objects. The conversion doesn't involve the Prince, though!

    Are you sure about xts and arima? xts is a superset of zoo, and the documentation indicates that ***zooreg*** objects work with arima() and brethren with no loss of information. See p6 & 7 of current zoo docs here (obtained via rseek.org, nice tip!) http://r-forge.r-project.org/plugins/scmsvn/viewcvs.php/*checkout*/trunk/inst/doc/zoo.pdf?rev=134&root=zoo

    0. No whinging before reading upper-level documentation. I agree that function help pages can be confusing at first, and that many packages lack vignettes and overview explanations. Still, there are 20 years of books for learning the *language* (counting S), and many very readable books have emerged lately (e.g. Lattice). I consider these, along with great online resources like http://www.burns-stat.com/pages/spoetry.html, to be the “M” in that venerable cry of “RTFM”.

  12. Another one that is great once you know it (but took about 2 hours to find):

    > data(Chem97, package='mlmRev')
    > subset(Chem97, score %in% c(0,10))

  13. @Kevin Wright
    Kevin wrote:
    “Tip 8: Grow thicker skin to withstand the abuse you will receive when posting to r-help.”

    Please consider an alternative Tip 8 for using the r-help list:
    Read the Posting Guide first, and then post reproducible code with a clear specification of the desired result.


    David.

  14. @David Winsemius
    On the whole I like the R community, but statistics can raise everybody’s tempers. So I do agree that the R community deserves some respect.

    I tried (and failed) to find the article where one of the fathers of Lisp admitted (using some beautiful language like “I don't believe this myself, but so many people have told me this I must accept it is true”) that he felt one of the reasons Lisp is being outcompeted by its newer and weaker rivals (Python, Ruby, Arc) was that the Lisp community was rude to beginners. Responding curtly to the ill-formed questions posed by beginners had the effect of chasing beginners away (and without beginners you eventually have nobody left).

  15. Tip 8: Sometimes the answer really is in the help (after being let down a bunch of times by the help system I got out of the habit of checking it, but that isn't a good idea).

  16. @jmount

    Found it: Dan Weinreb’s blog September 18th, 2008 http://danweinreb.org/blog/the-failure-of-lisp-a-reply-to-brandon-werner :

    I am distressed and sad to hear that the community is judgmental and unfriendly to newcomers and thorny and un-inspiring. I have heard this same criticism from other people than you, and at this point I assume it must really be true. My own point of view is, of course, entirely different from that of newcomers, so it’s probably harder for me to see that this is going on. Indeed, to me it seems that people do get answers on comp.lang.lisp and LispForum, and the tone doesn’t seem so nasty to me, usually. Maybe I’m just not “getting it”.

    A friend’s blog ( http://erehweb.wordpress.com/2009/11/09/gold-and-intrinsic-value/ ) found a similarly amusing quote on a different topic:

    President Ulysses S. Grant:

    “A noun is the name of a thing,” which I had also heard my Georgetown teachers repeat until I had come to believe it.
