Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.
From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.
When porting code from one language to another you hope the expressive power and style of the languages are similar.
- If the source language is too weak then the original code will be very long (and essentially over specified), meaning a direct transliteration will be unlikely to be efficient, as you are not using the higher order operators of the target language.
- If the source language is too strong you will have operators that don’t have direct analogues in the target language.
SAS has some strong and powerful operators. One such is what I am calling “the vectorized block if(){}else{}”. From SAS documentation:
The subsetting IF statement causes the DATA step to continue processing only those raw data records or those observations from a SAS data set that meet the condition of the expression that is specified in the IF statement.
That is a really wonderful operator!
R has some available related operators: base::ifelse(), dplyr::if_else(), and dplyr::mutate_if(). However, none of these has the full expressive power of the SAS operator, which can per data row:
- Conditionally choose where different assignments are made to (not just choose conditionally which values are taken).
- Conditionally specify blocks of assignments that happen together.
- Be efficiently nested and chained with other IF statements.
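To make the capabilities listed above concrete, here is a minimal base R sketch of what a per-row block if/else has to do when written by hand. The data frame d and its columns (a, b, r, edited) are hypothetical illustration data, not taken from any ported system. The intended row-wise logic is: if a + b > 1 then clear a when r >= 0.5, otherwise clear b, and in either case set edited to 1.

```r
# Hypothetical example data; column names are for illustration only.
d <- data.frame(
  a = c(0, 0, 1, 1, 1),
  b = c(0, 1, 0, 1, 1),
  r = c(0.1, 0.9, 0.2, 0.7, 0.3),
  edited = 0
)

# Evaluate each test once and store it, so that every assignment
# belonging to the same block sees the same decision.
test_outer <- (d$a + d$b) > 1
test_inner_then <- test_outer & (d$r >= 0.5)
test_inner_else <- test_outer & !(d$r >= 0.5)

# The condition chooses WHICH column is assigned to, not only which value.
d$a <- ifelse(test_inner_then, 0, d$a)
d$b <- ifelse(test_inner_else, 0, d$b)

# This assignment belongs to the outer block, so it fires on both inner branches.
d$edited <- ifelse(test_outer, 1, d$edited)

print(d)
```

Each test has to be stored and reused so that the assignments in a block stay consistent with each other. That bookkeeping is manageable for one small block, but it becomes tedious and error-prone when many nested blocks are ported.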
To help achieve such expressive power in R Win-Vector is introducing seplyr::if_else_device(). When combined with seplyr::partition_mutate_se() you get a good high performance simulation of the SAS power in R. These are now available in the open source R package seplyr.
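As a hedged sketch of how these two functions might fit together (based on our reading of the package help, so please treat help(if_else_device) as the authoritative form), a nested block can be written as a small "program" of named string expressions. The data frame d and its columns are hypothetical, and the sketch assumes the dplyr and seplyr packages are installed.

```r
library("dplyr")
library("seplyr")

# Hypothetical example data; column names are for illustration only.
d <- data.frame(
  a = c(0, 0, 1, 1, 1),
  b = c(0, 1, 0, 1, 1),
  r = c(0.1, 0.9, 0.2, 0.7, 0.3),
  edited = 0
)

# Assignments are named character vectors:
# the name is the target column, the value is the expression to assign.
# Per row: if a + b > 1, clear a (when r >= 0.5) or b (otherwise),
# and in either case mark the row as edited.
program <- if_else_device(
  testexpr = "(a + b) > 1",
  thenexprs = c(
    if_else_device(
      testexpr = "r >= 0.5",
      thenexprs = c("a" = "0"),
      elseexprs = c("b" = "0")),
    c("edited" = "1")))

# Split the program into mutate stages that respect dependencies.
plan <- partition_mutate_se(program)

# Apply the stages, then keep only the original columns
# (dropping any temporary test columns the device introduced).
res <- d %>%
  mutate_seplan(plan) %>%
  select(a, b, r, edited)
print(res)
```

The partitioning step matters because later assignments depend on the stored test columns, and a single mutate() on a remote source such as Spark may not allow a value to be created and used in the same step.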
For more information please reach out to us here at Win-Vector or try help(if_else_device).
Also, we will publicize more documentation and examples shortly (especially showing big data scale use with Apache Spark via Sparklyr).
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Just a note to clarify. if_else_device is critical if you are working with sparklyr (where you really have to work using dplyr notation, and must carefully manage dependencies). For in-memory data frames, one does not care as much.
The SAS documentation you quote isn’t appropriate, since it’s talking about a very specific (and honestly bad) feature of the IF statement: a lone IF statement, when not followed by THEN, is taken to mean “IF THEN delete;”. The DATA step stops processing the current record and doesn’t include it in the output.
The usual IF-THEN/ELSE statements do what you’re talking about. A link to that documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202239.htm
Nathan, thanks for the correction. I wish I had looked a little longer at the documentation before picking my link.