Vectorized Block ifelse in R

Win-Vector LLC has been working on porting some significant large scale production systems from SAS to R.

From this experience we want to share how to simulate, in R with Apache Spark (via Sparklyr), a nifty SAS feature: the vectorized “block if(){}else{}” structure.

When porting code from one language to another you hope the expressive power and style of the languages are similar.

  • If the source language is too weak then the original code will be very long (and essentially over-specified), so a direct transliteration is unlikely to be efficient, as it will not use the higher-order operators of the target language.
  • If the source language is too strong you will have operators that don’t have direct analogues in the target language.

SAS has some strong and powerful operators. One such is what I am calling “the vectorized block if(){}else{}”. From the SAS documentation:

The subsetting IF statement causes the DATA step to continue processing only those raw data records or those observations from a SAS data set that meet the condition of the expression that is specified in the IF statement.

That is a really wonderful operator!

R has some available related operators: base::ifelse(), dplyr::if_else(), and dplyr::mutate_if(). However, none of these has the full expressive power of the SAS operator, which can per data row:

  • Conditionally choose where different assignments are made to (not just choose conditionally which values are taken).
  • Conditionally specify blocks of assignments that happen together.
  • Be efficiently nested and chained with other IF statements.
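To make the three points above concrete, here is a minimal base R sketch of one way to emulate a nested per-row block if/else using nothing but guarded ifelse() assignments. The data frame and its column names (a, b, r, edited) are made up for illustration, and this shows the general idea only; it is not the seplyr implementation.

```r
# Hypothetical toy data: where both a and b are set, clear one of them.
d <- data.frame(
  a = c(0, 1, 1, 1),
  b = c(1, 0, 1, 1),
  r = c(0.1, 0.9, 0.3, 0.8),
  edited = FALSE
)

# The per-row block we want to simulate:
#   if (a + b > 1) {
#     if (r >= 0.5) { a <- 0 } else { b <- 0 }
#     edited <- TRUE
#   }

# Land every test before any column is modified, so later
# assignments cannot change which branch a row takes.
outer_test <- (d$a + d$b) > 1
inner_then <- outer_test & (d$r >= 0.5)
inner_else <- outer_test & !(d$r >= 0.5)

# One guarded assignment per statement in the block; rows that fail
# the guard keep their current value.
d$a      <- ifelse(inner_then, 0, d$a)
d$b      <- ifelse(inner_else, 0, d$b)
d$edited <- ifelse(outer_test, TRUE, d$edited)

print(d)
```

The condition picks which column is written (a or b), the edited flag changes together with that write, and the inner test is nested by and-ing it with the outer one. Writing this by hand gets tedious and error-prone as blocks grow, which is the bookkeeping a helper can take over.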

To help achieve such expressive power in R, Win-Vector is introducing seplyr::if_else_device(). Combined with seplyr::partition_mutate_se(), it gives a good, high-performance simulation of the SAS feature in R. Both are now available in the open source R package seplyr.

For more information please reach out to us here at Win-Vector or try help(if_else_device).

We will also publish more documentation and examples shortly (especially showing big data scale use with Apache Spark via Sparklyr).

jmount

Data Scientist and trainer at Win-Vector LLC. One of the authors of Practical Data Science with R.

3 replies

  1. Just a note to clarify. if_else_device is critical if you are working with sparklyr (where you really have to work using dplyr notation, and must carefully manage dependencies). For in-memory data frames, one does not care as much.

  2. The SAS documentation you quote isn’t appropriate, since it’s talking about a very specific (and honestly bad) feature of the IF statement: a lone IF statement, when not followed by THEN, is taken to mean “IF THEN delete;”. The DATA step stops processing the current record and doesn’t include it in the output.

    The usual IF-THEN/ELSE statements do what you’re talking about. A link to that documentation: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000202239.htm
