
Let’s Have Some Sympathy For The Part-time R User

When I started writing about methods for better "parametric programming" interfaces for dplyr for R users in December of 2016, I encountered three divisions in the audience:

  • dplyr users who had such a need, and wanted such extensions.
  • dplyr users who did not have such a need ("we always know the column names").
  • dplyr users who found the then-current fairly complex "underscore" and lazyeval system sufficient for the task.

Needing name substitution is a problem an advanced full-time R user can solve on their own. However, a part-time R user would greatly benefit from a simple, reliable, readable, documented, and comprehensible packaged solution.

Background

Roughly I suggested two possible methods for making the task easier:

  • Renaming views for data.frames. I have now implemented the idea as a call-scoped concept in replyr::replyr_apply_f_mapped() ("call-scoped", meaning the re-mapping lasts for the duration of a function call).
  • Symbol re-binding by a block-scoped command called let() (a common functional notation; and "block-scoped" meaning the re-mapping lasts for the span of a code-block). I released this solution to CRAN and publicly announced it on December 8 2016.

I mention dates to point out that this is something I have been inviting public comment on for some time.

Things change. Since the above time:

  • The development version of dplyr incorporated a new rlang/tidyeval package (probably around February 14th 2017).
  • rlang/tidyeval was released to CRAN in May 2017. Obviously rlang/tidyeval had been under development for some time, but I don’t think the parametric aspect of it was publicly discussed much before February 16, 2017 (notice that a formula-centric interface was still being contemplated).
  • dplyr 0.7.0 was released, based on rlang/tidyeval (June 9th, 2017).
  • dplyr excised direct use of lazyeval.
  • The dplyr "underscore verbs" (or methods) were all deprecated (i.e., no longer advised).

The rlang/tidyeval strategy is to capture un-evaluated user expressions (as a new object called a "quosure") and evaluate them with new language rules (with new bindings and something called an "overscope"). Also note the rlang/tidyeval strategy is full integration or re-writing of packages in terms of rlang/tidyeval; this isn’t something you mix-in or turn on or off.
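
As a rough, self-contained sketch of that capture-then-evaluate idea (toy data and names invented here; this is not code taken from dplyr or rlang internals):

library("rlang")

toyData <- data.frame(x = c(1, 2, 3))
captured <- quo(x * 10)              # an un-evaluated user expression (a "quosure")
eval_tidy(captured, data = toyData)  # evaluated with toyData's columns visible as bindings
## [1] 10 20 30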

Some points I think that have been under-represented in previous discussions include:

  • Not all R users consider themselves to be expert programmers (many are happy calling themselves analysts or statisticians).
  • R is often used in collaborative projects where there are varying levels of programming expertise.

The second point I think is particularly interesting. It means:

An R user who does not consider themselves an expert programmer could be maintaining code that they understand, but could not be expected to create from scratch.

Or:

Let’s have some sympathy for the part-time R user.

This is the point we will emphasize in our new example.

The example

The design and discussion of substitution solutions should be driven by concrete, realistic use cases. Working from larger examples gives us a taste of what working with each solution is like in practice. So, let’s pretend to discuss social science (instead of programming).

Suppose an analyst, psychologist, medical doctor, or scientist is building an assessment for some aspects of behavior and anxiety.

Often such assessments involve moving through a multiple-choice questionnaire and collecting a number of points that depend on the answers selected. One such assessment is the Generalized Anxiety Disorder 7 questionnaire (or GAD-7). It is a very simple system, as can be seen below.

[Figure: the GAD-7 questionnaire and its point-scoring scheme]

One can treat such a test score as a classifier and assess it in terms of sensitivity, specificity, and different correspondence measures.
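
To make that concrete, here is a minimal sketch (synthetic labels, synthetic scores, and an arbitrary illustrative threshold; none of these numbers come from the GAD-7 itself) of assessing a score threshold as a classifier:

truth     <- c(1, 1, 1, 0, 0, 0, 0, 1)      # expert diagnosis (1 = condition present), synthetic
score     <- c(12, 9, 15, 3, 8, 11, 2, 14)  # questionnaire totals, synthetic
predicted <- as.numeric(score >= 10)        # threshold chosen purely for illustration
c(sensitivity = sum(predicted == 1 & truth == 1)/sum(truth == 1),
  specificity = sum(predicted == 0 & truth == 0)/sum(truth == 0))
## sensitivity specificity 
##        0.75        0.75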

An obvious extension of such tests is to give a different number of points in different categories for each multiple-choice answer. For example we could imagine such a test where each answer gave a varying number of points in one of two categories called "withdrawal behavior" and "positive re-framing" (both in the sense of coping behaviors).

For example, our scientist might record the results of two subjects taking a test as follows:

d <- data.frame(
  subjectID = c(1,                   
                1,
                2,                   
                2),
  surveyCategory = c(
    'withdrawal behavior',
    'positive re-framing',
    'withdrawal behavior',
    'positive re-framing'
  ),
  assessmentTotal = c(5,                 
                      2,
                      3,                  
                      4),
  stringsAsFactors = FALSE
)
  
print(d)
##   subjectID      surveyCategory assessmentTotal
## 1         1 withdrawal behavior               5
## 2         1 positive re-framing               2
## 3         2 withdrawal behavior               3
## 4         2 positive re-framing               4
# or in "wide form":
library("cdata")
moveValuesToColumns(d, 
                    columnToTakeKeysFrom = 'surveyCategory',
                    columnToTakeValuesFrom = 'assessmentTotal',
                    rowKeyColumns = 'subjectID')
##   subjectID positive re-framing withdrawal behavior
## 1         1                   2                   5
## 2         2                   4                   3

A natural question is: how does one assign weights to each answer? One way would be to administer the test to a number of people the experimenter has already classified as exhibiting one of the above-mentioned behaviors, and then perform a logistic regression to map assessment answers to the probability of a given diagnosis for this population. By re-scaling the weights and rounding them to small integers we could have a test point system that comes very close to performing a logistic regression classification. We may then be able to use the same assessment questions in a much more decisive manner than assigning all questions the same number of points.
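
A rough sketch of that fitting idea follows (the data, column names, and re-scaling rule below are all invented for illustration; they are not part of the worked example that comes next):

# synthetic training data: points selected per question plus an expert label
set.seed(2017)
n <- 200
training <- data.frame(q1 = rbinom(n, 3, 0.5),
                       q2 = rbinom(n, 3, 0.5))
training$diagnosed <- rbinom(n, 1, plogis(0.8*training$q1 + 0.4*training$q2 - 2))

# fit the logistic regression, then re-scale and round the answer weights
fit <- glm(diagnosed ~ q1 + q2, data = training, family = binomial())
answerWeights <- coef(fit)[-1]
round(answerWeights / min(abs(answerWeights)))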

This sort of idea is what one would expect from a mixed, collaborating team that includes medical experts, statistics experts, and programmers. After some work our team might find that scoring the assessment can be done with the following simple R dplyr pipeline:

suppressPackageStartupMessages(library("dplyr"))

scale <- 0.237
  
d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale)))
## # A tibble: 4 x 4
## # Groups:   subjectID [2]
##   subjectID      surveyCategory assessmentTotal probability
##       <dbl>               <chr>           <dbl>       <dbl>
## 1         1 withdrawal behavior               5   0.6706221
## 2         1 positive re-framing               2   0.3293779
## 3         2 withdrawal behavior               3   0.4410258
## 4         2 positive re-framing               4   0.5589742

For each subject we take the row with maximal probability as the diagnosis. The diagnosis was already obvious from the original scores; the main addition is that the diagnosis confidence is now available as a probability estimate.

Each step of the above pipeline is learn-able:

  • The group_by() is arranging all rows associated with the same subject to work together in later calculations.
  • The exp(assessmentTotal * scale)/sum(exp(assessmentTotal * scale)) is the classic "sigmoid link" from logistic regression. It is the standard way (once you know it) of turning a free score into a probability estimate (a small numeric check follows below).
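
For example, re-computing subject 1's two probabilities by hand reproduces the values shown above:

scale <- 0.237
scores <- c(5, 2)   # subject 1's "withdrawal behavior" and "positive re-framing" totals
exp(scores * scale) / sum(exp(scores * scale))
## [1] 0.6706221 0.3293779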

Suppose this assessment is tested and works well. It is then plausible that the team might ask their R expert to help them construct a much more complicated dplyr pipeline that better formats the results. Under Harlan Mills’ "Surgical Team" proposal (made famous in Fred Brooks’ The Mythical Man-Month) we expect effective data science teams to have a diversity of deep expertise (not everybody knows everything, but a lot is known by the total team). We expect a well-staffed research team to include the statistician who worked out the sigmoid transform above, and a programmer who works out the pipeline we give below.

d %>%
  group_by(subjectID) %>%
  mutate(probability =
           exp(assessmentTotal * scale)/
           sum(exp(assessmentTotal * scale))) %>%
  arrange(probability, surveyCategory) %>%
  mutate(isDiagnosis = row_number() == n()) %>%
  filter(isDiagnosis) %>%
  ungroup() %>%
  select(subjectID, surveyCategory, probability) %>%
  rename(diagnosis = surveyCategory) %>%
  arrange(subjectID)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

This is indeed a long (and expert-level) pipeline. But the key points are:

  • It does useful work (concentrates down to the rows we want and ensures good presentation column names and sorting).
  • While a part-time R user would not be expected to come up with it, they could (with cooperation from the pipeline author) understand all the steps and safely use the pipeline in their project.
  • The application (which we spent some time describing) is what the team cares about; the pipeline is a means to an end (so even though it is long, it isn’t often the central subject of interest).
  • The longer pipeline is paying the bills, and helping patients. So some pain and cost are to be tolerated.

Let’s take this deliberately long (so as to be a strong test) example and see how hard the pipeline is to re-use under different methodologies.

Re-use

An issue that comes up is: can the team re-use the pipeline on another project? Suppose in their next project the ID column isn’t "subjectID" but it is "patientID" (and so on). Obviously they can copy and paste the original pipeline and change the names (which is not a bad practice for the first few re-uses).

But once this procedure is going to be used many times it is a good idea to wrap it up or genericize it so it can be safely re-adapted (so the users can’t accidentally forget to change one name in one place).

I will now walk through a number of approaches to this in terms of how hard they are on the researcher. We are assuming their R expert does the wrapping for them, but then must explain the concepts to the part-time R user so they truly understand and can maintain the tools they are using.

For our example we assume all the column names are coming from variables set somewhere else (in another R script, or coming from a spreadsheet that is read into R, or some other source). The nature of the columns is constant from analysis to analysis, but the exact names used may vary. For our example the column names are:

idCol        <- "subjectID"
categoryCol  <- "surveyCategory"
linkScoreCol <- "assessmentTotal"
indicatorCol <- "isDiagnosis"
probScoreCol <- "probability"
outcomeCol   <- "diagnosis"

wrapr solution

In my opinion the easiest solution (in terms of cognitive load) is wrapr::let(). The R expert would share the following code:

library("wrapr")

let(
  c(
    IDCOL        = idCol,
    CATEGORYCOL  = categoryCol,
    LINKSCORECOL = linkScoreCol,
    INDICATORCOL = indicatorCol,
    PROBSCORECOL = probScoreCol,
    OUTCOMECOL   = outcomeCol
  ),
  
  d %>%
    group_by(IDCOL) %>%
    mutate(PROBSCORECOL =
             exp(LINKSCORECOL * scale)/
             sum(exp(LINKSCORECOL * scale))) %>%
    arrange(PROBSCORECOL, CATEGORYCOL) %>%
    mutate(INDICATORCOL = row_number() == n()) %>%
    filter(INDICATORCOL) %>%
    ungroup() %>%
    select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
    rename(OUTCOMECOL = CATEGORYCOL) %>%
    arrange(IDCOL)
)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

The concept is:

"let() works as if you had written the code with the names substituted as shown in the c() block."

And there is ample documentation showing how this can be used. Notice creating this code is completely mechanical (replace concrete names with the all-caps place holders) and the execution has an easy mental model (the place-holders are replaced with names stored in the variables).
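
As a tiny self-contained illustration of that mental model (the placeholder names and the "doubledScore" column are invented for this sketch and are not part of the main example):

library("dplyr")
library("wrapr")

# behaves as if we had typed: d %>% mutate(doubledScore = assessmentTotal * 2)
let(
  c(NEWCOL   = "doubledScore",
    SCORECOL = "assessmentTotal"),
  
  d %>% mutate(NEWCOL = SCORECOL * 2)
)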

In this solution the adapted code looks like the original code.

replyr solution

The next easiest method in concept is replyr_apply_f_mapped().

The R expert would write the following, and the part-time R user (with some coaching) could maintain it.

library("replyr")

d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL        = idCol,
      CATEGORYCOL  = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL   = outcomeCol
    ),
    
    f = . %>%
      group_by(IDCOL) %>%
      mutate(PROBSCORECOL =
               exp(LINKSCORECOL * scale)/
               sum(exp(LINKSCORECOL * scale))) %>%
      arrange(PROBSCORECOL, CATEGORYCOL) %>%
      mutate(INDICATORCOL = row_number() == n()) %>%
      filter(INDICATORCOL) %>%
      ungroup() %>%
      select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
      rename(OUTCOMECOL = CATEGORYCOL) %>%
      arrange(IDCOL)
  )
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

What the code does is exactly this:

  • It renames all of the columns in the data.frame to have the chosen names (in this case the all-caps names).
  • It then applies the user-supplied function f() to this data.frame.
  • The reverse of the name-mapping is applied to the result of f(), moving columns back to their original names.

The concept is:

replyr_apply_f_mapped() renames columns and back.

Below is an illustrative example showing the column names seen inside and outside the user supplied function.

print(colnames(d))
## [1] "subjectID"       "surveyCategory"  "assessmentTotal"
d %>%
  replyr_apply_f_mapped(
    nmap = c(
      IDCOL        = idCol,
      CATEGORYCOL  = categoryCol,
      LINKSCORECOL = linkScoreCol,
      INDICATORCOL = indicatorCol,
      PROBSCORECOL = probScoreCol,
      OUTCOMECOL   = outcomeCol
    ),
    
    f = function(df) {
      df$PROBSCORECOL <- 1
      print(colnames(df))
      return(df)
    }
  ) %>%
  colnames()
## [1] "IDCOL"        "CATEGORYCOL"  "LINKSCORECOL" "PROBSCORECOL"

## [1] "subjectID"       "surveyCategory"  "assessmentTotal" "probability"

This is teachable and something the part-time R user can correctly extend and maintain, though the user may need to learn about wrapping a pipeline as an anonymous function (the ". %>%" notation, sketched below).
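
For reference, here is a tiny sketch of that notation (the particular pipeline chosen is arbitrary):

library("dplyr")

firstTwoRows <- . %>% arrange(subjectID) %>% head(2)   # ". %>%" builds a re-usable pipeline
firstTwoRows(d)                                        # same as: d %>% arrange(subjectID) %>% head(2)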

rlang/tidyeval solution

For the rlang/tidyeval solution the expert writes the following code:

IDSYM        <- rlang::sym(idCol)
CATEGORYSYM  <- rlang::sym(categoryCol)
LINKSCORESYM <- rlang::sym(linkScoreCol)
INDICATORSYM <- rlang::sym(indicatorCol)
PROBSCORESYM <- rlang::sym(probScoreCol)
OUTCOMESYM   <- rlang::sym(outcomeCol)

d %>%
  group_by(!!IDSYM) %>%
  mutate(!!PROBSCORESYM :=
           exp((!!LINKSCORESYM) * scale)/
           sum(exp((!!LINKSCORESYM) * scale))) %>%
  arrange(!!PROBSCORESYM, !!CATEGORYSYM) %>%
  mutate(!!INDICATORSYM := row_number() == n()) %>%
  filter(!!INDICATORSYM) %>%
  ungroup() %>%
  select(!!IDSYM, !!CATEGORYSYM, !!PROBSCORESYM) %>%
  rename(!!OUTCOMESYM := !!CATEGORYSYM) %>%
  arrange(!!IDSYM)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

Several points have to be taught to the part-time R user if this code is to be maintained:

  • The "!!" symbol does not have the same operator precedence as an assignment symbols such as "=" or ":=", so you must often place "!!"-expressions in extra parentheses.
  • In any assignment we must use ":=" for assignment when using "!!" on the left-hand side of the assignment.

The above are just some syntax edge cases; we haven’t even gone into teaching rlang::sym(), "!!", and the theory and semantics of quasi-quotation.

seplyr solution

seplyr is an experiment to see what a referentially transparent (or completely value oriented) interface to dplyr would look like. Please don’t think of seplyr as an adapter (though it is, it sends all work to dplyr), but as an illustration of what a completely value-oriented dplyr might look like (i.e., one that did not capture un-evaluated user code through non-standard evaluation). Roughly seplyr is an experiment of the form: "what if one tried harder with something like the new dplyr::*_at() verbs."

Most of the seplyr methods are named *_se() and are designed to be very similar to their dplyr equivalents (and some are nearly identical to dplyr::*_at() methods, rename_se() being a notable exception).

library("seplyr")
suppressPackageStartupMessages(library("glue"))

d %>%
  group_by_se(idCol) %>%
  mutate_se(probScoreCol :=
           glue('exp({linkScoreCol} * scale)/
                  sum(exp({linkScoreCol} * scale))')) %>%
  arrange_se(c(probScoreCol, categoryCol)) %>% 
  mutate_se(indicatorCol := "row_number() == n()") %>%
  filter_se(indicatorCol) %>%
  ungroup() %>%
  select_se(c(idCol, categoryCol, probScoreCol)) %>%
  rename_se(outcomeCol := categoryCol) %>%
  arrange_se(idCol)
## # A tibble: 2 x 3
##   subjectID           diagnosis probability
##       <dbl>               <chr>       <dbl>
## 1         1 withdrawal behavior   0.6706221
## 2         2 positive re-framing   0.5589742

The concept is:

"Only mutate needs non-standard evaluation."

seplyr accepts general expressions in many more places, but with proper organization and using a few temp-columns you really only need the full generality in mutate().

seplyr has its own issues:

  • It also needs a ":=" operator for assignment.
  • It insists on multiple arguments coming in as vectors (hence the use of "c()" throughout).
  • It runs into a bit of trouble with verbs that take expressions (mutate_se() being the most complicated): it needs a helper to substitute the variable holding the column name into the expression, which seplyr later replaces with the actual column name. In this example we used glue::glue() to perform the substitution, but we could also try paste0() or gsub() (a paste0() version is sketched after this list).
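
For comparison, the expression string built by glue() above can also be assembled with base paste0() (this is just a sketch of the alternative mentioned in the last point):

expr <- paste0("exp(", linkScoreCol, " * scale)/",
               "sum(exp(", linkScoreCol, " * scale))")
expr
## [1] "exp(assessmentTotal * scale)/sum(exp(assessmentTotal * scale))"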

The lesson from seplyr is that the mutate() verb does indeed need some kind of expression-manipulation tooling (direct string manipulation feels too crude). However, for the rest of the verbs the value-oriented notation is in fact quite natural, and really in no sense inferior to the dplyr originals.

Conclusion

Name substitution is a reasonable need that arises when re-using R work or when trying to iterate over column names. I have been publicly exploring variations of substitution systems so that R users can make an informed choice of one or more that best meets their needs and addresses their personal trade-offs between power, safety, readability, and teachability. These sections are not each independent "yet another way of performing name substitution", but parts of a public conversation that should be had before name substitution is considered settled and fixed in stone.

A part-time R user will not have the background to quickly compare all of the available substitution systems. In fact such a user will only come to need a substitution system when they have a problem, so by definition they are in the middle of some other task. It is up to expert partners to evaluate and explain alternatives.

There is a temptation that if you are going to teach only one system it might as well be rlang/tidyeval, as "that is what now comes with dplyr". I feel this is a false savings: while rlang/tidyeval "is already in dplyr", the rlang/tidyeval concepts and details are not "already in the user" (and in fact include a fair number of irregular exceptions that need to be taught and memorized).

Our preference is: wrapr::let(). wrapr::let() delivers a lot of (safe) power for a modest amount of cognitive load. Each of the above systems involves different trade-offs and compromises, and we feel one must really try a few in production before having an expert opinion.

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

10 replies

  1. This is a good post. I’ve been finding that I use wrapr::let when parametrization can be accomplished with substitutions alone, and tidyeval when something more complex is needed, e.g. when parameters are lists or other data structures and the code needs to manipulate them. Then UQ and UQS come in handy – I could provide an example if that’s of interest. In this brave new world we need to think about best practices across the spectrum, from part-timers to expert library developers (I feel like which one I am changes by the day of the week!)

    I’ve been meaning to write something like replyr_apply_f_mapped for some time, but I never got around to it because I couldn’t figure out how to deal with the renaming potentially clobbering existing columns in the data frame. It looks like you basically went with a sort of implicit namespacing – the function’s formal arguments are in ALLCAPS while the inputs are lowercase or camelCase. Is that the basic idea here?

    1. Thanks Paul.

      I agree: a mixed wrapr::let() tidyeval approach is a good idea (keeping each one to what it does best).

      To get the replyr_apply_f_mapped() renaming to work safely I did two things:

      I suggest the user use an implicit namespace that is separate (i.e., the all-caps names). As you implied: it is a real issue that the user could map into a column they did not know about (as they don’t know all the other column names) and lose data. Only adding a temporary select, or renaming every column we are not working with to safe names, would fully guard against this, but that workflow would not be very desirable.
      Under the covers replyr_apply_f_mapped() picks a set of names known to be disjoint from both the current column names and the user’s mapping targets; call these “intermediate names.” replyr_apply_f_mapped() then maps in two stages: original names to the intermediate names, and then intermediate names to target names. This allows good behavior and safe permutations of names. I deliberately don’t call this out, as it seems complicated to the new user. However, I think the entailed semantics are really wonderful.
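
      A purely illustrative sketch of that two-stage idea (this is not replyr’s actual code; the helper name and temporary-name prefix are invented):

      two_stage_rename <- function(df, nmap) {
        # nmap: c(newName = "oldName", ...); intermediates assumed disjoint from everything
        inter <- paste0(".tmp_col_", seq_along(nmap))
        for (i in seq_along(nmap)) {   # stage 1: original -> intermediate
          names(df)[names(df) == nmap[[i]]] <- inter[[i]]
        }
        for (i in seq_along(nmap)) {   # stage 2: intermediate -> target
          names(df)[names(df) == inter[[i]]] <- names(nmap)[[i]]
        }
        df
      }
      
      # even a pure name swap is handled safely:
      two_stage_rename(data.frame(a = 1, b = 2), c(a = "b", b = "a"))
      #>   b a
      #> 1 1 2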

      1. I like replyr_apply_f_mapped, it seems advantageous over wrapr::let in some respects. I had to debug a wrapr::let code block at one point and the traceback was weird – it seemed to be referencing a temporary source file created by the macro substitution process. I imagine with this approach the traceback would have been more standard.

      2. Yes, best-case when debugging a wrapr::let() block is: you end up seeing something that looks like temporary substituted source code (losing line numbers and file-refs in your original). Worst case: could be worse. Not good, but as good as that technique is going to get (without direct debugger support).

        replyr_apply_f_mapped() is actually running the code you typed in (though running it inside a different function environment, which can have its own debugging issues). So debugging can be potentially easier.

        Roughly what is going on is this: there is only so much referential transparency in the world and each of these things (re-mapping and debugging) is consuming some. Each hopes the other has left enough referential transparency on the table to meet its own needs. I have some notes on the concept here and here.

      3. I think it’s fine to suck up some namespace and referential transparency for the sake of dplyr being programmable. dplyr is a HUGE deal. Within 3 years it has basically become *the* DSL of tabular manipulation in R. It’s fair that dplyr and extensions take dibs on some of the namespace and referential transparency to make coding easier.

      4. (I say that with love for data.table, but the fact is I don’t use data.table unless I have to…)

  2. Love the post and the blog.
    At what point do you think it makes sense to turn the pipe chain into a function with named input parameters rather than this method of renaming then un-renaming? It seems the functional approach is more flexible, possibly easier to interpret, and possibly easier to maintain for “part time” R users. Creating custom functions is definitely the approach I would use for more complex operations (multiple pipe chains), so it seems there may be some advantage to using an approach that scales. Thoughts?

    1. Thomas,

      I think you make a very good point.

      My opinion is you should teach function creation early in R (as it is a very core, idiomatic use of the language). I also think one should wrap these pipelines into functions with value-oriented interfaces (column names as function arguments) very early. Nina Zumel explained this better than I did back when let() was only in the replyr package.

      I also think wrapr::let() is exactly the tool that makes writing such functions easier (some technical notes on the topic here). One problem with my earlier articles is: if I take time to show the complicated alternatives, the reader largely remembers things being complicated (when, in my opinion, the alternative I was proposing was in fact deliberately designed to be simple).
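
      As a rough sketch of that wrapping idea (the function name, the argument defaults, and the simplified filter step are mine, not code from the article above):

      library("dplyr")
      library("wrapr")
      
      pickDiagnosis <- function(data, idCol, categoryCol, linkScoreCol,
                                probScoreCol = "probability",
                                outcomeCol = "diagnosis",
                                scale = 0.237) {
        let(
          c(IDCOL        = idCol,
            CATEGORYCOL  = categoryCol,
            LINKSCORECOL = linkScoreCol,
            PROBSCORECOL = probScoreCol,
            OUTCOMECOL   = outcomeCol),
          data %>%
            group_by(IDCOL) %>%
            mutate(PROBSCORECOL =
                     exp(LINKSCORECOL * scale)/
                     sum(exp(LINKSCORECOL * scale))) %>%
            arrange(PROBSCORECOL, CATEGORYCOL) %>%
            filter(row_number() == n()) %>%   # keep the maximal-probability row per subject
            ungroup() %>%
            select(IDCOL, CATEGORYCOL, PROBSCORECOL) %>%
            rename(OUTCOMECOL = CATEGORYCOL) %>%
            arrange(IDCOL)
        )
      }
      
      pickDiagnosis(d, "subjectID", "surveyCategory", "assessmentTotal")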

  3. The new string-algebra notation makes the seplyr solution to the “sympathy for the part-time R user” problem even cleaner:

    suppressPackageStartupMessages(library("dplyr"))
    library("seplyr")
    #> Loading required package: wrapr
    
    scale <- 0.237
    
    d <- data.frame(
      subjectID = c(1,                   
                    1,
                    2,                   
                    2),
      surveyCategory = c(
        'withdrawal behavior',
        'positive re-framing',
        'withdrawal behavior',
        'positive re-framing'
      ),
      assessmentTotal = c(5,                 
                          2,
                          3,                  
                          4),
      stringsAsFactors = FALSE
    )
    
    idCol        <- "subjectID"
    categoryCol  <- "surveyCategory"
    linkScoreCol <- "assessmentTotal"
    indicatorCol <- "isDiagnosis"
    probScoreCol <- "probability"
    outcomeCol   <- "diagnosis"
    
    
    seplyr::expand_expr(exp(linkScoreCol * scale)/
                          sum(exp(linkScoreCol * scale)))
    #> [1] "exp(assessmentTotal * scale)/sum(exp(assessmentTotal * scale))"
    
    d %>%
      group_by_se(idCol) %>%
      mutate_nse(probScoreCol := exp(linkScoreCol * scale)/
                       sum(exp(linkScoreCol * scale))) %>%
      arrange_se(c(probScoreCol, categoryCol)) %>% 
      mutate_nse(indicatorCol := row_number() == n()) %>%
      filter_se(indicatorCol) %>%
      ungroup() %>%
      select_se(c(idCol, categoryCol, probScoreCol)) %>%
      rename_se(outcomeCol := categoryCol) %>%
      arrange_se(idCol)
    #> # A tibble: 2 x 3
    #>   subjectID           diagnosis probability
    #>       <dbl>               <chr>       <dbl>
    #> 1         1 withdrawal behavior   0.6706221
    #> 2         2 positive re-framing   0.5589742
    

    Or a pure “seplyr” version:

    library("seplyr")
    #> Loading required package: wrapr
    
    scale <- 0.237
    
    d <- data.frame(
      subjectID = c(1,                   
                    1,
                    2,                   
                    2),
      surveyCategory = c(
        'withdrawal behavior',
        'positive re-framing',
        'withdrawal behavior',
        'positive re-framing'
      ),
      assessmentTotal = c(5,                 
                          2,
                          3,                  
                          4),
      stringsAsFactors = FALSE
    )
    
    idCol        <- "subjectID"
    categoryCol  <- "surveyCategory"
    linkScoreCol <- "assessmentTotal"
    indicatorCol <- "isDiagnosis"
    probScoreCol <- "probability"
    outcomeCol   <- "diagnosis"
    
    d %.>%
      group_by_se(., idCol) %.>%
      mutate_nse(., probScoreCol := exp(linkScoreCol * scale)/
                   sum(exp(linkScoreCol * scale))) %.>%
      arrange_se(., c(probScoreCol, categoryCol)) %.>% 
      mutate_nse(., indicatorCol := row_number() == n()) %.>%
      filter_se(., indicatorCol) %.>%
      ungroup(.) %.>%
      select_se(., c(idCol, categoryCol, probScoreCol)) %.>%
      rename_se(., outcomeCol := categoryCol) %.>%
      arrange_se(., idCol)
    #> # A tibble: 2 x 3
    #>   subjectID           diagnosis probability
    #>       <dbl>               <chr>       <dbl>
    #> 1         1 withdrawal behavior   0.6706221
    #> 2         2 positive re-framing   0.5589742
    