Meta-packages, nails in CRAN’s coffin

Derek Jones recently discussed a possible future for the R ecosystem in “StatsModels: the first nail in R’s coffin”.

This got me thinking about the future of CRAN (which I consider vital to R, and vital in distributing our work) in the era of super-popular meta-packages. Meta-packages are convenient, but they have a profoundly negative impact on the packages they exclude.

For example: tidyverse advertises a popular R universe where the vital package data.table never existed.

And now tidymodels is shaping up to be a popular universe where our own package vtreat never existed, except possibly as a footnote to embed.

Users currently (with some luck) discover packages like ours and then (because they trust CRAN) feel able to try them. With popular walled gardens that becomes much less likely. It is one thing for a standard package to duplicate another package (it is actually hard to avoid, and how work legitimately competes); it is quite another for a big-brand meta-package to pre-pick winners (and losers).

All I can say is: please give vtreat a chance and a try. It is a package for preparing messy real-world data for predictive modeling. In addition to re-coding high-cardinality categorical variables (into what we call effect-codes after Cohen, or impact-codes), it deals with missing values, can be parallelized, can be run on databases, and has years of production experience baked in.
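For readers who have not met impact-codes before, here is a minimal sketch of the idea in plain Python. It is not vtreat's implementation or API: the function names and the `smoothing` parameter are hypothetical, chosen only for illustration. Each level of a categorical variable is replaced by how far the mean outcome for that level sits from the overall mean.

```python
from collections import defaultdict


def impact_code(levels, outcomes, smoothing=0.0):
    """Map each level to (level mean of outcome) - (grand mean of outcome).

    Illustrative only; `smoothing` is a hypothetical knob that shrinks
    rare levels toward the grand mean (0.0 means no shrinkage).
    """
    if not outcomes:
        raise ValueError("need at least one observation")
    grand_mean = sum(outcomes) / len(outcomes)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(levels, outcomes):
        sums[level] += y
        counts[level] += 1
    codes = {}
    for level, n in counts.items():
        # Blend the level's total with `smoothing` pseudo-rows at the grand mean.
        level_mean = (sums[level] + smoothing * grand_mean) / (n + smoothing)
        codes[level] = level_mean - grand_mean
    return codes


def apply_impact_code(codes, levels):
    """Encode a column; levels never seen in training get 0.0 (no evidence)."""
    return [codes.get(level, 0.0) for level in levels]


codes = impact_code(["a", "a", "b", "b"], [1.0, 3.0, 5.0, 7.0])
print(codes)  # {'a': -2.0, 'b': 2.0}
```

A sketch like this turns a variable with thousands of levels into a single numeric column. What it does not do is guard against the overfit that comes from fitting the codes and the downstream model on the same rows, which is one of the issues a production tool has to handle.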

Some places to start with vtreat:

Categories: Opinion

jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

11 replies

  1. To add insult to injury: there is a related story as to why we are sharing a vtreat working paper, and not a refereed journal article (again, issues around controlling the on-ramps; and to be clear it is one thing to have a paper rejected, quite another to not have it reviewed).

  2. I agree with this. The cross promotion available to the tidyverse/RStudio crowd kills any competition from the get-go.

    1. I am saying one editor sat on the paper for over a year and then unilaterally rejected it with essentially no reviews (they admitted the one review shared back was factually wrong/unusable, and withheld what later turned out to be a positive review). I assume the other editors had no knowledge of this and no strong opinion one way or another.

      I had actually assumed the first editor was going to be fair. I have had papers rejected, with no complaint on my part. This wasn’t that sort of outcome.

      I know this is nothing anybody (including myself) wants to think of others. But that is one of the facets typical of abuse: it isn’t something that one expects in a nice or fair world, so it is easy to discount.

      1. I’m not able to form an opinion without more information, but anyway, this is something not directly related to the topic at hand.

        About the post, I don’t see how meta-packages may affect CRAN at all. The parallelism with Derek’s post is too forced to somehow present them as inherently and evidently pernicious.

        As I see it, there are people building great software under the same principles/philosophy, covering different parts of a problem. Then they create a shortcut for a workflow that many people are already using. What is the problem? Oh, they don’t include other people’s software… of course not! Why would they do that? It’s not their software, and more importantly, it’s not under these same principles that their users quickly identify and are so familiar with.

        I understand that you may be angry due to that experience with the JSS, but I think you are chasing ghosts in this matter.

      2. Thanks for your polite and considered input.

        We don’t fully agree, but I think we have both clearly stated our opinions. So I am not going to belabor any points, other than to emphasize vtreat is a very high quality tool and compatible with many workflows.

  3. I’ve found several use cases with high cardinality categorical variables and I didn’t know about vtreat (I have used tidymodels and other approaches). I believe the tidyverse has brought a lot of good things to the R ecosystem, but I also think there are some far better alternatives for some tasks that have been overshadowed (data.table is the clearest example for me). From what I have seen vtreat is really well documented so I’ll give it a try. Having different tools to do similar tasks always brings value to the table.

  4. I too worry about the dominance of the “tidy” everything and the added help they get by being a part of the company that produces the most widely used R IDE. Of course, these are all excellent things (the packages and RStudio) so it isn’t like I want them gone but I wonder if we’re headed towards people looking at this like there are two “flavors” of R that aren’t meant to coexist in single projects. I really hope for some good competition on the IDE front in particular.

    1. I mean, there are two flavours developing, SE and NSE, which can coexist perfectly, but if I’m going to be very honest, when I was young in my R, I got most of the hateful messages about mixing both from the base-R crowd… including one person straight up refusing to review my question on SO because it was too messy (mixing NSE and SE).

      My perspective is that the easier R is to pick up, the better (I even widely promote Radiant, the click and dropdown Shiny app that converts R into a menu based system).

      This isn’t a popularity contest; if vtreat has functionality not found elsewhere, it’ll be used by people looking for “more”.

      1. I agree and strongly endorse mixed coding styles.

        My only point is: things like vtreat are more easily found when they are treated fairly (for example not excluded from lists of options by popular aggregators). It isn’t about popularity or winning: just about fairness and academic honesty.
