
Software Dependencies and Risk

Dirk Eddelbuettel just shared an important point on software and analyses: dependencies are hard-to-manage risks.

If your software or research depends on many complex and changing packages, you have no practical way to establish that your work is correct. To establish the correctness of your work, you would need to also establish the correctness of all of the dependencies. This is worse than having non-reproducible research, as your work may in fact have been wrong even the first time.

Low-dependency, low-complexity code can also be wrong, but in that case there at least exists the possibility of checking things, or of running down and fixing issues.

This is one reason we at Win-Vector LLC have been working on low-dependency R packages for data analysis. We don’t intend to control the whole analysis stack (that would be unethical), but we do intend to be in a good position to fix things for our partners and clients. The bulk of our system’s utility comes from external systems such as R itself, the data.table package, and Rcpp. So we must (and hopefully do) give credit and thanks.

Also, not all dependencies are equal. So we have had to avoid some popular packages with unstable APIs (a history of breaking changes) and high historic error rates (a history of complexity and adding features over fixing things).

Again, dependencies are but one measure of quality and at best an approximation. But let’s take a look at some of our packages through this lens.

And, almost as if to make the point, the one package where we relaxed the above discipline currently has a CRAN-flagged issue (“significant warnings”) that we cannot fix, as the issue in fact comes from one of the dependencies.

WVPlots


The issue is likely from ggplot2, which itself is likely picking up issues and errors from dplyr, tibble, and rlang (a few of ggplot2's dependencies that currently have detected, yet unfixed, issues on CRAN). And these packages are likely picking up issues from their own direct and indirect dependencies.

Now these issues are probably not serious; if they were, there would be a great panic motivating teams to fix them. (This is a neat example of survivorship bias: visible acute problems attract enough attention to be fixed quickly, but subtle chronic issues can live a long time.) The point remains: we have no lever to fix them on our end.
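The chain of direct and indirect dependencies described above is just the transitive closure of the dependency graph. Here is a hedged sketch (in Python for illustration, with entirely hypothetical package names) of how one declared dependency can commit you to trusting several indirect ones:

```python
# Toy dependency graph: package -> direct dependencies (all names hypothetical).
deps = {
    "my_plots": ["gg_layer"],
    "gg_layer": ["tbl", "lang_tools"],
    "tbl": ["lang_tools", "vctrs_like"],
    "lang_tools": [],
    "vctrs_like": [],
}

def transitive_deps(pkg, graph):
    """Return every package reachable through pkg's dependency edges."""
    seen = set()
    stack = list(graph.get(pkg, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return seen

# One declared dependency, four packages whose correctness you now rely on.
print(sorted(transitive_deps("my_plots", deps)))
# -> ['gg_layer', 'lang_tools', 'tbl', 'vctrs_like']
```

Under this sketch, vetting `my_plots` means vetting every package in that closure, not just the one line in its dependency declaration.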

Categories: Opinion Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

4 replies

  1. Avoidance of dependencies can encourage rewriting existing functionality in a package. A simple example of this is that there are many packages that independently implement the geometric mean (each with different quirks). The low-dependency philosophy encourages this duplication.

    1. Definitely a valid point. Everything is a trade-off in a world of risks and constraints.

      Duplication/re-invention and incorrect re-invention are risks of the low-dependency discipline.

      The organizing idea I have found to counter this is to maintain a distinction between service packages (which seem to benefit from being low-dependency) and applications (which are more user-facing).

      Applications (even if they are distributed as packages) tend to need to be high-dependency, bringing in a lot of packages either directly or indirectly. I would say examples of packages that look like applications (or are nearly applications) include Shiny and our own WVPlots.

      Service packages should do one family of things well and, in addition to having low dependencies, also offer a small set of services. Rcpp and rquery are examples. wrapr is more of an exception: it is a zero-dependency service package, but its services are fairly broad (though with a theme defined as wrapping R language features).

      There may be no hard-and-fast rule. I’d say a driving consideration is how much choice is left to the package user. Is the user forced to take in a lot of functionality they do not use? Are a lot of other packages rendered incompatible by bringing in a package (or, even worse, a meta-package or package of packages)? Have the packages historically been API-stable and reliable?
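The duplication risk raised in this thread can be made concrete. Below is a hedged sketch (in Python for illustration; both functions are hypothetical and taken from no particular package) of two independent geometric-mean implementations that agree on well-behaved input but diverge on edge cases such as zeros:

```python
import math

def geomean_product(xs):
    """Multiply then take the n-th root; can overflow for long inputs."""
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

def geomean_log(xs):
    """Average the logs; raises ValueError on zero or negative input."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

print(geomean_product([2.0, 8.0]))       # -> 4.0, both versions agree here
print(geomean_product([1.0, 4.0, 0.0]))  # -> 0.0
# geomean_log([1.0, 4.0, 0.0]) raises ValueError: math domain error
```

Two packages shipping these two variants would silently disagree on any input containing a zero, which is exactly the kind of quirk independent re-invention invites.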

  2. Differentiating service vs application packages makes this much more logical, though as you point out, it’s a spectrum, not discrete categories.

    The other thing that I think is usually beneficial is for service packages to hide their dependencies from the user. That way, if a dependency package goes away or makes a breaking change, the service package maintainer can replace the underlying requirement without affecting users of the service package.
