We recently commented on excess package dependencies as representing risk in the
R package ecosystem.
The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?
Well, it turns out we can quantify it: each additional non-core package declared as an “Imports” or “Depends” is associated with an extra 11% relative chance of a package having an indicated issue on CRAN. At over 5 non-core “Imports” plus “Depends” a package has significantly elevated risk.
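To get a feel for how an 11% per-dependency increase accumulates, here is a minimal sketch. It assumes the 11% acts as a multiplicative factor on the odds of an issue, as a logistic-regression coefficient would; the post does not state the model form, so treat that as an assumption. The function names and the `baseline` value are hypothetical and chosen only for illustration.

```python
# Sketch: how an 11% per-dependency relative increase compounds.
# ASSUMPTION: the 11% is a multiplicative factor on the odds of an issue
# (the post does not state the model form). All names here are hypothetical.

def compounded_factor(n_deps, per_dep_increase=0.11):
    """Multiplicative factor accumulated over n_deps dependencies."""
    return (1.0 + per_dep_increase) ** n_deps


def prob_with_deps(baseline_prob, n_deps, per_dep_increase=0.11):
    """Issue probability after scaling the baseline odds by the factor."""
    odds = baseline_prob / (1.0 - baseline_prob)
    odds *= compounded_factor(n_deps, per_dep_increase)
    return odds / (1.0 + odds)


if __name__ == "__main__":
    baseline = 0.10  # hypothetical issue rate at zero non-core dependencies
    for n in (0, 1, 5, 10):
        factor = compounded_factor(n)
        prob = prob_with_deps(baseline, n)
        print(f"{n:2d} deps: factor {factor:.3f}, probability {prob:.3f}")
```

Under this reading, five dependencies multiply the odds by about 1.69 and ten by about 2.84, whatever the baseline is. The resulting probabilities depend on the assumed baseline and are not estimates from the study.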
The number of dependent packages in use versus modeled issue probability can be summed up in the following graph.
In the above graph the dashed horizontal line is the overall rate at which packages have issues on CRAN. Notice the curve crosses the line well before 5 non-trivial dependencies.
In fact, packages importing more than 5 non-trivial dependencies have issues on CRAN at an empirical rate of 35% (above the model prediction at 5 dependencies), double the overall rate of 17%. Doubling a risk is considered very significant. And almost half the packages using more than 10 non-trivial dependencies have known issues on CRAN.
A very short analysis deriving the above can be found here.
Obviously we are using lack of problems on CRAN as a rough approximation for package quality, and the number of non-trivial package Imports and Depends as a rough proxy for package complexity. It would be interesting to quantify and control for other measures (including package complexity and purpose).
Our theory is that the imports are not so much causing problems as acting as a “code smell” correlated with other package issues. We feel this is evidence that the principles that guide some package developers to prefer packages with well-defined purposes and low dependencies are part of a larger set of principles that lead to higher quality software.
A table of all scored packages and their problem/risk estimate can be found here.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
The above is a crude step away from Deming’s “Without data, you’re just another person with an opinion.” Obviously it is going to be worse than a proposed better study, but it is in fact better than no study.
The best way to improve the above study is better measurements (package age, package intent, complexity, maybe text analysis of descriptions and documentation). But one “simple math” improvement I would love to try is to get around the “not all dependencies are equal” issue by introducing indicator variables so each package can show what set of dependencies it pulled in and then using a hierarchical or partial-pooling technique to attempt to model the risk each package represents when included in others. The math is fairly standard, but the idea isn’t quite compatible with how dummy-variables are introduced in the normal formula/model-matrix calling paths (would likely need to set up the probabilistic system by hand in
For any study (especially an initial one) one naturally wants variations. Here is a variation where we restrict “problem” to ERROR and FAIL and limit which architectures we check. Notice the result is very similar (it is in fact a robust result). We published our first study (without trying many variations) to avoid venue-shopping or p-hacking. But investigating variations after the fact is a good idea.
Also, a quantifiable claim (one that states what data it rests on) is a good platform for dispute. It can be very personal to tell somebody their opinion is “wrong”, but it is part of science to suggest variations on an analysis.
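As a rough illustration of what such a variation changes, the sketch below recomputes a problem rate under different definitions of “problem” and different sets of check platforms. The package names, platform labels, and statuses are made-up toy data, not CRAN results, and `has_problem` and `problem_rate` are hypothetical helpers rather than part of the original analysis.

```python
# Sketch: vary the definition of "problem" and the platforms checked.
# ASSUMPTION: each package maps platform label -> check status string.
# The data below is toy data invented for illustration only.

STRICT = frozenset({"ERROR", "FAIL"})          # the restricted variation
BROAD = frozenset({"WARN", "ERROR", "FAIL"})   # a hypothetical wider definition


def has_problem(checks, problem_levels=STRICT, flavors=None):
    """True if any considered platform reports a status in problem_levels.

    flavors: optional set of platform labels to consider; None means all.
    """
    return any(
        status in problem_levels
        for flavor, status in checks.items()
        if flavors is None or flavor in flavors
    )


def problem_rate(packages, **kwargs):
    """Fraction of packages flagged by has_problem (0.0 for no packages)."""
    flags = [has_problem(checks, **kwargs) for checks in packages.values()]
    return sum(flags) / len(flags) if flags else 0.0


toy_packages = {
    "pkgA": {"linux": "OK", "windows": "WARN"},
    "pkgB": {"linux": "ERROR", "windows": "OK"},
    "pkgC": {"linux": "OK", "windows": "FAIL"},
    "pkgD": {"linux": "NOTE", "windows": "OK"},
}

if __name__ == "__main__":
    print(problem_rate(toy_packages))                          # strict, all platforms
    print(problem_rate(toy_packages, flavors={"linux"}))       # strict, linux only
    print(problem_rate(toy_packages, problem_levels=BROAD))    # broad, all platforms
```

Keeping the definition and the platform filter as parameters makes it cheap to rerun the same summary under several variations and see whether the conclusion moves.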
Is it possible for R to control package versions the way Anaconda does in Python?
There are several version pinning systems: MRAN, packrat, and a few more.