Or put it another way: as R is a typical “the reference implementation is the specification” programming environment there is no true “de jure” R, only a de facto R.
To look at popular R packages I defined “popular” as used (Depends/Imports/LinkingTo) by other packages on CRAN. One could use other definitions (e.g. Github stars), but this is the one I used for this particular study.
My “quick look” (sure to anger everyone) is a couple of diagrams such as the following.
In the above diagram each node is an R package. We are restricting our selves to popular extension packages: those that have at least 1,000 indirect uses in CRAN (via Depends/Imports/LinkingTo; excluding base packages such as stats and utils).
Each node is annotated with:
- The name of the package.
- How many packages directly use the package (directly through Depends/Imports/LinkingTo), written as a count and a fraction of CRAN.
- How many packages indirectly use the package (through chains of Depends/Imports/LinkingTo), written as a count and a fraction of CRAN.
For example: data.table is directly used by 553 packages (3.9% of CRAN) and indirectly used by 1418 packages (10% of CRAN).
The arrows point from packages to packages that use them: use “points in.”
The nodes are color coded as follows:
- Green “singleton”: the package does not use another popular package, and is not used by any popular packages. Example: data.table.
- Orange “source”: the package does not use another popular package, but is used by other popular packages. Example: sys; sys is imported by askpass.
- Fuchsia “mixed”: the package is used by other popular packages, and uses other popular packages. Example akspass.
- Blue “sink”: the package is not used by other popular packages, but uses other popular packages. Example openssl; openssl uses askpass.
The thing to notice is: indirect usage count moves back across arrows. The 1004 usages recorded in openssl must be in the 1014 of askpass, which in turn are in the 1021 of sys.
Now that we have how to read the diagram lets expand from our sample of six popular packages to all popular packages (under our arbitrary 1000 uses==popular definition).
(large version; svg vector version)
(Note the graph is acyclic, any appearance of cycles (such as between pillar and tibble) is due to unfortunate plotting superimposition (pillar Imports rlang, not tibble).)
Some things to notice from the above diagram (counts can be confirmed here).
- dplyr has twice the direct usage as data.table, but both packages have similar indirect usage (each powering about 10 to 12 percent of CRAN).
- Rcpp and then lattice have the largest indirect reach.
- Rcpp and then ggplot2 have the largest direct reach.
- Blue and Green nodes pick up their popularity the hard way by collecting many small (non-popular) packages uses (either directly or indirectly). Orange and Fuchsia nodes may pick up their popularity by association with popular package, or by the hard way in-spite of the association.
- Some packages are “king makers”: that is have high reach (direct or indirect) and use many other packages: thus donating their reach back across the incoming arrows. King makers include ggplot2, tibble, dplyr, scales, pillar, stringr, and tidyselect. The idea is being used by a popular package immediately grants popularity.
- Some packages are “royal court”, used by many other popular packages (and hence can pick up count quickly). Royal court includes lattice, rlang, Rcpp, crayon, assertthat, pkgconfig, glue, magrittr and many others. The diversity of these packages show there are different royal courts. However, we can not read immediately from the diagram if royal court packages also have great indirect reach on their own (though it is obvious which have large direct reach).
- A lot of the Green packages are programming utility packages such as base64enc, mime, and codetools. In fact some of these are “priority recommended” (which I think guarantees they are available in an R install).
- These results are different from pagerank style results. For example tidyr scores high in package-pagerank calculations, but isn’t in this graph.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.