I recently got back from Strata West 2017 (where I ran a very well received workshop on
Spark). One thing that really stood out for me at the exhibition hall was
datashader from Continuum Analytics.
I had the privilege of having Peter Wang himself demonstrate
datashader for me and answer a few of my questions.
I am so excited about
datashader's capabilities that I will not wait for the functionality to be exposed in
rbokeh. I am going to leave my usual
rmarkdown world and dust off
Jupyter Notebook just to use
datashader plotting. This is worth trying, even for diehard
Every plotting system has two important ends: the grammar where you specify the plot, and the rendering pipeline that executes the presentation. Switching plotting systems means switching how you specify plots and can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots to hide or decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then sometimes you don’t have to leave: for example the absolutely amazing adapter
Currently, to use
datashader you must talk directly to
Bokeh (i.e. learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.
datashader makes points and pixels first class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite resolution abstract plane) and allows the user to specify scale dependent calculations and re-calculations over them. It is easiest to show by example.
Please take a look at these stills from the
datashader US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:
If we were to use the interactive version of this graph we could zoom in on Chicago and the majorities are re-calculated based on the new scale:
What is important to understand is that this is vastly more powerful than zooming in on a low-resolution rendering:
and even more powerful than zooming out on a static high-resolution rendering:
datashader can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.
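The checkerboard idea can be sketched in a few lines of plain Python. This is only an illustration of re-aggregating per pixel when the visible window changes; it is not datashader's implementation or API. The helper names `aggregate` and `shade` and the toy data are assumptions made up for this sketch.

```python
def aggregate(points, x_range, y_range, width, height):
    """Count points per pixel for the visible window (hypothetical helper)."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        # Only points inside the current window contribute to the image.
        if x0 <= x < x1 and y0 <= y < y1:
            col = int((x - x0) / (x1 - x0) * width)
            row = int((y - y0) / (y1 - y0) * height)
            grid[row][col] += 1
    return grid


def shade(grid):
    """Renormalize counts to 0..1 relative to the densest visible pixel."""
    peak = max(max(row) for row in grid)
    return [[(c / peak if peak else 0.0) for c in row] for row in grid]


# Invented toy data: occupied cells of a 4x4 checkerboard in the unit square.
points = [(i / 4 + 0.125, j / 4 + 0.125)
          for i in range(4) for j in range(4) if (i + j) % 2 == 0]

# Zoomed out to 2x2 pixels, every pixel aggregates the same count: a uniform image.
zoomed_out = aggregate(points, (0, 1), (0, 1), 2, 2)

# Zooming in on one quadrant re-runs the aggregation and the pattern appears.
zoomed_in = aggregate(points, (0, 0.5), (0, 0.5), 2, 2)

print(zoomed_out)        # uniform counts
print(shade(zoomed_in))  # alternating full and empty pixels
```

The design point is that the image is recomputed from the data at each scale rather than being a magnified or shrunken copy of an earlier picture.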
An R example
I am going to share a simple
datashader example here. Again, to see the full effect you would have to copy it into a
Jupyter notebook and run it. But I will use it to show my point.
After going through the steps to install
Jupyter notebook (plus some more
conda install steps to include necessary packages) we can make a plot of the
ggplot2 data example
ggplot2 renderings of
diamonds typically look like the following (and show off the power and convenience of the grammar):
datashader rendering looks like the following:
If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18300 and 3.025 carats we get the following dynamic re-render:
Notice the points shrunk (and didn’t subdivide) and there are some extremely faint points. There is something wrong with that as a presentation; but it isn’t because of
datashader! It is something unexpected in the data which is now jumping out at us.
datashader is shading proportional to aggregated count. So the small point staying very dark (and being so dark it causes other points to render nearly transparent) means there are multiple observations in this tiny neighborhood. Going back to
R we can look directly at the data:
> library("dplyr")
> diamonds %>% filter(carat>=3, carat<=3.05, price>=18200, price<=18400)
# A tibble: 5 × 10
  carat     cut color clarity depth table price     x     y     z
  <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.01 Premium     I     SI2  60.2    59 18242  9.36  9.31  5.62
2  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
3  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
4  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
5  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
There are actually 5 rows with the exact carat and pricing indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The “ugly” presentation was an important warning. This is hands on the data, the quickest path to correct results.
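To see why five coincident rows make one pixel dominate, here is a minimal sketch in plain Python. It assumes a simple linear mapping from count to opacity, which may differ from what datashader actually applies. The five (carat, price) pairs come from the rows printed above. The `neighbors` points and the helper `pixel_counts` are hypothetical, invented only so there is something to compare against.

```python
from collections import Counter

# The five rows printed above, reduced to (carat, price).
rows = [(3.01, 18242)] * 5
# Hypothetical nearby observations, invented for illustration only.
neighbors = [(3.00, 18207), (3.04, 18360), (3.02, 18311)]
points = rows + neighbors


def pixel_counts(points, x_range, y_range, width, height):
    """Count observations per (col, row) pixel in the visible window."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            col = int((x - x0) / (x1 - x0) * width)
            row = int((y - y0) / (y1 - y0) * height)
            counts[(col, row)] += 1
    return counts


# Same window as the filter above, rendered at an assumed 50x50 pixels.
counts = pixel_counts(points, (3.0, 3.05), (18200, 18400), 50, 50)

# Assumed linear shading: opacity proportional to count over the visible maximum.
peak = max(counts.values())
alpha = {pixel: c / peak for pixel, c in counts.items()}

# The same question asked of the data directly: which (carat, price) pairs repeat?
repeats = {pair: c for pair, c in Counter(points).items() if c > 1}

print(sorted(alpha.values()))  # one fully dark pixel, the rest faint
print(repeats)
```

Under this assumed mapping each single observation gets one fifth of the opacity of the repeated point, which matches the faint neighbors seen in the zoomed render.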
For some web browsers, you don’t always see proper scaling, yielding artifacts like the following:
Jupyter notebooks always work, and web browsers usually work (I am assuming it is security or ad-blocking that is causing the effect, not a
datashader brings resolution-dependent per-pixel analytics to production. This is a very powerful style of interaction that is going to appear in more and more places. The Continuum Analytics team has written about this before, and it requires some interesting cross-compiling (Numba) to implement at scale. Now that analysts have seen this in action, they are going to want it and ask for it.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.