
Datashader is a big deal

I recently got back from Strata West 2017 (where I ran a very well-received workshop on R and Spark). One thing that really stood out for me in the exhibition hall was Bokeh plus datashader from Continuum Analytics.

I had the privilege of having Peter Wang himself demonstrate datashader for me and answer a few of my questions.

I am so excited about datashader's capabilities that I will not wait for the functionality to be exposed in R through rbokeh. I am going to leave my usual knitr/rmarkdown world and dust off Jupyter Notebook just to use datashader plotting. This is worth trying, even for diehard R users.

datashader

Every plotting system has two important ends: the grammar, where you specify the plot, and the rendering pipeline, which executes the presentation. Switching plotting systems means switching how you specify plots, which can be unpleasant (this is one of the reasons we wrap our most re-used plots in WVPlots: to decouple how the plots are specified from the results you get). Given the convenience of the ggplot2 grammar, I am always reluctant to move to other plotting systems unless they bring me something big (and even then you sometimes don't have to leave: for example, the absolutely amazing adapter plotly::ggplotly).

Currently, to use datashader you must talk directly to Python and Bokeh (i.e. learn a different language). But what that buys you is massive: in-pixel analytics. Let me clarify that.

datashader makes points and pixels first-class entities in the graphics rendering pipeline. It admits they exist (many plotting systems render to an imaginary infinite-resolution abstract plane) and lets the user specify scale-dependent calculations and re-calculations over them. It is easiest to show by example.

Please take a look at these stills from the datashader US Census example. We can ask pixels to be colored by the majority race in the region of Lake Michigan:

[Figure: datashader US Census rendering; pixels colored by the majority race, Lake Michigan region]

If we were to use the interactive version of this graph we could zoom in on Chicago and the majorities are re-calculated based on the new scale:

[Figure: the same graph zoomed in on Chicago, with majorities re-calculated at the new scale]

What is important to understand is that this is vastly more powerful than zooming in on a low-resolution rendering:

[Figure: zooming in on a low-resolution rendering]

and even more powerful than zooming out on a static high-resolution rendering:

[Figure: zooming out on a static high-resolution rendering]

datashader can redo aggregations and analytics on the fly. It can recompute histograms and renormalize them relative to what is visible to maintain contrast. It can find patterns that emerge as we change scale: think of zooming in on a grey pixel that resolves into a black and white checkerboard.
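
To make that concrete, here is a rough sketch of the kind of datashader calls behind a census-style figure. This is my own reconstruction, not the published example: the data frame, the column names ('easting', 'northing', 'race'), and the color key are all assumptions.

import datashader as ds
import datashader.transfer_functions as tf

# Assumed: df is a pandas DataFrame of census points with projected
# coordinates and a 'race' column of pandas categorical dtype
# (all names here are hypothetical).
color_key = {'w': 'aqua', 'b': 'lime', 'a': 'red', 'h': 'fuchsia'}

def render(df, x_range, y_range):
    # Aggregate: per-pixel counts, broken out by category.
    cvs = ds.Canvas(plot_width=600, plot_height=600,
                    x_range=x_range, y_range=y_range)
    agg = cvs.points(df, 'easting', 'northing', ds.count_cat('race'))
    # Shade: color each pixel by its category mix; 'eq_hist'
    # re-normalizes the visible counts to maintain contrast.
    return tf.shade(agg, color_key=color_key, how='eq_hist')

# Zooming is just re-running the same aggregation over a smaller extent,
# so all the per-pixel statistics are re-computed at the new scale.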

An R example

I am going to share a simple datashader example here. Again, to see the full effect you would have to copy it into a Jupyter notebook and run it. But I will use it to make my point.

After going through the steps to install Anaconda and Jupyter Notebook (plus some more conda install steps to pull in the necessary packages), we can make a plot of the ggplot2 example data diamonds.

ggplot2 renderings of diamonds typically look like the following (and show off the power and convenience of the grammar):

[Figure: ggplot2 rendering of the diamonds data]

A datashader rendering looks like the following:

[Figure: datashader rendering of the diamonds data]
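
For reference, the notebook cell behind a plot like this is short. Here is a minimal sketch, assuming the diamonds data has been exported from R as a CSV (the file name and export step are my additions):

import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

# Assumed export from R: readr::write_csv(ggplot2::diamonds, "diamonds.csv")
diamonds = pd.read_csv("diamonds.csv")

# Aggregate: each pixel holds the count of observations that land in it.
cvs = ds.Canvas(plot_width=600, plot_height=400)
agg = cvs.points(diamonds, 'carat', 'price', ds.count())

# Shade: darkness is a (re-normalized) function of the aggregated count.
tf.shade(agg, cmap=['lightblue', 'darkblue'])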

If we use the interactive rectangle selector to zoom in on the apparently isolated point around $18,300 and 3.025 carats, we get the following dynamic re-render:

[Figure: dynamic re-render after zooming in on the isolated point]

Notice the points shrank (and didn't subdivide), and there are some extremely faint points. There is something wrong with that as a presentation, but it isn't because of datashader! It is something unexpected in the data, which is now jumping out at us.
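
The interactive zoom is doing the equivalent of re-aggregating over just the selected ranges. Roughly (the exact ranges here are my reading of the selection):

import pandas as pd
import datashader as ds
import datashader.transfer_functions as tf

diamonds = pd.read_csv("diamonds.csv")  # as in the previous sketch

# Re-aggregate over only the selected extent: the counts, and therefore
# the shading, are re-computed at the new scale.
zoom = ds.Canvas(plot_width=600, plot_height=400,
                 x_range=(2.95, 3.10), y_range=(18000, 18600))
tf.shade(zoom.points(diamonds, 'carat', 'price', ds.count()))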

datashader is shading proportional to the aggregated count. So the small point staying very dark (and being so dark it causes other points to render nearly transparent) means there are multiple observations in this tiny neighborhood. Going back to R we can look directly at the data:

> library("dplyr")
> diamonds %>% filter(carat>=3, carat<=3.05, price>=18200, price<=18400)

# A tibble: 5 × 10
  carat     cut color clarity depth table price     x     y     z
  <dbl>   <ord> <ord>   <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  3.01 Premium     I     SI2  60.2    59 18242  9.36  9.31  5.62
2  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
3  3.01    Fair     I     SI2  65.8    56 18242  8.99  8.94  5.90
4  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77
5  3.01    Good     I     SI2  63.9    60 18242  9.06  9.01  5.77

There are actually 5 rows with the exact carat and price indicated by the chosen point. The point stood out at fine scale because it indicated something subtle in the data (repetitions) that the analyst may not have known about or expected. The “ugly” presentation was an important warning. This is hands-on work with the data, the quickest path to correct results.

In some web browsers you don't always see proper scaling, yielding artifacts like the following:

[Figure: improper-scaling artifact, large blocky pixels]

The Jupyter notebooks always work, and web browsers usually work (I am assuming it is security settings or ad-blocking causing the effect, not a datashader issue).

Conclusion

datashader brings resolution-dependent per-pixel analytics to production. This is a very powerful style of interaction that is going to appear in more and more places. It is something the Continuum Analytics team has written about before, and it requires some interesting cross-compiling (Numba) to implement at scale. Now that analysts have seen this in action they are going to want it and ask for it.


9 replies

  1. Thanks for your post, very interesting indeed! I was wondering what the (conceptual) differences are with the R package raster: it seems they overlap in functionality.

    1. There is a relation, but I think the capabilities are very different.

      I’m under the impression that raster computes per-cell representations statically (once, not changing based on user zoom) and is mostly used to produce superimposable layers, not as a direct part of an interaction (example here). At one particular zoom you can consider raster cells to be pixels. But overall the cell is an abstract entity largely thought of in Extent units (such as longitude and latitude, not screen coordinates).

      datashader recomputes per-pixel or per-point renderings, changing what is calculated immediately as a function of the user’s zoom level. datashader is aware of the presentation pipeline (here are some notes on that) and creates effects by interacting with the state of the pipeline (the zoom and scroll). There are a lot of interactive examples here.

      As I mention in the article: if you see large pixels upon rescaling it means your web browser is not actually running the datashader portion of the pipeline, just Bokeh. I think this is why there is some confusion as to what datashader does, as not everybody who thinks they have seen datashader running has actually seen it running. If your web browser isn’t showing the effect, it is worth the extra effort to try one of the smaller notebooks directly (though finding all the instructions to download the example data can be a pain; eventually I found this).

  2. We are considering making it simpler to call datashader from R, perhaps using Arrow to share a dataframe between R and Python, and would be happy to work with an R user who is interested in that…

    1. I am definitely interested (not a lot of free time though). I can figure out how to move data to Python (feather if worse comes to worst), and almost how to call Python from R (rPython is Jython I think, and I am not sure how to use rpy2 in this situation); the big blocker would be getting the resulting graph to display in HTML markdown. Even a “hello world” example of this would help me.

      1. I’ve created an issue to track this at https://github.com/bokeh/datashader/issues/304 ; please chime in there with suggestions. I’m not sure what the blocker for the display portion of the pipeline is; that part seems very straightforward (either a bare array or a PNG image, either of which should be easy to plot in R).

      2. Once the connection to datashader itself is set up, presumably the interactive update can be built within R (using ggplot, shiny, or other usual R tools). Definitely not something I have expertise in, though.

  3. Browsers can handle SVG files really well. I haven’t tried to base64-encode into an embedded SVG, but you can zoom indefinitely on vectors. I am probably going to stick with plotly/ggplot for a while. The data scaling on the axes and labels would be nice, but you could also just do this with breakpoints and some light JS.

    1. plotly and ggplotly are really awesome and we are definitely adding them to our workflows in more and more places.

      As for datashader: it may not be obvious, but in addition to infinite zoom datashader can re-calculate the aggregates at different scales. For instance, pixels can be purely colored by the majority category in the pixel (instead of averaging).
