
Smoothing isn’t Always Safe

Introduction

Here is a quick data-scientist / data-analyst question: what is the overall trend or shape in the following noisy data? For our specific example: how does value behave as a noisy function (or relation) of m? This example arose in producing our tutorial “The Nature of Overfitting”.

One would think this would be safe and easy to assess in R using ggplot2::geom_smooth(), but now we are not so sure.

Our Example

Let’s first load our data and characterize it a bit.
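The loading code is not shown in this rendering; a minimal sketch, assuming the sus_shape.csv file from the article’s example repository (linked in the discussion below):

# load the example data: value as a noisy function of m
d <- read.csv(
  'https://raw.githubusercontent.com/WinVector/Examples/main/The_Nature_of_Overfitting/sus_shape.csv')

head(d)

summary(d)

nrow(d)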

##   m      value
## 1 3 -12.968296
## 2 3  -5.522812
## 3 3  -6.893872
## 4 3  -5.522812
## 5 3 -11.338718
## 6 3 -10.208145
##        m              value        
##  Min.   :   3.0   Min.   :-18.773  
##  1st Qu.:  86.0   1st Qu.: -1.304  
##  Median : 195.0   Median : -1.276  
##  Mean   : 288.8   Mean   : -1.508  
##  3rd Qu.: 436.0   3rd Qu.: -1.266  
##  Max.   :1000.0   Max.   : -1.260
## [1] 15545

Now let’s try and look at this data. First we try a scatter plot with a low alpha, which gives us something similar to a density presentation.
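A minimal sketch of such a plot (the alpha value is an illustrative choice):

library(ggplot2)

# low-alpha scatter: overplotting density stands in for a density estimate
ggplot(d, aes(x = m, y = value)) +
  geom_point(alpha = 0.05)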

[Figure: low-alpha scatter plot of value as a function of m]

Each m value has many different value measurements (representing repetitions of a noisy experiment). Frankly, the above is not that legible, so we need tools to summarize it in the region we are interested in (value near -1.25).

Trying Default Smoothing

Let’s run a default smoothing line through this data to try to get the overall relation.
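A sketch of the call, leaving all smoothing arguments at their defaults:

# default geom_smooth(); for this many points ggplot2 picks the gam method
ggplot(d, aes(x = m, y = value)) +
  geom_smooth()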

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

[Figure: default geom_smooth() (gam) curve showing a spurious oscillation]

This graph appears to imply some sort of oscillation or structure in the relation between mean value and m. We are pretty sure there is no such structure, and this is an artifact of the smoothing method. This defect is why we did not use ggplot2::geom_smooth() in our note on training set size.

We did see a warning, but we believe this is just telling us which default values were used, and not indicating that the above pathology was detected.

At this point we are in a pickle. We had theoretical reasons to believe the data follows a monotone increasing trend in m, with mean-zero noise that decreases as m grows. The graph doesn’t look like that. So either our understanding or theory is wrong, or the graph doesn’t faithfully represent the data. The graph had been intended as a very small step in larger work; re-examining the intricacies of this graphing software’s default behavior was not our intended task. We had been doing actual research on the data.

We now have a second problem: is this unexpected structure in our data, or a graphing artifact? The point is: when something appears to work, one can, with some risk, move on quickly; when something appears to fail in a surprising way, one ends up with a lot of additional required investigation. That investigation is the content of this note, like it or not. Also, in some loud R circles one has no choice but to try “the default ggplot2::geom_smooth() graph”, otherwise one is pilloried for “not knowing it.”

We can try switching the smoothing method to see what another smoothing method says. Let’s try loess.
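A sketch of the same call, switching only the method argument:

# same plot, but with the loess smoother instead of the gam default
ggplot(d, aes(x = m, y = value)) +
  geom_smooth(method = 'loess')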

## `geom_smooth()` using formula 'y ~ x'

[Figure: geom_smooth(method = "loess") curve showing a spurious hump]

Now we have a different shape. At most one of these (and in fact neither) is representative of the data. There is, again, a warning. It appears, again, to be a note about the defaults used, and not a detection of the issue at hand.

Looking Again

Let’s try a simple grouped box plot. We will group m into ranges to get more aggregation.
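The exact grouping used is not shown here; one illustrative way to build such ranges with cut():

# hypothetical m-range breaks, chosen only for illustration
d$m_grouped <- cut(d$m, breaks = c(0, 10, 30, 100, 300, 1000))

ggplot(d, aes(x = m_grouped, y = value)) +
  geom_boxplot()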

[Figure: grouped box plot of value by m-range]

For legibility, we repeat these graphs zoomed in to the area under disagreement. We are using coord_cartesian() to zoom in, so as not to change the underlying smoothing calculation.
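A sketch of the zoomed default-smoother version, with the data maximum marked (the ylim values are illustrative choices):

# zoom via coord_cartesian() so the smoother still sees all the data;
# mark the observed maximum of value with a red dashed line
ggplot(d, aes(x = m, y = value)) +
  geom_smooth() +
  geom_hline(yintercept = max(d$value), color = 'red', linetype = 'dashed') +
  coord_cartesian(ylim = c(-1.5, -1.0))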

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

[Figure: zoomed gam smooth, crossing above the data maximum (red dashed line)]

The smooth crossing above -1.0 is very suspicious, as we have max(d$value) = -1.2600449. We have annotated this bound with the horizontal red dashed line.

And the entirety of the loess hump is likewise a plotting artifact, lying completely outside the observed data range.
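The corresponding zoomed loess sketch:

# same zoom, loess smoother
ggplot(d, aes(x = m, y = value)) +
  geom_smooth(method = 'loess') +
  geom_hline(yintercept = max(d$value), color = 'red', linetype = 'dashed') +
  coord_cartesian(ylim = c(-1.5, -1.0))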

## `geom_smooth()` using formula 'y ~ x'

[Figure: zoomed loess smooth, hump entirely outside the observed data range]

The zoomed-in version of the box plot shows the noisy monotone asymptotic shape we expected for the original experiment that produced this data.
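A sketch, reusing the illustrative m-ranges from above:

# zoomed grouped box plot
ggplot(d, aes(x = m_grouped, y = value)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-1.5, -1.0))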

[Figure: zoomed box plot showing the expected noisy monotone asymptotic shape]

The point plot, when zoomed, qualitatively agrees with the boxplot.
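A sketch of the zoomed point plot:

# zoomed low-alpha scatter
ggplot(d, aes(x = m, y = value)) +
  geom_point(alpha = 0.05) +
  coord_cartesian(ylim = c(-1.5, -1.0))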

[Figure: zoomed point plot, qualitatively agreeing with the box plot]

Directly calling loess/lowess

ggplot2 is documented as using loess, which in turn is documented as a newer adaptation of lowess “with different defaults” than lowess. However, the documented exposed controls of the two methods seem fairly disjoint.

That being said, loess (without a ‘w’, as in “Uruguay”) called directly with default arguments shows the same chimeric artifact.
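A sketch of the direct call (default span, family, and degree = 2):

# fit loess with all defaults
fit <- loess(value ~ m, data = d)

# plot the data and the fitted curve, ordered by m
o <- order(d$m)
plot(d$m, d$value, pch = '.')
lines(d$m[o], predict(fit)[o], col = 'red')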

[Figure: direct loess fit with default arguments, showing the same artifact]

Playing with arguments can suppress the artifact, but we still saw weird (but smaller) effects even with the suggested degree = 1 alternate setting.

Directly calling lowess (with a ‘w’, as in “answer”) gives a more reasonable result out of the box.
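A sketch of the direct call:

# lowess with defaults (smoother span f = 2/3); returns x/y sorted by x
lw <- lowess(d$m, d$value)

plot(d$m, d$value, pch = '.')
lines(lw, col = 'red')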

[Figure: direct lowess fit, giving a reasonable result out of the box]

Simple Windowing

Simple methods from fields such as signal processing work well. For example, a simple square-window moving average appears to correctly tell the story. These are the methods I use, at the risk of being told I should have used geom_smooth().
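The package message below is from the article’s own (unshown) chunk, which loads the WVPlots plotting tools (wrapr is a dependency). As a self-contained alternative, here is a minimal base-R sketch of the square-window idea, averaging per-m means over an illustrative window width:

# mean value for each unique m
means <- aggregate(value ~ m, data = d, FUN = mean)

# centered square-window moving average; width 11 is an illustrative choice
w <- 11
means$smooth <- stats::filter(means$value, rep(1 / w, w), sides = 2)

plot(means$m, means$value, pch = '.')
lines(means$m, means$smooth, col = 'red')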

## Loading required package: wrapr

[Figure: square-window moving average summary of value as a function of m]

The fact that the hard window yields a jagged curve gives an indication of the amount of noise in each region of the graph.

Conclusion

Large data sets are inherently illegible. So we rely on summaries and aggregations to examine them. When these fail we may not always be in a position to notice the distortion, and this can lead to problems.

Many of the above default summary presentations were deeply flawed and showed chimerical artifacts not in the data being summarized. Starting a research project to understand the nature of the above humps and oscillations would be fruitless, as they are not in the data, but instead artifacts of the plotting and analysis software.

As a consultant this is disturbing: I end up spending time on debugging the tools, and not on the client’s task.

The above were not flaws in ggplot2 itself, but in the use of the gam and loess smoothers, which are likely introducing the artifacts by trying to enforce certain curvature conditions not in the data. We are essentially looking at something akin to Gibbs’ phenomenon or ringing. This could trip up the data scientist or the data analyst without a background in signal analysis.

This sort of problem reveals the lie in the typical “data scientist >> statistician >> data analyst” or “statistics are always correct in R, and never correct in Python” snobberies. In fact a data analyst would get the summary shapes right, as presentation of this sort is one of their specialties.

Categories: Pragmatic Data Science Tutorials


jmount

Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.

10 replies

  1. Your zoomed point plot provides important information about the data. Perhaps if you had looked at that first, you would not have used these methods and had these problems. Graphical Data Analysis should come first (well, I would say that :-)).

    Are the data real and available?

    1. I did look at more than one graph, under the idea that one look is not always enough. The WVPlots::ConditionalSmoothedScatterPlot shown at the end is my go-to for this sort of situation, as it is a good compromise between legibility and faithfulness. In fact it is the summary graph family I use in the original article. This follow-up note is the set of visualizations I explored to confirm which visualization was legible and respected what was in the data.

      The data is real and can be found here: https://github.com/WinVector/Examples/tree/main/The_Nature_of_Overfitting

      1. Thanks for the reply. If I understand the webpage correctly, the data are output from analyses of a fake dataset, so they are not really real, are they?

      2. The data are measurements of the performance of an actual logistic regression on a synthetic data set as a function of synthetic data set size. So they are actual measurements of a simulation experiment with respect to a specific generative model.

  2. I think geom_smooth() with appropriate parameters is good enough. In this case gam method with bs = ‘ad’.

    sus_shape %>%
      ggplot(aes(m, value)) +
      geom_point(colour = "#3498DB", alpha = 0.5) +
      geom_smooth(method = 'gam', formula = y ~ s(x, bs = "ad"), se = FALSE, size = 2) +
      coord_cartesian(ylim = c(-1.5, -1.20))

    1. Thanks, the 'ad' spline choice (adaptive smoother from help(smooth.terms)) does indeed do a nice job (much better than the default 'cs' for this example).

      1. Rather than using the massive sledgehammer of a locally adaptive smooth, we need to consider that a GAM is a (semi) parametric model with distributional assumptions. It is decidedly not a scatter plot smoother kind of method. I can’t see how your data meet the default assumption that the data are conditionally distributed Gaussian with constant variance.

        This is a personal bugbear of mine with the use of `gam()` by `geom_smooth()`, as it is encouraging poor statistical practice.

      2. I agree. The defaults should be good practice that represents the data, not some model used in an unsafe manner.

  3. I have found supsmu() (supplied with R in the stats package) to be the best function for automated smoothing.

    d <- read.csv( 'https://raw.githubusercontent.com/WinVector/Examples/main/The_Nature_of_Overfitting/sus_shape.csv' )
    ## stringsAsFactors is not needed, since there is no character data
    
    head(d)
    
    with(d, plot(m, value, pch='.'))
    sm <- with(d, supsmu(m, value) )
    lines(sm, col='red')
    
    ## "zoom" in
    xrng <- c(min(d$m), 100)
    
    with(d, plot(m, value, xlim=xrng, pch='.') )
    lines(sm, col='red')
    
    head(unique(d$m))
    # [1] 3 4 5 6 7 8
    
    
    For a graphical presentation, I might do something like this (similar to the boxplots, but doesn't group the d$m values)
    
    ## get the range of values for each unique value of d$m
    tmp <- split(d$value, d$m)
    vrngs <- lapply(tmp, range)
    vys <- do.call(rbind, vrngs)
    vxs <- as.numeric(names(vrngs))
    
    ## plot the ranges (zoomed)
    with(d, plot(m, value, xlim=xrng, type='n') )
    segments(vxs, vys[,1] , vxs , vys[,2], col='gray')
    lines(sm, col='red')
    
    ## plot the ranges (not zoomed)
    with(d, plot(m, value, type='n') )
    segments(vxs, vys[,1] , vxs , vys[,2], col='gray')
    lines(sm, col='red')
    
    It also might be better to put a smoothed line through the medians of each d$m
    But even without a smoothed line, plotting the medians reveals something that might be meaningful.
    
    ## medians (zoomed)
    vmeds <- sapply(tmp, median)
    
    with(d, plot(m, value, xlim=xrng, pch='.') )
    points(vxs, vmeds, pch=18)
    