I’ll admit it: I have been wrong about statistics. However, that isn’t what this article is about. This article is less about some of the statistical mistakes I have made, as a mere working data scientist, and more of a rant about the hectoring tone of corrections from some statisticians (both when I have been right and when I have been wrong).
Used wrong (image Justin Baeder, some rights reserved).
I know, an unapologetic “apologia” opens me and my work to more criticism (or “two for flinching”). But this is what it can be like for a non-statistician to work with or in front of some statisticians (in fact many statisticians, but none of the big ones). I suspect, from speaking with other data scientists, that it is not just me.
I apologize now for displaying a thin skin.
Background
I have made errors when speaking and writing about statistics. This shouldn’t come as a shock as there isn’t anybody who hasn’t made errors. I have no doubt made more errors in writing this article.
But what has been a bit disturbing is that many times, instead of being merely corrected (when I am wrong) or asked for clarification (when I may be right), I am publicly accused of being stupid, ignorant, or willfully disseminating falsehoods. I find this disappointing (and I come from a field, theoretical computer science, where if people get excited during your whiteboard presentation they may grab the marker out of your hand to contribute). It would seem graduate schools are not finishing schools.
I’ll admit, I am not a statistician. Some statisticians take that to mean I am ignorant and uneducated in issues of probability. In fact I am very interested in the theory of probability, and came to probability through a fairly long path. I was trained in mathematics and theoretical computer science before becoming a data science carpetbagger. Probability-relevant topics I have studied include:
- Integral calculus (including the Riemann, Darboux, and Lebesgue integrals).
- Real and complex analysis.
- Measure theory (σ-algebras, Lebesgue measure, Borel measures and so-on).
- Transform theory (convolutions, Fourier transform, Laplace transform, Z transform).
- Distribution theory (generalized distributions as maps from functions to numbers, allowing things like a rigorous non-limit based treatment of the Dirac Delta Function).
- Kolmogorov axiomatic probability (random variables as measurable functions from probability spaces).
- Geometric probability, volumes, and mixed volumes.
- Markov chains, Martingales, heat/diffusion equations, and stochastic calculus.
- The probabilistic method (linearity of expectation, the Poisson paradigm, pseudorandomness, and more).
- Concentration inequalities (Chernoff bound, Hoeffding’s inequality, Talagrand’s concentration inequality, Lovász local lemma, …).
- The history of concepts of probability (von Mises collectives, de Finetti exchangeability).
- Information theory (entropy, asymptotic equipartition, Kolmogorov–Chaitin complexity, Cramer-Rao inequality, …).
- Betting systems (Kelly criterion, small Martingale, and A/B testing).
I am not trying to say I know a lot. What I am trying to say is: I probably know enough to understand the basis of a statistical concept. If you try to correct me politely, I may be capable of understanding your point and learning from you. My terminology may differ from statistical canon, as I may have learned a common concept in a different field (where it may in fact have a different common name).
My sins
I am going to confess a few of my sins. And as I am writing to a technical audience, I will allow myself some technical examples.
I wrote sum_{i=1...n} (x[i]-mean(x))^2/n for variance
I wrote sum_{i=1...n} (x[i]-mean(x))^2/n for variance. Well, not quite. It was in R, and I wrote (on page 155 of Practical Data Science with R):
d <- data.frame(y=c(2,2,3,3,3))
m <- lm(y~1,data=d)                     # fit the null (intercept only) model
df <- nrow(d)-length(coefficients(m))   # degrees of freedom: rows minus number of coefficients
sqrt(sum(residuals(m)^2)/df)            # the "Residual standard error"
Superficially, there is no “-1” in there. The reason I wrote this code is that I was trying to teach the exact calculation needed to reproduce the “Residual standard error” line found in R’s summary(m) (one of the goals of Practical Data Science with R was to de-mystify summary.lm() by documenting and showing how to calculate every element of summary.lm()).
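As a quick check (a sketch of my own, not from the book; summary.lm() stores this value in its sigma component), continuing from the code above:

sqrt(sum(residuals(m)^2)/df)   # 0.5477226
summary(m)$sigma               # 0.5477226, the reported "Residual standard error"

The two values agree exactly, which is the point of the exercise.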
Now there are a few points in my defense.
- I am trying to reproduce standard calculations. So I don’t get to choose if I divide by n or n-1. To explain summary(m) I would have to pick the one that matches the existing summary(m).
- If you look at this for a moment you will notice that if n is the number of rows in our data frame, then the quantity I am dividing by is in fact n-1 (as we have length(coefficients(m))==1). I am guessing those correcting me divide by n-1 because they have been told “population variance: divide by n; sample variance: divide by n-1” and refuse to accept that Bessel’s correction can be arrived at as an attempt to correct for the number of modeling parameters.
Or, on page 171 (actually by my co-author): “Null model has (number of data points – 1) degrees of freedom.” This is from code reproducing the elements of summary.glm() (actually my co-author went a bit further to include the chi-squared statistic, which is one standard significance-of-fit statistic for summary.glm(), which doesn’t supply any such statistic in the default implementation).
Yet, both of us have received feedback on these sections saying that we have no idea how to compute a sample variance.
Now I am not saying we are perfect. There may (or may not) be places we have written “S/n” instead of “S/(n-1)”. What I am saying is: even this wouldn’t indicate we didn’t understand the distinction. But when corrected I am never accused of mere carelessness (which I would in fact like to apologize for), but of willful ignorance. The aggressive correction is almost always “clearly you don’t know you need to divide by n-1” instead of “wouldn’t dividing by n-1 be better?”
In practice things are not as simple as always mechanically applying the one true method. We are always searching for better estimates (less bias, lower variance, weaker assumptions, or easier calculation). I guess if you are used to teaching statistics to out-of-major beginning students you develop a strong binary and didactic feeling of right (the procedure you actually taught in the course) and wrong (anything else the students write, often in fact gibberish). I am not trying to be overly inclusive or relativistic here; it is just that correcting people is a bit harder than one would think (so it pays to be polite when attempting it).
For example: is Sn = sum_{i=1...n} (x[i]-mean(x))^2/n in fact a wrong estimate of variance? It is clearly good enough as n gets large. I would also argue it is not in fact wrong. Oh, it is biased (tends to be too low on average), and Sn1 = sum_{i=1...n} (x[i]-mean(x))^2/(n-1) is unbiased. But bias is not the only concern we might be trading off. We know Sn is a maximum likelihood estimate (desirable for its own reasons) and also a lower variance estimate (or more statistically efficient) than Sn1.
[ By Cochran’s theorem we know that Sn is distributed S * chi-sq(n-1)/n and Sn1 is distributed S * chi-sq(n-1)/(n-1) (where S is the unknown true variance). But the chi-sq(n-1) distribution has mean n-1 and variance 2n-2. So Sn1-S has mean 0 and variance 2 S^2/(n-1), and Sn-S has mean -S/n and variance (2n-2) S^2/n^2. Thus in moving from Sn to Sn1 we removed S/n units of bias in exchange for roughly 4 S^2/n^2 units of additional variance (in fact Sn has the smaller mean squared error). For large enough n this is not an obvious good trade. ]
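A small simulation (my own sketch, assuming normally distributed data, not from the book) makes the trade concrete:

set.seed(2014)
n <- 10
trueVar <- 1                           # S, the true variance
ests <- replicate(100000, {
  x <- rnorm(n, sd=sqrt(trueVar))
  c(Sn = sum((x-mean(x))^2)/n,
    Sn1 = sum((x-mean(x))^2)/(n-1))
})
rowMeans(ests)                         # Sn is biased low (near 0.9), Sn1 is unbiased (near 1)
apply(ests, 1, var)                    # but Sn has the lower variance
rowMeans((ests - trueVar)^2)           # and (here) the lower mean squared error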
I used the word “heteroscedastic” wrong.
In writing about linear regression I used the word “heteroscedastic” wrong. The issue is that errors are heteroscedastic “if there are sub-populations that have different variabilities from others” (Wikipedia heteroscedasticity). Based on my readings of the history of probability (the problems von Mises ran into with controlling selections of sub-populations, and the importance of exchangeability to de Finetti’s formulation of probability) I would propose altering the definition to: “computationally identifiable sub-populations (either through explicit or omitted variables).”
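For concreteness, here is a minimal sketch (my own made-up data, not from the book) matching the standard definition: an identifiable sub-population (large x) whose errors are more variable than another (small x):

set.seed(2014)
x <- runif(200)
y <- x + rnorm(200, sd=0.1 + x)   # error standard deviation grows with x: heteroscedastic
plot(x, y)                        # the scatter visibly widens as x increases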
However, I (in error) wrote in a footnote: “Heteroscedastic errors are errors whose magnitude is correlated with the quantity to be predicted.”
There are in fact at least four things wrong with what I wrote.
- If we assume a generative model y[i] = b.x[i] + e[i] then of course the error terms are correlated with the y s, as they are in there! What I was thinking is: we have a problem if the error e is correlated with the signal or systematic portion of y: b.x or the x.
- I should have written “unsigned magnitude.” I am a computer scientist; to me magnitudes are the unsigned portion of a numeric representation (as in “sign and magnitude”). However, magnitude is also used for logarithmic scales of things like power, and in such senses can be negative (such as negative earthquake magnitudes).
- I should have written “depends upon” or “varies with” instead of “correlated.” I am fully aware we can have an unsigned magnitude that is a non-constant function of a numeric variable, but not correlated with it. For example with the (x,y) pairs {(-1,2),(0,1),(1,2)} we can write y as a function of x but there is no linear correlation (see the quick check after this list).
- I need to spend some time explaining we are talking about expected errors/variabilities, not mere instantiated empirical errors (observed or unobserved).
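The promised quick check (a tiny sketch of the example above):

x <- c(-1, 0, 1)
y <- c(2, 1, 2)    # y = x^2 + 1, a deterministic function of x
cor(x, y)          # 0: no linear correlation at all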
I truly wish I could condense good statistical advice down as efficiently as problems pile up (when you don’t have space for a lot of caveats and worked examples to indicate intent and meaning). I’ve tried to address this in errata, but still regret my error. I also regret bringing the issue up at all, given I didn’t have enough space to lay out enough context and caveats. Really what I wanted to discuss is when it is safe to apply linear regression (the topic of our next section).
I assumed normality in linear regression.
I said you need to check that the residuals are normal when using linear regression (and hopefully I never accidentally wrote you need the dependent or independent variables to be normal, as that is way too restrictive). This is open to criticism.
However there is a bit of gamesmanship going on here. You can pretty much criticize any position one takes on normality of errors. It goes like this:
- If the writer claims you need normal errors, you can point out that this is not one of the requirements of the Gauss-Markov Theorem (the big-boy theorem in linear regression since at least 1809, though Gauss used a Bayesian argument based on normal distributions). More important are model structure, uniformly bounded variances, and homoscedasticity.
- If the writer doesn’t claim you need normal errors, you can point out that it is standard practice to check normality of errors through the q-q-plot (so it is negligent not to worry about this). You can further ask how the writer expects to apply coefficient t-tests, and goodness of fit F-tests without strong distributional assumptions. You can also find references that claim you do need normal errors and even normal variables (which is not in fact necessary).
The actual case is that there is a lot of disagreement on what the convenient assumptions needed for reliable linear regression are (see here, here, here, and here). You can find well-regarded statisticians on just about any side of this, depending on the context (are they modeling, teaching, or proving theorems). Just don’t have an opinion either way if you are a non-statistician.
The minimal assumptions do not in fact seem to include normality of the residuals, the x s, or the y s. However, minimal may not be the same as convenient in all contexts. For example: in teaching you might invoke normality to give a concrete property and simpler special-case proofs (which then go on to be mis-quoted out of context as general-case necessities). Or: if for domain-specific reasons you know the errors should be normal (such as expected measurement errors when estimating the orbits of celestial bodies) then it pays to check if the residuals are normal (as, if the residuals are not normal, they must include things other than the known-to-be-normal errors).
There is also the issue of what you mean by “linear regression.” If you are just going to fit the model and use it without looking at the diagnostics, you may need fewer assumptions than somebody who (wisely) decides to look at the diagnostics. What additional assumptions you need depends on what tests you run. For example, if you run an F-test on the goodness of fit you are then “sensitive to non-normality”, but this is non-normality of the sampling distribution (which can itself be normal even with non-normal residuals under fairly mild assumptions, if you have enough data). However, some derivations of the F-test assume normality of errors (the strong assumption we would rather not have to make; for example result 7.2.1 on page 220 here). Similarly the t-tests on coefficient significance have some distributional assumptions, which can either be satisfied by strong assumptions on the residuals, or by weak assumptions on the residuals and a lot of data (making the sampling distribution well behaved).
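Here is a quick simulation sketch of the “a lot of data” point (my own example, assuming skewed exponential errors): the sampling distribution of a fitted coefficient looks close to normal even though the residuals are decidedly non-normal.

set.seed(2014)
n <- 1000
coefs <- replicate(2000, {
  x <- rnorm(n)
  y <- 3*x + (rexp(n) - 1)          # mean-zero, strongly skewed errors
  coefficients(lm(y ~ x))[["x"]]
})
hist(coefs, breaks=50)              # roughly normal, centered near the true coefficient 3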
Frankly, a reason it is hard to state reasonable assumptions for regression is that there doesn’t seem to be a community-agreed-upon good primary source for what are considered the strongest derivations (hence with weakest assumptions) of the F-test (and for the chi-square or log-likelihood tests for logistic regression). A clear sign of what the community considers the best proof method would be a great boon (for instance in mathematics collecting proofs by style is considered very important, as it hints at what holds as we vary assumptions and domain; for example Proofs of Fermat’s little theorem). Most sources either neglect the F-test, merely use the F-test, derive it using strong assumptions of normality, derive it in a non-regression setting (single variable contingency tables, or single variable ANOVA), or derive it only for single regression (which doesn’t show how the number of parameters adjusts the degrees of freedom). With the right reference, identifying the right conditions is a simple matter of proof analysis. I have found derivations of the F-test (including its ANOVA history), but if they are all considered “bad” any conclusions derived from them would be hard to defend. So I’ll say it: I don’t know a good reference for the derivation of the F-test (especially one that answers whether residual normality is considered a standard requirement and one that deals directly with multiple regression), and I would appreciate a recommendation.
Our commitment
We are going to keep writing and keep teaching on statistical topics. We choose and arrange topics to reflect what we have found to be the most important issues in practice. We do try to make everything correct, but the order in which we address things is determined by historic project impact and not traditional teaching order. Just because we haven’t addressed a topic yet doesn’t mean we don’t know about it; it may just be that something more urgent cut in front. Because of this we feel our work (book, course, blog) is among the best ways to build data science mastery.
Conclusion
In 2009 Hal Varian, chief economist at Google, famously said:
“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.”
(NYT 2009).
There is some concern that computer science style data science is stealing some of the thunder. Mostly I agree with Professor Matloff: statistics has a lot to offer that is needlessly being missed. Though I also wonder if we are both encroaching on a field that has been historically owned by operations research.
I myself am considered a relatively statistically friendly data scientist who often criticizes data science in statistical terms (for example my talk at the recent Data Science Summit: “Statistics in the age of data science, issues you can and can not ignore.”). I repeat: for all my faults, I am one of the data scientists who has a friendlier (better than, say, median) relationship with statisticians.
That being said, I’ll weigh in on the “why is no statistical vision triumphant?” question. Data science is rising not just because of data engineering. It is also rising because the business environment is largely collaborative and requires effective consulting skills and behaviors. Collaborating in a positive way with others (valuing their work, even while improving their working style) is decisive. Live and teach that and you win. Yes, I am saying software engineering is somewhat socialized (it has long dealt with issues of training, project management, agile practices, working with uncertainty, recording history, and remote collaboration). I am also saying: for any of us (including myself) self-improvement and collaboration are always quicker ways to get ahead than trying to pull others down (there are simply too many others for tearing people down to be an effective strategy).
I want more statisticians on my data science teams. We can even call it a statistical applications team, as long as that doesn’t get me kicked off the team.
“I was wrong” Sisters of Mercy
So I was wrong I was wrong to ever doubt I can get along without
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Ignore Matloff and listen to Tukey: “For a long time I thought I was a statistician …” (The Future of Data Analysis). Even Tukey didn’t want that label anymore, and that was in 1961. The trouble with statistics predates data science and, by and large, even computer science. Mandatory reading for the data science deniers. The short of it is a confusion of the mathematical aspects with the data-analysis practice, and an unjustified primacy of the former over the latter.
Well, I am actually a big Matloff fan; I always learn something when I read him or speak with him. So I assume you mean “ignore the Matloff essay.” That being said, I had read some Tukey, but I looked up the essay and the first couple of paragraphs are fairly cutting (as you implied). I did not remember that. Thanks for the reference.
I sympathise…
Any time you get some kind of crossover between two disciplines, there is the potential for conflict over the perception concerning the field of interest. No field seems to be immune to this, and it has happened for as long as I have been alive (and a whole lot longer than that, I shall wager…).
So, a statistician might feel aggrieved concerning the way one person articulates a statistical argument. A software writer might be appalled at the way a statistician writes some piece of code. The issue isn’t discipline specific – it’s a reflection of the behaviour of the individuals concerned.
We live in times when it has been argued that all the easy problems are solved (Feldman had much to say on this). Only the hard problems remain. And those problems will not be solved by mathematicians, statisticians, computer scientists, data scientists, economists, or (insert whatever iconic discipline you wish). The problems will most likely be solved by an unholy alliance between all these professions, and a whole lot more besides. You need a special way of working if you want to work in this kind of problem solving environment, and the grumpy statisticians you’ve encountered probably will not survive.
I work in a heavily inter-disciplinary environment, so this rings pretty familiar to me. Most of the time my colleagues in the disciplines are pretty tolerant of my infelicities, but perhaps that’s because we’ve learnt to accommodate each other’s weaknesses, and we’ve deftly side-stepped those who have been less helpful to our cause. Over a few decades, we’ve survived, and the curmudgeons have not. Perhaps there’s a lesson there.
I dislike the current debate on “data science” anyway, mainly because I really do not have a clue what that means, even though I do “data science” every day. I know about this problem or that problem, but the general term is used in a manner that is too fuzzy and ill-defined in practice to be of practical value.
So keep writing. Make mistakes. Mistakes are part of being human. Learning from mistakes is the key, and there will always be someone who will help. Just ignore those who hinder.
Oh, and I’m a statistician, from a mathematical background, who used to be an engineer.
Thank you! My first job (after computer science graduate school and a mathematics post-doc at MSRI) was in biotech, working with chemists and biologists. It was cross-discipline, but I guess the disciplines were far enough away that it was easier for both sides to see them as complementary (or at least not competing).
I am a statistician. And I agree with you. Statisticians tend to be nit-picky and defensive. We need to be more open to others. I would like to think that the people who have criticized you are not representative of my field, but I am not sure. And any statistician who hassles you about whether to divide by n or (n-1) is really missing the big picture.
Larry Wasserman
Thanks, I actually have gotten good support from actual statistics professors (such as yourself; as I said in the rant, “none of the big ones [statisticians]” are the problem). Looking at my shelf, it looks like one of my mistakes is that I haven’t yet purchased “All of Statistics.” I do have (and appreciate) “All of Nonparametric Statistics.”
Also, Professor Matloff sent me a kind note reminding me that the same tyros who worry about the “n-1” seem fine with standard deviation (even though you can’t bias-correct it in general, and the correction even for the normal is a mess) and don’t seem to insist on the same corrections if you ask about proportions instead of counts.
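(A quick sketch of the standard deviation point, my own example: R’s sd() already divides by n-1, yet it is still biased low as an estimate of the standard deviation.)

set.seed(2014)
sds <- replicate(100000, sd(rnorm(5)))   # sd() uses the n-1 divisor
mean(sds)                                # about 0.94, noticeably below the true value of 1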
Some people just have no idea of how to help themselves. I’m speaking about (certain) statisticians in this case. Allow me to give a bit of personal history, and present my view of what is going on. Then, after hopefully not a huge amount of writing, I’ll get to the above point.
I was trained (or at least I went to school to train) as a (pure) mathematician, then switched to statistics in late undergrad when the level of abstraction became somewhat annoying. (I’m not against pure mathematics – my background in it has been incredibly useful – but I, personally, wanted something that I could sink my teeth into more readily.) Then, although I didn’t like the way computer science classes were taught (except the theoretical ones), and likely because I’d been coding for fun since I was a wee lad, I found my niche in computational statistics.
Fast-forward to the end of my statistics PhD (in time series, but with a lot of consulting, rebelling against time series, and an active interest in machine learning): titles were shifting such that “statistician” was losing ground to “data scientist.” However, this was not too long (3 years or so) after statistics was declared the sexiest job of the century.
Why do I think this happened? Well, there are lots of reasons. One is that statisticians used to (and often still do) see themselves as frequentist or Bayesian. Most people looking at this site will probably know the difference, so apologies to those who don’t… it’s a long history, but you can Google it to your heart’s content. Now, in the recent ML literature, there’s slightly more bias towards Bayesian statistics, because it has potential to tell you more. The division between statisticians on one side vs. the other can still be quite vicious. In the ML literature, it’s not nearly such a problem. Of course, frequentist and Bayesian philosophies are different, but there are things one side can do that the other has difficulty with, and vice versa. What is not (often) appreciated is that both camps have perfectly valid arguments, based on their axioms. Machine learning researchers often have no problem swapping axioms, but statisticians… well, there is still a large divide. Sure, you can’t say, for example, certain things with frequentism that you can with Bayesianism, but that doesn’t mean that frequentist techniques aren’t useful. They can be incredibly so.
Mostly, though, statisticians dropped the ball. They mishandled things. Partly because the only statistics courses that aren’t taught badly are the hard-core ones (that is, the ones for statisticians). Oh, I’m sure there are statistics-for-the-sciences courses taught well, but they’re few and far between. I tutored a lot in grad school. I know.
It was also partly because of the deluge of data that suddenly came from everywhere. Statisticians on the whole were caught unawares, and computer scientists, who naturally had more to do with databases than statisticians did, picked up the slack. However, it was when the computer scientists (and others) doing this reinvented the wheel and called what they were doing new that statisticians became slightly… annoyed. (I’ve seen ordinary linear regression called a machine learning technique. I love machine learning techniques, and yes, technically it is one. But it has been a part of statistics for longer.)
More annoying to statisticians (myself among them) were the sometimes ludicrous claims that statistics was becoming unnecessary. This also led to a certain amount of fear, as some of those claims, while ludicrous, were accepted not only by the general public (whom some, but certainly not all, statisticians seem to hold in disdain), but also in industry and even academia.
The truth is, much of what is called data science is based in statistics. Unfortunately, due to some re-inventing of wheels, erroneous claims, and the like, this is often not recognized. And, perhaps understandably, this has caused a great deal of fear and loathing for the term “data science” from many statisticians.
This is likely why (at least some) statisticians are attacking data science, which is probably the wrong move. I call myself a statistician and a data scientist. I know enough about computation, programming, business, etc., that I can deem myself worthy of the latter title, especially since I have a rather large amount of statistical knowledge. (The other unfortunate thing about the title data scientist is that it seemed to spring from the ether… which means there is no one acceptable definition of the term. This is getting long, so I won’t gripe about that.) Statisticians, in my opinion, anyway, should embrace the term data science, and, for the good of all, make sure there is enough solid statistics in the field so there aren’t critical applications failing because said statistics is absent.
So… I think those who attacked, well, they don’t have their priorities right. They’re so firm for the word statistics and against the name data science that they don’t see they could make a lasting impression by building up data science and supporting it.
One last thing. For the sample variance, you were both right and wrong for having division by n instead of n – 1. The maximum likelihood estimator of the variance divides by n. However, it’s biased. The -1 is in the denominator simply because (most) applications work better when the sample variance is unbiased. However, there are cases when having a biased estimator improves prediction.
Those opposing kinds of discussions between data science, research, and statistics mystify me. Especially with data science. It’s all the same: scientists form a question, collect data, and design the research and statistical methods to find solid conclusions and answers.
Robert, I totally agree. And yours is one definition of data science. It’s one of the few that works. Unfortunately, I have seen advocates of “data science” basically say if you throw enough data at some algorithm the result should be correct. Never mind that they are often making wild assumptions that, usually, they don’t know they are making.
However, I think data science, as a whole, may be starting to become more of a coherent practice, at least I hope. I must admit that I am seeing fewer grand claims with absolutely no theory or scientific principles to back them up. Then again, I could be looking in the wrong places.