Just a “heads-up.”
I’ve been editing a three-part series Nina Zumel is writing on some of the pitfalls of improperly applied principal components analysis/regression and how to avoid them (we are using the plural spelling “components,” following Everitt’s The Cambridge Dictionary of Statistics). The series is looking absolutely fantastic, and I think it will really help people understand, properly use, and even teach the concepts.
The series includes fully worked graphical examples in R; in fact, it is the reason we added the ScatterHistN plot to WVPlots (plot shown below, explained in the upcoming series).
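For the curious, here is a minimal sketch of how such a plot can be produced. The data are synthetic (not from the series), and I am assuming ScatterHistN takes a data frame, x and y column names, a numeric shading column, and a title; please check the WVPlots documentation for the current signature.

```r
# install.packages("WVPlots")   # if not already installed
library(WVPlots)

# synthetic example data, purely illustrative
set.seed(2016)
d <- data.frame(x = rnorm(100))
d$y <- d$x + rnorm(100)
d$z <- d$x + d$y   # numeric column used for shading

# scatter plot of y versus x with marginal histograms, shaded by z
ScatterHistN(d, "x", "y", "z", title = "Example ScatterHistN plot")
```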
Frankly the material would have worked great as an additional chapter for Practical Data Science with R (but instead everybody is going to get it for free).
Please watch here for the series. Update: the complete series is now up:
- Principal Components Regression, Pt. 1: The Standard Method
- Principal Components Regression, Pt. 2: Y-Aware Methods
- Principal Components Regression, Pt. 3: Picking the Number of Components
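As a taste of Part 1, here is a minimal sketch (my own illustration, not code from the series) of the standard, y-ignoring principal components regression workflow in base R; the synthetic data and the choice of two components are assumptions made for the example.

```r
set.seed(2016)
n <- 100
signal <- matrix(rnorm(2 * n), ncol = 2)   # two variables that actually drive y
noise  <- matrix(rnorm(8 * n), ncol = 8)   # eight pure-noise variables
x <- cbind(signal, noise)
y <- as.numeric(signal %*% c(2, -1)) + rnorm(n)

# standard ("x-only") principal components: center and scale x, then rotate
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# keep the first k component scores and regress y on them
k <- 2   # illustrative choice; picking k is the subject of Part 3
d <- as.data.frame(pca$x[, 1:k, drop = FALSE])
d$y <- y
model <- lm(y ~ ., data = d)
summary(model)
```

Note the decomposition above never consults y, so the components of largest x-variance are not necessarily the ones most predictive of y; that is the kind of pitfall the y-aware methods of Part 2 are designed to address.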
jmount
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
I looked in your book and downloaded free chapter 8. On page 40, about clustering and cosine similarity, you say the angle is between 0 and 90 degrees; the correct range is 0 to 180 degrees. A small nitpick, but your readers are keen on correctness.
Readers definitely deserve a good book, and I appreciate that it looks mean to attempt to “correct an attempted correction.” But I’d like to try to explain and clarify.
I really do not think the book said what you have repeated back. The book said “assuming bounded angles,” not that angles are always bounded (and text analysis is a common analysis domain where there is good reason to assume bounded angles, a point I wish we had had space to expand on).
We are happy to take criticism, and we maintain a free errata page for our readers (in addition to distributing free chapters, all code and data, and even some videos). I am replying to your comment here, as this is one of the few places where we can try to correct such misunderstandings. Please don’t take this as bullying on my part; I’d really like us to end up friends on this issue. Perhaps this could have been clearer in the book, and I apologize for any trouble it has caused, but here is what I find when I search the book.
Chapter 8 is available as a free download and is 40 pages long, so there is no discussion of cosine on its page 40 (the free sample of Chapter 8 is available here: https://www.manning.com/books/practical-data-science-with-r ). In the printed book, Chapter 8 runs from page 202 through 237, and the word “cosine” appears only on pages 203, 205, 261, 264, 265, 405, 406, 407, and 415. Page 205 has the most relevant description:
The above was stated in the context of a “text analysis” example, where the input vectors are commonly collections of indicators, frequencies, co-occurrences, and rates, and thus non-negative. This is why the section explicitly assumes the angle between vectors is therefore bounded between 0 and 90 degrees, and why the stated conversion “1 - 2*acos(cossim(x,y))/pi” is the one used. Yes, in a general signed context one would instead use the conversion “1 - acos(cossim(x,y))/pi” (see https://en.wikipedia.org/wiki/Cosine_similarity ).
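To make the two conversions concrete, here is a small R sketch (my own illustration, not code from the book) contrasting the text-style conversion, valid for non-negative vectors, with the general signed conversion:

```r
# cosine similarity of two vectors
cossim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# non-negative vectors (e.g., word frequencies): the angle lies in
# [0, 90] degrees, so 2*acos(.)/pi rescales it onto [0, 1]
textSim <- function(x, y) 1 - 2 * acos(cossim(x, y)) / pi

# general signed vectors: the angle lies in [0, 180] degrees, so the
# factor of 2 is dropped
generalSim <- function(x, y) 1 - acos(cossim(x, y)) / pi

u <- c(1, 2, 0)   # frequency-like, non-negative vectors
v <- c(2, 1, 1)
textSim(u, v)     # about 0.52; stays in [0, 1] as the angle is at most 90 degrees

a <- c(1, 0)      # signed vectors can be up to 180 degrees apart
b <- c(-1, 0)
generalSim(a, b)  # 0: exactly opposite directions
```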
So the correction I would make is to emphasize the assumed non-negative nature of the vectors in the text problem. The stated conversion is the one often used in the text domain, so it is in fact worth bringing up. I will add some clarification to the errata ( http://winvector.github.io/PDSwR/PracticalDataScienceWithRErrata.html ).
Sorry that came out long; it is really hard to be understandable, correct, and concise, and to set context, all at the same time.