I was flipping through my copy of William Cleveland’s The Elements of Graphing Data the other day; it’s a book worth revisiting. I’ve always liked Cleveland’s approach to visualization as statistical analysis. His quest to ground visualization principles in the context of human visual cognition (he called it “graphical perception”) […]
Estimated reading time: 17 minutes
I know that “officially” data scientists always work in “big data” environments, with data in a remote database, streaming store, or key-value system. But in day-to-day work, Excel files and Excel export files get used a lot and cause a disproportionate amount of pain. I would like to […]
Estimated reading time: 11 minutes
We have added a worked example to the README of our experimental logistic regression code. The Logistic codebase is designed to support experimentation on variations of logistic regression including: A pure Java implementation (thus directly usable in Java server environments). A simple multinomial implementation (that allows more than two possible […]
Estimated reading time: 11 minutes
It’s often the case that I want to write an R script that loops over multiple datasets, or different subsets of a large dataset, running the same procedure over them: generating plots, or fitting a model, perhaps. I set the script running and turn to another task, only to come […]
Estimated reading time: 12 minutes
I am going to come out and say it: I am emotionally done with 32-bit machines and operating systems. My sympathy for them is at an end. I know that ARM is still 32-bit, but in that case you get something big back in exchange: the ability to deploy […]
Estimated reading time: 5 minutes
A constant problem for computer science (since its inception) is how to manipulate data that is larger than machine memory. We present here some general strategies for working “out of core” or what you should do when you run out of memory. Early computers were most limited by their paltry […]
Estimated reading time: 15 minutes
We extend the ideas from Automatic Differentiation with Scala to include reverse accumulation. Reverse accumulation is a non-obvious improvement to automatic differentiation that can in many cases vastly speed up calculations of gradients.
Estimated reading time: 1 minute
This article is a worked-out exercise in applying the Scala type system to solve a small-scale optimization problem. For this article we supply complete Scala source code (under a GPLv3 license) and some design discussion.
Estimated reading time: 36 minutes
Readers returning to our blog will know that Win-Vector LLC is fairly “pro-R.” You can take that to mean “in favor of R” or “professionally using R” (both statements are true). Some days we really don’t feel that way.
Estimated reading time: 7 minutes