We share our admiration for a set of results called “locality sensitive hashing” by demonstrating a greatly simplified example that exhibits the spirit of the techniques.Locality sensitive hashing is awe inspiring in its originality, simplicity, beauty and effectiveness. In addition locality sensitive hashing is a remarkable technique as it works even when drastically abridged and simplified. In this paper (link to pdf) we give a description of conditions where the technique works and a heuristic argument why it works (using only elementary math).
Categories: Computer Science Exciting Techniques Expository Writing Opinion Tutorials
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Some simple example code now up on GitHub: https://github.com/WinVector/Locality-Sensitive-Hashing-Example.
It is just a simple example on random data- but even on as few as 10,000 vectors we see reliable results and a 25 times speedup over brute force. Also the simple code deals two major problems seen when working in Java: expense of many small objects and inability to reliably key things off floating point (due to non-repeatiablity default Java floating point).
The detailed log given in the example show how we sweep all projection widths until we find a sweet spot where we are getting sets small enough to inspect (look at fraction inspected, when it is low we are skipping large components).