A lot of machine learning, statistical, plotting, and analytics algorithms over-sell a small evil trick I call “the hyper dance.”
The hyper dance is the venial trick of passing off user-facing technical debt and flaws as user-controllable features. These controls are usually named “hyper parameters”: parameters or arguments that control the behavior of an algorithm. Users think “hyper parameters” must be even better than “regular parameters”, just as “hyper drive” is better than “sub-light drive.” However, the etymology of the name isn’t from science fiction; it comes from the need in statistical contexts for a name for controls other than “parameter”, as parameter is often used to name the fit coefficients of a model (i.e. to name an output, not an input!).
The risk is: we can have an algorithm that for some settings of the hyper parameters runs fast, for some settings produces a compact output, and for some settings is correct. This sounds great. The system authors claim their system is fast, compact, and correct.
Often we find good settings are rare, and there may not even be any settings where all three merits occur together. The algorithm may not have any settings where it is simultaneously fast, compact, and correct. And it may in fact be quite hard to find settings where even one of these properties is achieved.
In some cases, “Freedom’s just another word for nothing left to the right settings.”
I teach hyper parameters as technical debt. If the correct values were obvious and easy to find, then the algorithm designer would have set them for you. You can achieve a lot with hyper parameters, but it is going to require some automated search time.
Some good positive examples of parameterized algorithms include:
- Random Forest machine learning. The default hyper parameter settings tend to work well, as random forest comes from an era where algorithms were expected to work.
- xgboost. xgboost has some critical hyper parameters, but supplies an efficient built-in cross-validation framework to find good values for them.
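The automated search time mentioned above doesn’t have to be exotic. A minimal sketch, using only the Python standard library and purely illustrative names (a toy “shrunken mean” model with one hyper parameter), shows the basic grid-search-plus-cross-validation loop:

```python
# Toy grid search with cross-validation over one hyper parameter.
# All names here are illustrative, not from any particular library.

def fit_shrunk_mean(train, lam):
    """'Fit' a one-number model: the training mean shrunk toward zero.
    lam is the hyper parameter controlling the shrinkage."""
    n = len(train)
    return sum(train) / (n + lam)

def cv_error(data, lam, k=5):
    """k-fold cross-validated squared error for a given lam."""
    folds = [data[i::k] for i in range(k)]
    total, count = 0.0, 0
    for i in range(k):
        hold_out = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        model = fit_shrunk_mean(train, lam)
        for x in hold_out:
            total += (x - model) ** 2
            count += 1
    return total / count

data = [1.0, 2.0, 1.5, 2.5, 1.8, 2.2, 1.9, 2.1, 1.7, 2.3]
grid = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
best_lam = min(grid, key=lambda lam: cv_error(data, lam))
print("selected lam:", best_lam)
```

The point isn’t the particular model; it is that every hyper parameter added to an algorithm puts another dimension into this search loop, and the user (or their compute budget) pays for it.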
We can easily take hyper parameters to their ridiculous (yet natural) extreme: make a machine learning algorithm that simply copies its hyper parameters out as the source code for the “fit model.” This is going to be:
- Fast. The fitting is a simple copy that doesn’t even look at the data.
- Compact. We can hope the user writes short code.
- Best possible. If there were any better model than the one we have, we could ask the user to type in that source code as the correct setting of the hyper parameters.
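This extreme is easy to write down. Here is a short sketch (all names hypothetical) of such a “learner”: its fit step simply copies its hyper parameter out as the model, never consulting the data:

```python
# The degenerate extreme: a "learner" that ignores the data entirely
# and returns its hyper parameter as the fitted model.

def hyper_copy_fit(data, model_source):
    """Fast: the fit never looks at the data. 'Compact' and 'best
    possible' only if the user hands in the right model as the
    hyper parameter -- i.e. the user does all the actual work."""
    _ = data  # the training data is never consulted
    return model_source  # the hyper parameter IS the model

# The user must supply the entire model through the hyper parameter:
model = hyper_copy_fit([(1, 2), (2, 4), (3, 6)], lambda x: 2 * x)
print(model(10))  # whatever the user typed in, nothing was learned
```

All of the claimed merits are real here, but none of them were earned by the algorithm; they were smuggled in through the hyper parameter.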
The above, unfortunately, isn’t completely silly: neural net topologies are considered to be user-supplied hyper parameters. Fortunately, good neural net topologies are routinely shared, so we don’t always have to search for them from scratch.
Some hyper parameters are inevitable, but always consider them as having a cost in addition to a benefit. Just, please, take care not to over-celebrate methodologies that require too long a hyper dance.
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.