Recently Heroku was accused of using random queue routing while claiming to supply something similar to shortest-queue routing (see: James Somers – Heroku’s Ugly Secret, and more discussion on Hacker News: Heroku’s Ugly Secret). If this is true it is pretty bad. I like randomized algorithms and I like queueing theory, but you need to work through proofs or at least simulations when playing with queues. You don’t want to pick an arbitrary algorithm and claim it works “due to randomness.” We will show a very quick example where randomized routing is very bad with near certainty. Just because things are “random” doesn’t mean you can’t or shouldn’t characterize them.
We are going to rigorously analyze a very simple queueing problem without bringing in any big hammers of queueing theory. To get away with this we will restrict ourselves to a very simple queueing system. The simple system is picked to show that the problem with bad queueing algorithms is in the bad algorithms themselves and not due to any exotic “heavy tail behavior” of this or that distribution.
Suppose a job taking 2 seconds to process arrives every one second, on the second, into our server farm. Further suppose we have two servers in our server farm and a router that can pick where to send the job instantly. To say it again: for our entire server farm we regularly get a job taking 2 seconds once every second and we have two servers we can instantly route to (so we expect to be able to keep up).
The ideal deterministic routing schedule in this case is to route to the server with the shortest (or even empty) job queue. If we start with both servers idle and route the first job to the first server we never have to wait: we work with no delays, no idle servers, and perfect efficiency. The jobs slip into the two servers just as previous jobs finish. We illustrate this in the figure below. We show the individual server queues (post router) for each of the two servers at the beginning of a few time ticks.
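To make the tick-by-tick bookkeeping concrete, here is a minimal Python sketch (not from the original post; the function name and accounting conventions are my own) of the two-server farm under shortest-queue routing. Each tick one 2-unit job arrives and every non-empty server completes one unit of work:

```python
def simulate_shortest_queue(ticks):
    queues = [0, 0]   # remaining units of work on each server
    max_backlog = 0
    for _ in range(ticks):
        # route the new 2-unit job to the server with the least work
        queues[queues.index(min(queues))] += 2
        max_backlog = max(max_backlog, sum(queues))
        # every non-empty server completes one unit of work this tick
        queues = [max(0, q - 1) for q in queues]
    return max_backlog

# The farm keeps up: total queued work never exceeds 3 units,
# no matter how long we run.
print(simulate_shortest_queue(10_000))  # → 3
```

The bounded backlog is exactly the “perfect efficiency” claim above: once warmed up, both servers are always busy, so the farm does 2 units of work per tick, matching arrivals.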
In this design, once something is placed in a server queue it stays there until executed. Letting busy servers send jobs back to the router would be a nice feature, but it seems not to be present in the system everyone is talking about. Now suppose we make the “harmless” change of routing each job to a random server. The longest job queue will grow arbitrarily while slowly bouncing from server to server. This will happen with near certainty. This randomized routing algorithm is pretty much guaranteed to fail.
And here is why (without any math beyond addition and subtraction). At every time tick 2 units of work arrive into your server farm. Also at every time tick your farm performs a number of units of work equal to the number of non-empty servers. The “always route to the server with the least to do” algorithm made sure there were never any empty or idle servers, so every time tick the server farm did 2 units of work and the farm kept up. The random routing algorithm sometimes leaves a server empty. Every time this happens the server farm does only one unit of work, as the empty server does nothing useful. When this happens you fall one more tick behind. The total queue length of both servers after the first tick is going to be (as in our illustration above) 3 units plus however many units of work you are behind. The number of units of work you are behind is always exactly the number of time ticks on which some server was empty. In this idealized problem you can never catch up: any tick lost is lost forever. This is an invariant of the system: missed ticks equal exactly how far behind you are. The moral: don’t look at the overwhelmed servers to explain where things are going wrong, look at the idle ones.
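The invariant is easy to check empirically. This is a sketch of my own (the function name and seed are arbitrary), which counts idle ticks and verifies that the leftover backlog equals them exactly:

```python
import random

def simulate_random_routing(ticks, seed=1):
    rng = random.Random(seed)
    queues = [0, 0]   # remaining units of work on each server
    idle_ticks = 0
    for _ in range(ticks):
        queues[rng.randrange(2)] += 2    # route the 2-unit job at random
        idle_ticks += queues.count(0)    # 0 or 1: an empty server wastes this tick
        queues = [max(0, q - 1) for q in queues]
    return sum(queues), idle_ticks

backlog, idle = simulate_random_routing(100_000)
print(backlog, idle)  # the two numbers are always equal
```

The equality holds by simple accounting: after t ticks, 2t units have arrived and (2 − empties) units were done each tick, so the remaining backlog is exactly the running count of empty-server ticks.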
How fast do you fall behind? My estimate is you expect to be about one half of the square root of n ticks behind at time tick n. So the rate at which you fall behind decreases, because more and more often both server queues are full (and possibly both very long, as you are falling behind). The intuition here is that (except at queue length zero) each server sees its queue either shrink by one unit (it didn’t get the new 2-unit job and was able to quietly work its queue down by one) or grow by one unit (it does one unit of work, but is awarded two more units, for a net growth of one). These two events happen with equal probability of 1/2 each, except in the empty state, where we either stay empty or move to depth 2. This sort of notion of moving from state to state is what we call a Markov chain. The Markov chain for a single server observed alone is shown in the diagram below:
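We can simulate this single-server chain directly and count how often the server sits idle. A sketch (my own function name and parameters; the chain transitions are as described above):

```python
import random

def idle_visits(n, trials=300, seed=0):
    """Average number of ticks (out of n) a single server spends empty."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        depth = 0
        for _ in range(n):
            if depth == 0:
                total += 1                              # idle this tick
                depth = 2 if rng.random() < 0.5 else 0  # stay empty or jump to 2
            else:
                depth += 1 if rng.random() < 0.5 else -1
    return total / trials

print(idle_visits(2_500), idle_visits(10_000))
# quadrupling the horizon roughly doubles the idle count: sqrt(n) growth
```

Idling never stops, it just becomes rarer, which is exactly the “fall behind ever more slowly, but never catch up” behavior.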
Let’s alter this Markov chain slightly. We will change the growth transition out of the empty state to move up only one step (like all the other up transitions do) and also change the self-transition at zero to move up one as well. This gives us the more regular Markov chain drawn below:
The above chain has pretty much the same time-to-empty characteristics as the first Markov chain. In fact only the transitions out of the empty-queue state have been changed, so any calculation of how long it takes to get to the empty state (where we lose a tick) from a given non-empty state will be exactly the same.
It is a fun fact of Markov chain theory that the above Markov chain (a random walk with a reflecting barrier) is exactly equivalent to the Markov chain below, if we ignore the minus signs on the new state labels.
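A quick empirical check of this equivalence (a sketch of my own, not from the post): simulate the reflecting-barrier chain and the absolute value of the drunkard’s walk independently, and compare the average final depths.

```python
import random

def final_reflected(n, rng):
    d = 0
    for _ in range(n):
        if d == 0:
            d = 1   # both transitions at the barrier lead up to 1
        else:
            d += 1 if rng.random() < 0.5 else -1
    return d

def final_abs_walk(n, rng):
    w = 0
    for _ in range(n):
        w += 1 if rng.random() < 0.5 else -1   # drunkard's walk step
    return abs(w)

rng = random.Random(0)
trials, n = 4000, 400
m1 = sum(final_reflected(n, rng) for _ in range(trials)) / trials
m2 = sum(final_abs_walk(n, rng) for _ in range(trials)) / trials
print(round(m1, 1), round(m2, 1))  # both near 0.8 * sqrt(400) = 16
```

Both averages land near sqrt(2n/π) ≈ 0.8·sqrt(n), as the equivalence predicts.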
This chain is called the drunkard’s walk on the integers (it stumbles left and right with equal probability). It is also equivalent to the number of heads minus the number of tails seen when we flip a fair coin a large number of times. And here comes the punch-line. In flipping a fair coin n times, the number of heads minus the number of tails will often be about the square root of n in magnitude (the binomial distribution this process follows has variance n/4, which is enough to establish this, even though the expected value of the difference is zero). So we expect the Markov chain for a single server in our server farm to be about square root of n deep around n steps into the simulation, on average. This depth is built by accidentally idling the servers from time to time (so roughly we expect one of the servers to go idle about every square root of n time units after running for n time units; going empty gets rarer but still happens).
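The sqrt(n) scale is easy to see in a coin-flip simulation (again a sketch of my own, with arbitrary trial counts): quadrupling the number of flips should roughly double the typical value of |heads − tails|.

```python
import random

def mean_abs_displacement(n_flips, trials=2000, seed=0):
    """Average |heads - tails| over many runs of n_flips fair flips."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pos = sum(1 if rng.random() < 0.5 else -1 for _ in range(n_flips))
        total += abs(pos)
    return total / trials

for n in (100, 400, 1600):
    print(n, round(mean_abs_displacement(n), 1))
# quadrupling n roughly doubles the typical displacement: sqrt(n), not n
```

This is the key contrast with the shortest-queue farm: the wasted ticks accumulate at a sqrt(n) rate, slowly but without bound.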
Without a lot of math we have characterized the overall behavior of random routing in this simple server farm: you fall behind and never catch up. This is unfortunate, as the route-to-the-server-with-the-shortest-queue algorithm keeps up effortlessly. Also notice that this doesn’t depend on any weird long-running jobs; the failure of random routing is due to the wastefulness of the randomized routing algorithm (and the packed schedule that gives us no chance to catch up).
(Changed title from “deterministic,” as the failure is not really “deterministic”; the actual technical phrase is “almost certain” or “with probability 1,” i.e. it will fail except for perfectly evenly spaced sequences like HTHTHT….)
(Sounds like Heroku admits there is a problem. I hope they move on to re-introduce smart routing.)
Data Scientist and trainer at Win Vector LLC. One of the authors of Practical Data Science with R.
Can you elaborate on what you mean by “In flipping a fair coin n times we expect the number of heads minus the number of tails to be about the square-root of n (the binomial distribution which this process follows has variance n/4 which is enough to establish this)”
Wouldn’t the expectation of heads minus tails be zero? Your argument makes sense, I just got lost a little here.
@sana horia Yes, I was a little careless with language there. The expected value is zero. What I mean is we will typically see |heads – tails| on the order of sqrt(n): the number of heads has standard deviation sqrt(n)/2, so heads minus tails (twice the heads count minus n) has standard deviation sqrt(n). And we can’t always be far below that, by theorems like the law of the iterated logarithm.
Una hirundo non facit primavera. (“One swallow does not make a spring.”)
You picked a use case that requires 100% efficiency, and an algo that exactly fits that use case. Then you showed that the randomized algo does not work well in that case. Does this mean that the random algo is bad?
No, it does not. It just means for this specific case, the other algo is better.
To add to that, the use case is hardly common in practice. Heroku cannot know the job cost upfront. Also, no company (they included) should rely on 100% efficiency: that’s a recipe for disaster.
To be clear, this is not an endorsement for Heroku: I think what they did is sleazy. However, randomized algorithms have their place, and frankly, perform well in a lot of cases, especially when load is not known upfront, and you don’t want a lot of implementation complexity (i.e. an engineering/good-enough solution)