Why exponential backoff? To put a finer point on the question, How should I choose the parameters for my exponential backoff algorithm? I think many people choose parameters that back off too much, too fast.
Internally, the shell takes care of all of the error detection, retries, and so forth, so that the programmer can concentrate on the essential issues. The end result is that the system becomes much more robust to load bursts. For example, the following graph shows the performance of many clients submitting batch jobs to a queue using three methods: the Ethernet approach, the Aloha approach (an intermediate step), and a simple fixed retry:
As you can see, the fixed approach crashes to zero after about 400 clients, whereas the Ethernet approach continues to maintain a high level of throughput. It is not as high as the performance under low load, but it is relatively stable over a wide range of load.
The disadvantage to using exponential backoff is that it is going to extend the time to recovery after a failure by about a factor of two. Suppose that you are a client talking to a web server which crashes. You wait one second, try again, then two seconds, and so on. If the web server is unavailable for thirty seconds and then recovers, the client will not notice right away, because it will be in the middle of waiting for thirty seconds before trying again. Now, extending a thirty second outage to a sixty second outage is unlikely to cause any real heartache. But, what about extending a thirty minutes to sixty minutes? That could be irate customer territory.
So, you need to balance the needs of your customers against the capacity of your system. If you you want to handle 1000 clients and have a maximum recovery-after-failure time of one second, then you had better make sure that your system can handle 1000 failed requests per second at a sustained rate. That may sound easy, but if each failed request involves a database query, a write to a log file, and an email to an administrator, then you will be quickly overwhelmed.
Now let's answer the original question: How should I pick the backoff parameters?
Let's assume that they delay chosen at any point is based on an initial timeout (T), an exponential factor (F), the number of retries so far (N), a random number (R), and a maximum timeout (M). Then:
delay = MIN( R * T * F ^ N , M )
- R should be a random number in the range [1-2], so that its effect is to spread out the load over time, but always more conservative than plain backoff.
- T is the initial timeout, and should be set at the outer limits of expected response time for the service. For example, if your service responds in 1ms on average but in 10ms for 99% of requests, then set t=10ms.
- F doesn't matter much, so choose 2 as a nice round number. (It's the exponential nature that counts.)
- M should be as low as possible to keep your customers happy, but high enough that the system can definitely handle requests from all clients at that sustained rate.