Wednesday, August 21, 2013

From Tolerating Faults To Being Recovery Oriented

Many years ago I got into the habit of rebooting my work PC at least weekly. I don’t recall the exact circumstances which prompted this; after all, one of the bragging points for an operating system is how long you can go between reboots.

But I was raised back in the late Win 3.1 and Windows 95 days. I once had a job walking, floppy disks in hand, to each computer in every lab we maintained to update network card drivers. I digress, but the point is that back in the early days rebooting was a daily, if not multiple times daily, affair.

Another motivating factor was the transition from CGI to FastCGI back in the late 1990s. Instead of starting a massive executable to handle a single request, FastCGI allowed a single process to handle multiple requests. 

This meant porting a massive (at the time, to me at least) executable that:

  1. Was riddled with the simplifying assumption that it would only ever process a single request at a time.
  2. Was therefore full of global variables (some even initialized statically, at compile/link time).
  3. Was exceptionally brittle, as these assumptions weren't explicit.
  4. Was developed by brilliant but inexperienced engineers and therefore rediscovered some of the errors a basic CS education (formal or informal) helps you avoid.

One of the things I learned back then was that even after the port was mostly successful (a single process handling multiple requests without crashing), part of its success came from limiting the number of requests a single process handled before being recycled.
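
To make the recycling idea concrete, here is a minimal sketch in Python. It is not the original FastCGI code. The `Worker` and `RecyclingSupervisor` names are hypothetical, and the "process" is simulated by an ordinary object so the example stays self-contained. The `leaked_state` list stands in for the global variables that quietly accumulate junk between requests.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Worker:
    """Stands in for one long-lived request-handling process (hypothetical)."""
    generation: int
    handled: int = 0
    # Simulates global state that accumulates across requests.
    leaked_state: List[str] = field(default_factory=list)

    def handle(self, request: str) -> str:
        self.handled += 1
        self.leaked_state.append(request)
        return f"gen{self.generation}:{request}"


class RecyclingSupervisor:
    """Replaces the worker once it has served max_requests requests."""

    def __init__(self, max_requests: int) -> None:
        if max_requests < 1:
            raise ValueError("max_requests must be at least 1")
        self.max_requests = max_requests
        self.restarts = 0
        self.worker = Worker(generation=0)

    def dispatch(self, request: str) -> str:
        # Start from scratch before the old worker can degrade any further.
        if self.worker.handled >= self.max_requests:
            self.restarts += 1
            self.worker = Worker(generation=self.restarts)
        return self.worker.handle(request)


supervisor = RecyclingSupervisor(max_requests=3)
results = [supervisor.dispatch(f"r{i}") for i in range(7)]
```

With a limit of three, the seventh request lands on the third worker, and whatever the first two workers accumulated is simply gone. In a real deployment the replacement would be a fresh operating system process rather than a fresh object.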

This brings us back to the title of this post. It struck me that limiting the number of requests was basically an admission that errors were unavoidable. Errors were going to happen, so we would build in a way to firewall off many of them by starting from scratch on a regular basis.

Strictly speaking an algorithm that doesn't behave deterministically is, usually, incorrect. The lack of determinism is due to one or more errors (aka bugs) that must be rooted out. At least according to accepted wisdom.

Well, it turns out that another way to increase the determinism of the system is to figure out when, on average, the error rate becomes unacceptably high, and then start over before getting there. As an aside, this strikes me as analogous to the trend in 20th century mathematics to prove the correctness of a given construct in the limit as opposed to correctness always and everywhere.
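
One way to picture "start over before getting there" is a small Python sketch that picks a request limit from observed failures. Everything here is an assumption for illustration: the `choose_request_limit` function is hypothetical, the numbers are made up, and a worker's "age" is simply how many requests it has already served.

```python
from typing import Sequence


def choose_request_limit(
    failures_by_age: Sequence[int],
    attempts_by_age: Sequence[int],
    max_error_rate: float,
) -> int:
    """Return how many requests a worker may serve before being recycled.

    Index k of each sequence holds the counts observed for the (k+1)-th
    request served by a worker, aggregated over many worker lifetimes.
    """
    for age, (failures, attempts) in enumerate(
        zip(failures_by_age, attempts_by_age)
    ):
        if attempts == 0:
            # No evidence this far out, so stop here to be conservative.
            return age
        if failures / attempts > max_error_rate:
            # The error rate is unacceptable at this age: restart before it.
            return age
    return len(failures_by_age)


# Made-up observations: failures climb as the worker gets older.
failures = [0, 0, 1, 5, 20]
attempts = [100, 100, 100, 100, 100]
limit = choose_request_limit(failures, attempts, max_error_rate=0.02)
```

In this made-up data the fourth request is the first whose failure rate (5%) exceeds the 2% budget, so the worker is recycled after three. Nothing has been fixed, but the bad region is never reached.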

This is a profound alternative with implications for the design of large software systems. And by large I mean complex, since size in and of itself creates a kind of complexity. This alternative, Recovery Oriented Computing, basically acknowledges that the complexity of a software project eventually crosses a threshold beyond which errors are guaranteed.

Once this is accepted, and it is a painful thing to accept for me at least, all sorts of other often overlooked considerations come to the fore. Things like:

  • How long does this thing take to start after a crash?
  • How easy is it for us to figure out when we should restart to avoid crashing?
  • Maybe we should emphasize architectural principles that optimize towards minimizing the length of time it takes to start (and restart).
  • Ditto for shutdown/stop.
