Showing posts with label troubleshooting. Show all posts

Sunday, February 8, 2009

Fail Fast, Fail Often

A common misconception among programmers is that software should always attempt to hide failures in distributed systems. This idea seems sensible at first, because distributed systems are full of failures of all kinds: machines crash, software fails, and networks go down, just to name a few. If I am writing a function called transfer_file() which copies a file from one place to another, then I should try to connect multiple times and restart failed transfers for about an hour before giving up, right?

It turns out that transparent fault tolerance is exactly the wrong approach in a large, layered software system. Instead, each layer of software should carefully define consistent failure conditions, and then feel free to fail as often as it wants to.

Here is why: If someone else builds an application that calls transfer_file(), the application itself knows a whole lot more about what kind of fault tolerance is needed. It may turn out that the application knows about several file servers, and if one cannot be reached immediately, then another will do just fine. On the other hand, perhaps transfer_file will be used in some batch workload that will run for weeks, so it is vital that the transfer be retried until success.

If you want to build a controllable system, then your building blocks must have very precise failure semantics. Unfortunately, many system calls have such vague semantics that they are nearly impossible to use correctly in the presence of failures. Consider, for example, the Unix system call connect(), which initiates a TCP connection to a remote host. Here are some possible results from connect():
  1. If the host does not respond to IP traffic, connect() will block for an interval configured in the kernel (anywhere from minutes to hours), and then return ETIMEDOUT.
  2. If a router or switch determines that the host is not routable, then in a few seconds connect() will return with the error EHOSTUNREACH.
  3. If the host is up, but there is no process listening on the port, then connect() will return almost immediately with ECONNREFUSED.

Depending on the precise nature of the failure, the call might return immediately, or it might return after a few hours. And, the distinction between these failure modes hardly matters to the user: in each case, the requested service is simply not available. Imagine trying to build an application that will quickly connect to the first available server, out of three. Yuck.
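The inconsistency is easy to observe from any language that exposes connect(). Here is a small Python probe (the helper name and timeout value are ours, for illustration); it reports which error a connection attempt produced and how long the attempt took:

```python
import errno
import socket
import time

def try_connect(host, port, timeout=5.0):
    """Attempt one TCP connection; report the outcome and how long it took."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)  # without this, a dead host can block for minutes
    start = time.monotonic()
    try:
        s.connect((host, port))
        return "connected", time.monotonic() - start
    except OSError as e:
        # ECONNREFUSED returns almost instantly; ETIMEDOUT only after
        # the full timeout; EHOSTUNREACH somewhere in between.
        name = errno.errorcode.get(e.errno, str(e))
        return name, time.monotonic() - start
    finally:
        s.close()
```

Against a closed port the probe returns ECONNREFUSED in milliseconds; against a host that silently drops packets it waits out the whole timeout, which is exactly the gap in semantics described above.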

To get around this, all our software uses an intermediate layer that does a fair amount of work to place consistent failure semantics on system calls. For example, instead of using BSD sockets directly, we have a layer called link with operations like this:

  • link_connect( address, port, timeout );
  • link_accept( link, timeout );
  • link_read( link, buffer, length, timeout );

Inside each of these operations, the library carefully implements the desired failure semantics. If an operation fails quickly, then it is retried (with an exponential backoff) until the timeout has expired. If an operation fails slowly, then it is cleanly aborted when the timeout expires. With these in place, we can build higher level operations that rely on network communication without getting unexpectedly stuck.
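The C internals of the link layer are not shown here, but the retry-with-backoff logic it implements might be sketched in Python like this (the backoff constants are assumptions):

```python
import socket
import time

def link_connect(address, port, timeout):
    """Connect with consistent failure semantics: retry fast failures
    with exponential backoff, abort slow failures at the deadline.
    Returns a connected socket, or raises TimeoutError."""
    deadline = time.monotonic() + timeout
    delay = 0.1  # initial backoff between fast failures
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError(f"could not connect to {address}:{port} "
                               f"within {timeout} seconds")
        try:
            # bound each individual attempt by the time we have left
            return socket.create_connection((address, port),
                                            timeout=remaining)
        except OSError:
            # fast failure (e.g. ECONNREFUSED): back off and retry
            time.sleep(min(delay, max(0.0, deadline - time.monotonic())))
            delay *= 2
```

With this in place, the "first available of three servers" application above becomes simple: call link_connect on each address with a short timeout and take the first call that returns a socket.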

Here is an example where precise failure detection really matters. In an earlier post, I wrote about the Wavefront abstraction, which is a distributed computing model with a lot of dependencies. In a Wavefront problem, we must first execute one process in the lower left hand corner. Once that is complete, we can run two adjacent functions, then three, and so on:


If we run a Wavefront on a system of hundreds of processors, then delays are inevitable. What's more, a delay in the computation of any one result slows down the whole system. To avoid this problem, we keep running statistics on the expected computation time of any node, and set timeouts appropriately. If any one computation falls more than a few standard deviations beyond the average, we abort it and try it on another processor. We call this technique "Fast Abort".
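A minimal sketch of that abort rule, using running sums to track the mean and standard deviation of completed tasks (the class name and the minimum-sample guard are our additions):

```python
import math

class FastAbort:
    """Track task runtimes; flag any running task that exceeds the
    mean by more than k standard deviations as a candidate to abort
    and reschedule on another processor."""
    def __init__(self, k=3.0, min_samples=10):
        self.k = k
        self.min_samples = min_samples
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def record(self, runtime):
        # update running sums, from which mean and variance follow
        self.n += 1
        self.total += runtime
        self.total_sq += runtime * runtime

    def should_abort(self, elapsed):
        if self.n < self.min_samples:
            return False  # not enough history to judge
        mean = self.total / self.n
        var = max(0.0, self.total_sq / self.n - mean * mean)
        return elapsed > mean + self.k * math.sqrt(var)
```

A scheduler would call record() as tasks complete, and periodically call should_abort() on the elapsed time of each running task; any flagged task is killed and resubmitted elsewhere.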

Here is the effect of this technique on a large wavefront problem. The X axis shows time, and the Y axis shows the number of tasks currently running. The bottom line shows the reliable technique of waiting and retrying tasks until they succeed. The top line shows what happens with Fast Abort. As you can see, Fast Abort reaches a high degree of parallelism much more rapidly.

The moral of the story is: Make failure conditions an explicit part of your interface. If you make it very clear how and when a call can fail, then it is very easy for applications to implement fault tolerance appropriate to the situation at hand.

Wednesday, January 14, 2009

Audit Trails in Voting Machines

Kim Zetter at Wired magazine recently wrote about the use of log files in electronic voting machines. (It actually shows snippets of the relevant data, which is a refreshing use of primary evidence in journalism.) The article illustrates an often overlooked rule of software engineering:

A DEBUG FILE IS NOT THE SAME THING AS AN AUDIT TRAIL.

Here is my rough guess at what happened: Political forces informed the software company that the voting machines must produce an audit trail. Management instructed the programmers to produce a log file of some kind. The programmers already had some debug log files, so they added a few more printfs, and everyone seemed happy.

As Ms. Zetter explains, election officials attempted to read the audit trail and discovered it was essentially useless. Events were recorded in inconsistent ways. Items were incompletely specified, so the reader couldn't distinguish between deck zero on ballot A and deck zero on ballot B. Data in some messages was just plain wrong. Uncommon but expected events were recorded with scary messages like "Exception!" or "Lost deck!"

The problem is that the programmers created a debug file instead of creating a distinct audit trail. A debug file is a handy tool for recording messages during development. Debug messages are added haphazardly as the programmer works through tricky bits of code. They are often cryptic or personal, because they are only intended to be read by the person who wrote them. For example, here is a bit of a debug log from the Condor distributed system. Note the ZKM prefix: one of the programmers put his initials in his messages to make them easy to find.

1/7 19:42:12 (89102.0) (9319): in pseudo_job_exit: status=0,reason=100
1/7 19:42:12 (89102.0) (9319): rval = 0, errno = 25
1/7 19:42:12 (89102.0) (9319): Shadow: do_REMOTE_syscall returned less than 0
1/7 19:42:12 (89102.0) (9319): ZKM: setting default map to (null)
1/7 19:42:12 (89102.0) (9319): Job 89102.0 terminated: exited with status 0
1/7 19:42:12 (89102.0) (9319): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

An audit trail is something completely different. It communicates to a third party some essential property of the system, such as number of users served, tax forms in progress, or ballots submitted. If you are creating an audit trail, you must add carefully crafted audit statements to the code in strategic places. Someone other than the programmer is going to read it, so it must be clear, concise, and consistent.

For example, the following is an audit log for users logging in and out of a standard Linux machine:

dthain pts/2 Tue Jan 13 20:56 - 21:23 (00:26)
dthain pts/2 Tue Jan 13 12:53 - 16:22 (03:29)
dthain pts/1 Tue Jan 13 12:52 still logged in

To summarize, a debugging file usually:
  • has an ad-hoc format to facilitate human consumption.
  • omits normal events and reports unusual or unexpected events.
  • does not completely describe the state of the system.

But an audit trail must:
  • have a well defined format that facilitates automatic analysis.
  • record every essential event in the system, whether normal or abnormal.
  • completely describe the essential state of the system at any time.
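One way to enforce those three properties is to route every audit record through a single function with a fixed, complete format. The field layout below is an illustration of the idea, not the voting machines' actual format:

```python
import datetime

def audit(logfile, event, **fields):
    """Append one audit record: a timestamp, an event name, and a
    complete, consistently ordered set of key=value fields."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    detail = " ".join(f"{k}={fields[k]}" for k in sorted(fields))
    logfile.write(f"{stamp} {event} {detail}\n")
```

Because every record carries a timestamp, an event name, and a consistently ordered set of fields, "deck zero on ballot A" and "deck zero on ballot B" can never be confused, and the file can be parsed mechanically.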
Now, we may interpret the problem with the voting machines in two different ways.

First, there is a language problem. Political forces used the term "audit trail", but at some step in communication, this was corrupted to "log file". Perhaps the programmers observed that they had a debug file, added a few more printfs, and assumed that the requirement was satisfied. You can see how this accident might have been made in good faith.

Second, there is a serious oversight problem. The purpose of an audit trail is to allow a third party to read the output and draw conclusions about the system. If we only discover that the audit trail is useless after the election, we can only conclude that nobody looked at it during testing. If the project managers and their political overseers had demanded to see the so-called audit trail during testing, the entire problem would have been avoided.

So, the two morals of the story for computer programmers are:

  • A debug file is not the same thing as an audit trail.
  • Always double check that you got exactly what you asked for.

Monday, October 6, 2008

Troubleshooting Distributed Systems via Data Mining

One of our students, David Cieslak, just presented this paper on troubleshooting large distributed systems at the IEEE Grid Computing Conference in Japan. Here's the situation:
  • You have a million jobs to run.
  • You submit them to a system of thousands of CPUs.
  • Half of them complete, and half of them fail.
Now what do you do? It's hopeless to debug any single failure, because you don't know if it represents the most important case. It's a thankless job to dig through all of the log files of the system to see where things went wrong. You might try re-submitting all of the jobs, but chances are that half of those will fail, and you will just be wasting more time and resources pushing them to the finish.

Typically, these sorts of errors arise from an incompatibility of one kind or another. The many versions of Linux present the most outrageous examples. Perhaps your job assumes that the program wget can be found on any old Linux machine. Oops! Perhaps your program is dynamically linked against SSL version 2.3.6.5.8.2.1, but some machines only have version 2.3.6.5.8.1.9. Oops! Perhaps your program crashes on a machine with more than 2GB of physical memory, because it performs improper pointer arithmetic. Oops!

So, to address this problem, David has constructed a nice tool that reads in some log files, and then diagnoses the properties of machines or jobs associated with failures, using techniques from the field of data mining. (We implemented this on log files from Condor, but you could apply the principle to any similar system.) Of course, the tool cannot diagnose the root cause, but it can help to narrow down the source of the problem.
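In spirit (though the actual tool applies real data mining techniques), the analysis resembles ranking property values by their failure rate; the record format here is hypothetical:

```python
from collections import Counter

def suspect_properties(jobs, min_count=5):
    """Given job records as (properties_dict, succeeded) pairs, rank
    each (property, value) pair by its failure rate. Values with
    failure rates near 1.0 are candidates for the root cause."""
    failures = Counter()
    totals = Counter()
    for props, succeeded in jobs:
        for item in props.items():  # e.g. ("FilesystemDomain", "hep.wisc.edu")
            totals[item] += 1
            if not succeeded:
                failures[item] += 1
    ranked = [(failures[item] / totals[item], item)
              for item in totals if totals[item] >= min_count]
    return sorted(ranked, reverse=True)
```

Run over a few thousand job records, the top of this ranking points a human straight at the machines or configurations worth investigating, even though it says nothing about why they fail.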

For example, consider a user running several thousand jobs on our 700 CPU Condor pool. Jobs tended to fail within minutes on a certain set of eleven machines. Of course, as soon as those jobs failed, the machines were free to run more jobs, which promptly failed. By applying GASP, we discovered a common property among those machines:

(TotalVirtualMemory < 1048576)

They only had one gigabyte of virtual memory! (Note: The units are KB.) Whenever a program would consume more than that, it was promptly killed by the operating system. This was simply a mistake made in configuration -- our admins fixed the setting, and the problem went away.

Here's another problem we found on the Wisconsin portion of the Open Science Grid. Processing the log data from 100,000 jobs submitted in 2007, we found that most failures were associated with this property:

(FilesystemDomain="hep.wisc.edu")

It turns out that a large number of users submitted jobs assuming that the filesystem they needed would be mounted on all nodes of the grid. Not so! Since this was an historical analysis, we could not repair the system, but it did give us a rapid understanding of an important usability aspect of the system.

If you want to try this out yourself, you can visit our web page for the software.