Prof. Douglas Thain: 2009

Thursday, October 8, 2009

On Programming With Processes, Part II

One of the biggest challenges in building computer systems is finding a way to make things simpler. Any propeller-head can make a piece of software more complicated. Unfortunately, our industry seems to have a way of gravitating toward the complex. Let's look at the current state of the web browsers -- pick any one -- which seem to insist upon reimplementing or otherwise abusing the operating system.

Exhibit 1: About 2003, tabbed browsing is heralded as the wave of the future, and every web browser re-writes itself from scratch to support tabs and issues gushing press releases. What is a tab? Well, it's a way to switch between multiple running programs, each with its own title and visual space. Which is to say... it's like having windows! Except it's worse than having windows, it's like the old awful Multiple Document Interface, which even Microsoft now admits confused the heck out of everyone.

The funny thing is, you can achieve exactly the same behavior by dragging your taskbar to the top of the screen, like this:

Exhibit 2: You cannot run the latest version of Netscape (a.k.a Mozilla, Firefox, SeaMonkey, IceWeasel, good grief...) if your home directory is on a distributed file system. Never mind that putting your home directory on a shared filesystem is the normal practice in 90% of the industrialized world, where the user of the machine works for an organization that keeps important documents on a central server.

Apparently, Firefox uses an embedded database to store your preferences, bookmarks, cache, etc, and it cannot tolerate multiple simultaneous access. So, if you try to run multiple instances at once, it has to be clever enough to find the running copy and tell it to open a new window. If it cannot find it because the other copy is running in another console or on another machine, you get this ridiculous message:

Exhibit 3: Google Chrome is supposed to be the re-invention of the web browser, except simpler and more robust. Instead of threads, it uses this new-fangled technology called "processes" instead of those old gnarly threads. So far, so good. Then Firefox decides to get on this bandwagon.

Unfortunately, Firefox is missing the point entirely. The plan is to break the UI that controls all the windows into one process, and the plugins, parsers, renderers, etc into separate processes. It should come as no surprise that this makes things even more complicated, because the various pieces have to communicate with each other. More subtly, it makes the failure semantics really strange: if a helper process dies, one window will fail, but if the UI process dies, a whole bunch of windows will fail. If you look at the set of running processes, you are going to see an unpredictable number of processes with names that have no relation to what you are actually doing.

Everyone seems to have missed a ridiculously simple solution to all of these problems: Run each browser window in a separate process. You don't have to separate out all of the complex plugins, renderers, and so forth, because if one crashes, it will only take down that window. Furthermore, to open a new browser page in any context, all you have to do is fork() and exec("browser http://") and the operating system takes care of the rest.

Thursday, October 1, 2009

Partly Cloudy with a Chance of Condor

We have been thinking about cloud computing quite a bit over the last month. As I noted earlier, cloud computing is hardly a new idea, but it does add a few new twists on some old concepts in distributed systems. So, we are spending some time to understand how we can take our existing big applications and make them work with cloud systems and software. It should come as no surprise that there are a number of ways to use Condor to harness clouds for big applications.

Two weeks ago, I gave a talk titled Science in the Clouds at an NSF workshop on Cloud Computing and the Geosciences. One of the points that I made was that although clouds make it easy to allocate new machines that have exactly the environment you want, they don't solve the problem of work management. That is, if you have one million tasks to do, how do you reliably distribute them between your workstation, your campus computer center, and your cloud workforce? For this, you need some kind of job execution system, which is largely what grid computing has focused on:

As it stands, Condor is pretty good at managing work across multiple different kinds of systems. In fact, today you can go to a commercial service like Cycle Computing, who can build an on-demand Condor pool by allocating machines from Amazon:

Just today, we hosted Dhruba Borthakur at Notre Dame. Dhruba is the project lead for the open source Apache Hadoop system. We are cooking up some neat ways for Condor and Hadoop to play together. As a first step, one of my students Peter Bui has cooked up a module for Parrot that talks to HDFS, the Hadoop file system. This allows any Unix program -- not just Java -- talk to HDFS, without requiring the kernel configuration and other headaches of using FUSE. Then, you can submit your jobs into a Condor pool and allow them to access data in HDFS as if it were a local file system. The next step is to co-locate the Condor jobs with the Hadoop data that they want to access.

Finally, if you are interested in cloud computing, you should attend CCA09 - Cloud Computing and Applications - to be held in Chicago on October 20th. This will be a focused, one day meeting with speakers from industry, academia who are both building and using cloud computers.

Monday, August 3, 2009

REU Project: BXGrid

This post continues last week's subject of summer REU projects.

Rachel Witty and Kameron Srimoungchanh worked on BXGrid, our web portal and computing system for biometrics research. This project is a collaboration between the Cooperative Computing Lab and the Computer Vision Research Lab at Notre Dame. Hoang Bui is the lead graduate student on the project. Rachel and Kameron added a bunch of new capabilities to the system; I'll show three examples today.

The first is the ability to handle 3-D face scans taken by a specialized camera equipped with a laser rangefinder. The still picture here doesn't quite do it justice, because each white "mask" on the left is a rotating animation of the face. By integrating this data into BXGrid, the 3-D data can be validated against previous ordinary images of the face.

I previously discussed All-Pairs problems, which are common in biometrics. While we already had the ability to run very large All-Pairs, problems, we never had the capability to view the results easily. Now, with the click of a button, you can set up a small All-Pairs problem and view the results on the portal:

Currently, new data ingested into the system is validated manually by people who must visually check that a eye, face, or whatever matches existing data in the system. Although this can be divided up among a large team of people, it is still time consuming and error prone.

Kameron and Rachel built a system that does a first pass at this task automatically. Using Makeflow, they set up a system to export all newly ingested images along with five good images that should match. This results in thousands of jobs sent to our Condor pool, which transform and compare the images. When all the results come back, you get a nice web page that summarizes the images and the results:

This research was supported in part by the National Science Foundation via grant NSF-CCF-0621434.

Tuesday, July 28, 2009

REU Project: Biocompute

This summer, we hosted four REU students who contributed to two web portals for distributed computing: Biocompute and BXGrid. I'll write about one this week and the other next week.

REU students Ryan Jansen and Joey Rich worked with recent grad Rory Carmichael on Biocompute, our web portal and computing system for bioinformatics research. Biocompute was originally created by Patrick Braga-Henebry for his B.S. honors thesis, and we are now putting it into production in collaboration with the Bioinformatics Core Facility at Notre Dame.

Biocompute allows researchers at Notre Dame to run standard bioinformatics tools like BLAST, and then share and manage the results. The new twist is that we transparently parallelize the tasks and run them on our campus Condor pool. This allows people to run tasks that were previously impossible: we routinely run workloads that would take months on a single machine, but get completed in hours on Biocompute.

The user simply fills out a form specifying the query, genomic databases, and so forth:

Biocompute transforms the request into a large Makeflow job that looks like this:

Users and administrators can view the progress of each job:

When the task is complete, you can browse the results, download them, or feed them into another tool on the web site:

This work was sponsored in part by the Bioinformatics Core Facility and the National Science Foundation under grant NSF-06-43229.

Friday, July 3, 2009

Make as an Abstraction for Distributed Computing

In previous articles, I have introduced the idea of abstractions for distributed computing. An abstraction is a way of specifying a large amount of work in a way that makes it possible to be distributed across a large computing system. All of the abstractions I have discussed so far have a compact, regular structure.

However, many people have large workloads that do not have a regular structure. They may have one program that generates three output files, each consumed by another program, and then joined back together. You can think of these workloads as a directed graph of processes and files, like the figure to the right.

If each of these programs may run for a long time, then you need a workflow engine that will keep track of the state of the graph, submit the jobs for execution, and deal with failures. There exist a number of workflow engines today, but without naming names, they aren't exactly easy to use. The workflow designer has to write a whole lot of batch scripts, or XML, or learn a rather complicated language to put it all together.

We recently wondered if there was a simpler way to accomplish this, A good workflow language should make it easy to do simple things, and at least obvious (if not easy) how to specify complex things. For implementation reasons, a workflow language needs to clearly state the data needs of an application: if we know in advance that program A needs file X, then we can deliver it efficiently before A begins executing. If possible, it shouldn't require the user to know anything about the underlying batch or grid systems.

After scratching our heads for a while, we finally came to the conclusion that good old Make is an attractive worfklow language. It is very compact, it states data dependencies nicely, and lots of people already know it. So, we have built a workflow system called Makeflow, which takes Makefiles, and runs them on parallel and distributed systems. Using Makeflow, you can take a very large workflow and run it on your local workstation, a single 32-core server, or a 1000-node Condor pool.

What makes Makeflow different from previous distributed makes is that is does not rely on a distributed file system. Instead, it uses the dependency information already present in the Makefile to send data to remote jobs. For example, if you have a rule like this:

output.data final.state : input.data mysim.exe
./mysim.exe -temp 325 input.data

then Makeflow will ensure that the input files input.data and mysim.exe are placed at the worker node before running mysim.exe. Afterwards, Makeflow brings the output files back to the initiator.

Because of this property, you don't need a data center in order to run a Makeflow. We provide a simple process called worker that you can run on your desktop, your laptop, or any other old computers you have lying around. The workers call home to Makeflow, which coordinates the execution on whatever machines you have available.

You can download and try out Makeflow yourself from the CCL web site.

Tuesday, June 2, 2009

Grid Heating: Putting Data Center Heat to Productive Use

Dr. Paul Brenner, a research scientist in the Computing Research Center at the University Notre Dame, has been advocating a novel idea called grid heating. He recently won a "Green IT Award" from the Uptime Institute for his work. Here is a short introduction to the idea:

Around the world, large data centers consume enormous amounts of power. In addition to the energy needed to spin disks and rearrange electrons, an approximately equal amount of power is needed to run the air conditioners and fans to remove that heat from the data center. In this sense, data centers are doubly inefficient, because they are using power to both heat and cool the same space. If we could put that heat to productive use, then we could save energy on cooling the data center, as well as save energy that would have otherwise been used to generate heat.

Last year. Dr. Brenner constructed a prototype of this idea at the city greenhouse in South Bend, which was struggling with enormous heating bills during the winter. He constructed a small cluster, and placed it in the Arizona Desert display in the greenhouse, where the plants need the highest temperature. Notre Dame paid the electricity bill, the greenhouse got the benefit of the heat, and the computers simply joined our campus Condor pool. Everybody wins, and nobody has to pay an air conditioning bill.

However, the first cluster was just a prototype, and couldn't generate nearly enough heat for the entire greenhouse. So, this year, Dr. Brenner is building a small data center in a modular shipping container. next to the greenhouse. With a new electricity and network hookup, the data center will run several hundred CPUs, and function as a secondary furnace for the facility, hopefully reducing the heating bill by half over the winter.

The new facility will significantly add to our campus grid, and will also give us some interesting scheduling problems to work on. The greenhouse needs heat the most during the winter, and to a lesser extent during the summer, so the computing capacity of the system will change with the seasons. Further, the price of electricity varies significantly during the day, so jobs run in the dead of night may be cheaper than those run during the day. If we can connect our "campus grid" to the "smart electric grid", we can make the system automatically schedule around these constraints.

Here are some recent articles about Grid Heating:

Friday, May 29, 2009

Dynamic Linking and Distributed Computing Don't Mix

Dynamic linking is one of the more frustrating aspects of distributed computing in the real world. It's is the sort of technology that is meant to optimize the computer's happiness at the expense of the end user's sanity. Dynamic linking should really be avoided, except in a few very specific cases outlined below.

For those of you who don't remember, here is a brief primer on linking:

Back in the good old days, programmers would group commonly used functions (like printf and strlen) into a common module, otherwise known as a library. However, managing the library was difficult. If you simply compiled your library into the program, it would work, but your program would be full of unused code. The alternative was to cut-and-paste the needed routines into your program, but this was time consuming, and led to many copies of the code that were difficult to synchronize. Frustration was the result.

The solution to this is a tool known as a link editor or just linker. A linker looks at a program and a set of libraries, figures out all the pieces that are needed, and then constructs a complete executable program with only the routines that are actually needed. In the example below, suppose that main.o needs to use the functions printf.o and baz.o. The linker figures out that those reside in libc.a and libstrange.a, and puts the whole thing together in prog.exe. This program can be copied to any other machine, and will run correctly. This is now known as static linking.

As machines grew larger, and had ever more programs and libraries installed, someone clever observed an inefficiency. Nearly every program requires printf, so a copy of the printf code was present in nearly every single program, wasting space in both the filesystem and virtual memory. Further, if someone fixed a bug or security flaw in printf, it was necessary to recompile everything.

To address these problems, dynamic linking was invented. In this model, the linker does not copy routines into the executable, it simply makes a note that the program depends upon a certainly library. When the program is actually run, the loader binds the function calls in the program to the shared libraries on disk. Often, the executable program is very small, and simply consists of a few calls to a large number of libraries.

Now enter distributed systems. Suppose that you wish to take a program that you have written on one machine, and run it on another machine? If you have employed static linking, it's easy: you simply copy the program over, and run it. If you have used dynamic linking, it's a real pain: you must identify all of the libraries that the program depends upon, copy them over, set some obscure environment variables, and then run the program.

Ironically, dynamic linking is less efficient than static linking in several ways. First, it actually ends up using more disk space, virtual memory, and network traffic, because you have to copy over the entire libraries, not just the parts that your program needs. (Of course, you can break the dynamic library up into smaller libraries, but then you are just making it harder on the programmer and user to identify the right libraries.) Second, it makes program startup very slow, especially on a distributed filesystem, because the loader must search for every single library in the search path.

For a nice example of how this can make a simple program ridiculously complicated, try the following two commands on Linux: ldd /bin/ls and strace /bin/ls . The former shows the libraries required to run the ls command, and the latter shows the hundreds of system calls needed to just start the program. Of course, a few hundred system calls isn't much by itself, but when you think of hundreds of users sharing a common file server, and ever call to exec() results in this traffic, you can start to see why this might not be a good idea.

So, to sum up:

	Static Linking	Dynamic Linking
On A Single Computer	Easy to use. Wastes space.	Easy to use. Saves space.
In a Distributed System	Easy to use. Saves space.	Hard to use. Wastes space.

My advice? Always use static linking, unless you are 100% sure that every single computer on the planet has the libraries that you need. That means, link dynamically against the standard C and math libraries, maybe against pthreads and X11, and statically against everything else.

Appendix: How to control linking with gcc.

To link everything in your program statically, use the -static flag:
gcc -static main.o -lstrange -lc -lm -o prog.exe

To link some libraries statically and some dynamically, use -Xlinker -Bdynamic and -Xlinker -Bstatic to switch between modes:
gcc main.o -Xlinker -Bstatic -lstrange -Xlinker -Bdynamic -lc -lm -o prog.exe

To see what dynamic libraries your program depends upon, use the ldd command:
% ldd /bin/ls
libc.so.6 => /lib/tls/libc.so.6 (0x00a99000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00cf8000)
/lib/ld-linux.so.2 (0x00a7f000)

Tuesday, April 14, 2009

Distributed Genome Assembly on 1000 Computers

Lately, my research group has been collaborating with Prof. Scott Emrich on several problems in bioinformatics. Our students Chris Moretti and Mike Olson have been building a system for carrying out whole-genome assembly problems on campus grids. They recently got it scaled up to run on nearly 1000 nodes spread across Notre Dame, Purdue, and Wisconsin, making the problem complete in a few hours instead of a few weeks. We are excited to move the system into production use to start working on some real assembly problems.

Here is what the genome assembly problem looks like from a computer science perspective. As you should remember from biology class, your entire genetic makeup is encoded into a long string of DNA, which is a chemical sequence of base pairs that we represent by the letters A, T, C, and G. A sequencing device takes a biological sample, and through some chemical manipulations can extract the DNA and produce your entire string of DNA, which is some 2 billion characters (bases) long:

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

Except that it isn't that simple. The chemical process of extracting the sequence runs out of energy after about 100-1000 characters. depending on the exact process in use. Instead what you end up with is a large set of "reads" which are random substrings from the entire genome. For example, here are three random substrings of the previous string:

1. ATCCTAGCTAGCTACGA

2. AGTCGATCGATCG

3. CGATCGATAATCGATCCTAG

Now, you have to examine all of the reads, and figure out which ones overlap. In principle, you want to compare all of them to each other with the All-Pairs framework, but that would be computationally infeasible. Instead, there are a number of heuristics that can be used to generate candidate pairs, which then can be matched in detail and then assembled. For example, the three reads from before overlap like this:

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

.....................................

AGTCGATCGATCG........................

.......CGATCGATAATCGATCCTAG..........

....................ATCCTAGCTAGCTACGA

There are many wide open questions of exactly what heuristics to use in selecting candidates, performing alignments, and completing the assembly. Our job is to give researchers a modular framework that allows them to try many different kinds of algorithms, using hundreds or thousands of CPUs to complete the job quickly.

We started with the work queue framework from the Wavefront abstraction. An assembly master process reads the candidates and sequences from disk, builds small units of work, and sends them out to worker processes running on various grids. No particular alignment code is baked into the system. Instead, the user provides an alignment program written in whatever language they find convenient. The system moves the executable and the necessary files out to the execution node, and puts it to work.

Here is an example of the system in action on a multi-institutional grid. The X axis shows time, and the various lines show number of tasks running (red), percent complete (blue), and cumulative speedup (green). We started by running a worker on one workstation, then another, then on a 32-node cluster, then on the Notre Dame campus grid, then on Condor pools at Purdue and Wisconsin, growing up to nearly 700 CPUs total. About halfway through, we forced a failure by unplugging the workstation running the master. Upon restarting, the master loaded the completed results, and picked up right where it left off.

I'm looking forward to putting our system into a production mode and attacking some really big problems.

Wednesday, February 25, 2009

On Parallel Programming with Processes

About once a week, a well-meaning person stops by my office to ask a question like this:

I need to run about 1000 simulations that take about an hour each. I can't wait a thousand hours for the results, so I need to parallelize my simulation. So, should I re-write my application using threads, MPI, or something else?

For some reason, they are always disappointed by my response:

Just run multiple copies of your program at once.

The reasoning is very simple. You already have a complete, debugged program. You have multiple processors, and your operating system knows how to use them. Running four processes at once on a four CPU machine will give you four times the number of results in the same amount of time. Your work will be down in 250 hours instead of 1000. In fact, you can take the same sequential program and submit it to a large batch system that can run on 100 different processors at once and complete one hundred simulations in one hour. If you only get 99 hosts, that's ok, you will still get a 99x improvement.

The alternative is almost too awful to contemplate. Those who have written multithreaded or message passing programs knows that it sounds great on the chalkboard, but the reality is much more complicated. Debugging tools are ineffective on parallel programs. Many existing libraries are not thread safe. You have to deal with synchronization problems, and an endless slew of tuning parameters. If you write a message passing program that requires eight hosts, then you need to wait until you have exactly eight hosts available for your exclusive use. It is all too likely that you will spend more time trying to correct the program than you actually will running it.

The funny part is, many people do not like this advice. But... that's not... parallel! Or, if they concede it's parallel, it's merely embarassingly parallel, or even worse, shamefully parallel. (As if using 100 CPUs simultaneously with processes was somehow less parallel than using 8 CPUs with threads.) They were hoping to doing some really difficult, macho programming, but now find there is a simple solution to the problem.

Now, I'm over-simplifying a little bit. There are certainly cases where it makes sense to take an existing program and parallelize it to use multiple processors. There are a few good reasons for doing so. First, if you really need one particular result as soon as possible, then it makes sense to parallelize. For example, if you are predicting tomorrow's weather, you need the result before tomorrow. Second, if your sequential program has fully consumed another resource on the machine, then it may make sense to parallelize. For example, if your simulation uses all available memory on a machine, then you cannot run two copies at once on the same machine. Third, if one program will run for longer than you can realistically keep a computer up without rebooting, then it may make sense to parallelize. However, none of these cases are as common as you might think, and it's usually best to avoid threads or message passing until the necessity has been proven.

A middle ground that we have been exploring in my lab is the use of abstractions to represent large parallel programs. In this approach, the user provides a sequential program that performs the key kernel of computation in which they specialize. Many invocations of the kernel are then combined together to run very large parallel programs with parallelism measured in hundreds of CPUs. You can read more about the BXGrid, Wavefront, All-Pairs, and Classify abstractions.

Saturday, February 21, 2009

Exponential Backoff in Distributed Systems

In response to my previous article, a commenter asked:

Why exponential backoff? To put a finer point on the question, How should I choose the parameters for my exponential backoff algorithm? I think many people choose parameters that back off too much, too fast.

The idea of exponential backoff in distributed systems goes back quite a few years. An early example can be found in the Ethernet network. In its original form, an Ethernet consisted of a single cable connecting all stations on the network. Unlike some other computer networks at the time, it had no direct means of controlling which station could transmit at any time. If one station transmitted while everyone else was silent, then the message would be received by all stations. But, if two (or more) transmitted at once, every station would receive a corrupted message.

Here's an analogy. Imagine a school gymnasium with people lined up along the walls. People have to shout to be heard, and there are multiple conversations going on at once. As you probably know from experience, this can only work if one person speaks at a time. So, each person waits for a quiet moment to speak. Occasionally, two people try to speak simultaneously, and then you have a silly game of each waiting a bit and then trying again until the tie is broken.

That is essentially how Ethernet works. Each party that wants to transmit waits for a quiet moment, and then sends a message. The sender also simultaneously listens to see if it can hear its own message. If the message is corrupted, it means another party transmitted at the same time, so both wait a bit and try again.

The essential question is: How long should each station wait?

It does no good to have each party wait a fixed amount of time -- say, one microsecond -- because then each will try again at the same time, and the situation repeats forever. A better idea is to choose a random time -- say, between one and ten microseconds, which will break the tie in a small number of attempts. However, if many parties are trying to talk at once, the result will still be a chaotic mess of messages, with no-one making any progress.

A more robust solution is for each party to use exponentially increasing delays. For example, delay one microsecond plus a random factor the first time, then two, then four, and so on. This solution works regardless of the number of competing parties, because it tends to thin the traffic out over time until the congestion is eased.

I wrote a paper titled The Ethernet Approach to Grid Computing on this topic a few years back, making the observation that this strategy is needed everywhere in distributed systems. Whenever you talk to a file server, a batch system, a print server, or file your taxes online, failures are possible, so you need to use Ethernet-like strategies. To encourage this, I wrote a simple language called the Fault Tolerant Shell which looks a lot like a conventional shell with exceptions. For example, here is how to reliably submit a Condor job:

try for 1 hour

condor_submit job.file

end

Or, if you have a choice of three different places to fetch a file from:

forany server in X, Y, Z
wget http://$server/myfile
end

Internally, the shell takes care of all of the error detection, retries, and so forth, so that the programmer can concentrate on the essential issues. The end result is that the system becomes much more robust to load bursts. For example, the following graph shows the performance of many clients submitting batch jobs to a queue using three methods: the Ethernet approach, the Aloha approach (an intermediate step), and a simple fixed retry:

As you can see, the fixed approach crashes to zero after about 400 clients, whereas the Ethernet approach continues to maintain a high level of throughput. It is not as high as the performance under low load, but it is relatively stable over a wide range of load.

The disadvantage to using exponential backoff is that it is going to extend the time to recovery after a failure by about a factor of two. Suppose that you are a client talking to a web server which crashes. You wait one second, try again, then two seconds, and so on. If the web server is unavailable for thirty seconds and then recovers, the client will not notice right away, because it will be in the middle of waiting for thirty seconds before trying again. Now, extending a thirty second outage to a sixty second outage is unlikely to cause any real heartache. But, what about extending a thirty minutes to sixty minutes? That could be irate customer territory.

So, you need to balance the needs of your customers against the capacity of your system. If you you want to handle 1000 clients and have a maximum recovery-after-failure time of one second, then you had better make sure that your system can handle 1000 failed requests per second at a sustained rate. That may sound easy, but if each failed request involves a database query, a write to a log file, and an email to an administrator, then you will be quickly overwhelmed.

Now let's answer the original question: How should I pick the backoff parameters?

Let's assume that they delay chosen at any point is based on an initial timeout (T), an exponential factor (F), the number of retries so far (N), a random number (R), and a maximum timeout (M). Then:

delay = MIN( R * T * F ^ N , M )

R should be a random number in the range [1-2], so that its effect is to spread out the load over time, but always more conservative than plain backoff.
T is the initial timeout, and should be set at the outer limits of expected response time for the service. For example, if your service responds in 1ms on average but in 10ms for 99% of requests, then set t=10ms.
F doesn't matter much, so choose 2 as a nice round number. (It's the exponential nature that counts.)
M should be as low as possible to keep your customers happy, but high enough that the system can definitely handle requests from all clients at that sustained rate.

Sunday, February 8, 2009

Fail Fast, Fail Often

A common misconception among programmers is that software should always attempt to hide failures in distributed systems. This idea seems sensible at first, because distributed systems are full of failures of all kinds: machines crash, software fails, and networks go down, just to name a few. If I am writing a function called transfer_file() which copies a file from one place to another, then I should try to connect multiple times and restart failed transfers for about an hour before giving up, right?

It turns out that transparent fault tolerance is exactly the wrong approach in a large, layered software system. Instead, each layer of software should carefully define consistent failure conditions, and then feel free to fail as often as it wants to.

Here is why: If someone else builds an application that calls transfer_file(), the application itself knows a whole lot more about what kind of fault tolerance is needed. It may turn out that the application knows about several file servers, and if one cannot be reached immediately, then another will do just fine. On the other hand, perhaps transfer_file will be used in some batch workload that will run for weeks, so it is vital that the transfer be retried until success.

If you want to build a controllable system, then your building blocks must have very precise failure semantics.. Unfortunately, many system calls have such vague semantics that they are nearly impossible to use correctly in the presence of failures. Consider, for example, the Unix system call connect(), which initiates a TCP connection to a remote host. Here are some possible results from connect():

If the host does not respond to IP traffic, connect() will block for an undetermined amount of time configured by the kernel (anywhere from minutes to hours), and then return ETIMEDOUT.
If a router or switch determines that the host is not routable, then in a few seconds connect() will return with the error EHOSTUNREACH.
If the host is up, but there is no process listening on the port, then connect() will return almost immediately with ECONNREFUSED.

Depending on the precise nature of the failure, the call might return immediately, or it might return after a few hours. And, the distinction between these failure modes hardly matters to the user: in each case, the requested service is simply not available. Imagine trying to build an application that will quickly connect to the first available server, out of three. Yuck.

To get around this, all our software uses an intermediate layer that does a fair amount of work to place consistent failure semantics on system calls. For example, instead of using BSD sockets directly, we have a layer called link with operations like this:

link_connect( address, port, timeout );
link_accept( link, timeout );
link_read( link, buffer, length, timeout );

Inside each of these operations, the library carefully implements the desired failure semantics. If an operation fails quickly, then it is retried (with an exponential backoff) until the timeout as expired. If an operation fails slowly, then it is cleanly aborted when the timeout expires. With these in place, we can build higher level operations that rely on network communication without getting unexpectedly stuck.

Here is an example where precise failure detection really matters. In an earlier post, I wrote about the Wavefront abstraction, which is a distributed computing model with a lot of dependencies. In a Wavefront problem, we must first execute one process in the lower left hand corner. Once that is complete, we can run two adjacent functions, then three, and so on:

If we run a Wavefront on a system of hundreds of processors, then delays are inevitable. What's more, a delay in the computation of any one result slows down the whole system. To avoid this problem, we keep running statistics on the expected computation time of any node, and set timeouts appropriately. If any one computation falls more than a few standard deviations beyond the average, we abort it and try it on another processor. We call this technique "Fast Abort".

Here is the effect of this technique on a large wavefront problem. The X axis shows time, and the Y axis shows the number of tasks currently running. The bottom line shows the reliable technique of waiting and retrying tasks until they succeed. The top line shows what happens with Fast Abort. As you can see, this technique much more rapidly reaches a high degree of parallelism.

The moral of the story is: Make failure conditions an explicit part of your interface. If you make it very clear how and when a call can fail, then it is very easy for applications to implement fault tolerance appropriate to the situation at hand.

Wednesday, January 14, 2009

Audit Trails in Voting Machines

Kim Zetter at Wired magazine recently wrote about the use of log files in electronic voting machines. (It actually shows snippets of the relevant data, which is a refreshing use of primary evidence in journalism.) The article illustrates an often overlooked rule of software engineering:

A DEBUG FILE IS NOT THE SAME THING AS AN AUDIT TRAIL.

Here is my rough guess at what happened: Political forces informed the software company that the voting machines must produce an audit trail. Management instructed the programmers to produce a log file of some kind. The programmers already had some debug log files, so they added a few more printfs, and everyone seemed happy.

As Ms. Zetter explains, election officials attempted to read the audit trail and discovered it was essentially useless. Events were recorded in inconsistent ways. Items were incompletely specified, so the reader couldn't distinguish between deck zero on ballot A and deck zero on ballot B. Data in some messages is just plain wrong. Uncommon but expected events are recorded with scary messages like "Exception!" or "Lost deck!"

The problem is that the programmers created a debug file instead of creating a distinct audit trail. A debug file is a handy tool for recording messages during development. Debug messages are added haphazardly to the code as the programmer works to create and debug tricky bits of code. They are often cryptic or personal, because they are only intended to be read by the person that wrote them. For example, here is a bit of a debug log from the Condor distributed system. Note that ZKM is one of the programmer's names, who put his initials in to make it easy to find his own messages.

1/7 19:42:12 (89102.0) (9319): in pseudo_job_exit: status=0,reason=100
1/7 19:42:12 (89102.0) (9319): rval = 0, errno = 25
1/7 19:42:12 (89102.0) (9319): Shadow: do_REMOTE_syscall returned less than 0
1/7 19:42:12 (89102.0) (9319): ZKM: setting default map to (null)
1/7 19:42:12 (89102.0) (9319): Job 89102.0 terminated: exited with status 0
1/7 19:42:12 (89102.0) (9319): **** condor_shadow (condor_SHADOW) EXITING WITH STATUS 100

An audit trail is something completely different. It communicates to a third party some essential property of the system, such as number of users served, tax forms in progress, or ballots submitted. If you are creating an audit trail, you must add carefully crafted audit statements to the code in strategic places. Someone other than the programmer is going to read it, so it must be clear, concise, and consistent.

For example, the following is an audit log for users logging in and out of a standard Linux machine:

dthain pts/2 Tue Jan 13 20:56 - 21:23 (00:26)
dthain pts/2 Tue Jan 13 12:53 - 16:22 (03:29)
dthain pts/1 Tue Jan 13 12:52 still logged in

To summarize, a debugging file usually:

has an ad-hoc format to facilitate human consumption.
omits normal events and reports unusual or unexpected events.
does not completely describe the state of the system.

but an audit trail must:

have a well defined format that facilitates automatic analysis.
record every essential event in the system, whether normal or abnormal.
completely describe the essential state of the system at any time.

Now, we may interpret the problem with the voting machines in two different ways.

First, there is an language problem. Political forces used the term "audit trail", but at some step in communication, this was corrupted to "log file". Perhaps the programmers observed that they had a debug file, added a few more printfs, and assumed that the requirement was satisfied. You can see how this accident might have been made in good faith.

Second, there is a serious oversight problem. The purpose of an audit trail is to allow a third party to read the output and draw conclusions about the system. If we only discover that the audit trail is useless after the election, we can only conclude that nobody looked it during testing. If the project managers and their political overseers had demanded to see the so-called audit trail during testing, the entire problem would have been avoided.

So, the two morals of the story for computer programmers are:

A debug file is not the same thing as an audit trail.
Always double check that you got exactly what you asked for.

Prof. Douglas Thain