Friday, July 3, 2009

Make as an Abstraction for Distributed Computing

In previous articles, I have introduced the idea of abstractions for distributed computing. An abstraction is a way of specifying a large amount of work in a way that makes it possible to be distributed across a large computing system. All of the abstractions I have discussed so far have a compact, regular structure.


However, many people have large workloads that do not have a regular structure. They may have one program that generates three output files, each consumed by another program, and then joined back together. You can think of these workloads as a directed graph of processes and files, like the figure to the right.

If each of these programs may run for a long time, then you need a workflow engine that will keep track of the state of the graph, submit the jobs for execution, and deal with failures. There exist a number of workflow engines today, but without naming names, they aren't exactly easy to use. The workflow designer has to write a whole lot of batch scripts, or XML, or learn a rather complicated language to put it all together.

We recently wondered if there was a simpler way to accomplish this, A good workflow language should make it easy to do simple things, and at least obvious (if not easy) how to specify complex things. For implementation reasons, a workflow language needs to clearly state the data needs of an application: if we know in advance that program A needs file X, then we can deliver it efficiently before A begins executing. If possible, it shouldn't require the user to know anything about the underlying batch or grid systems.

After scratching our heads for a while, we finally came to the conclusion that good old Make is an attractive worfklow language. It is very compact, it states data dependencies nicely, and lots of people already know it. So, we have built a workflow system called Makeflow, which takes Makefiles, and runs them on parallel and distributed systems. Using Makeflow, you can take a very large workflow and run it on your local workstation, a single 32-core server, or a 1000-node Condor pool.

What makes Makeflow different from previous distributed makes is that is does not rely on a distributed file system. Instead, it uses the dependency information already present in the Makefile to send data to remote jobs. For example, if you have a rule like this:

output.data final.state : input.data mysim.exe
./mysim.exe -temp 325 input.data


then Makeflow will ensure that the input files input.data and mysim.exe are placed at the worker node before running mysim.exe. Afterwards, Makeflow brings the output files back to the initiator.

Because of this property, you don't need a data center in order to run a Makeflow. We provide a simple process called worker that you can run on your desktop, your laptop, or any other old computers you have lying around. The workers call home to Makeflow, which coordinates the execution on whatever machines you have available.


You can download and try out Makeflow yourself from the CCL web site.

Tuesday, June 2, 2009

Grid Heating: Putting Data Center Heat to Productive Use

Dr. Paul Brenner, a research scientist in the Computing Research Center at the University Notre Dame, has been advocating a novel idea called grid heating. He recently won a "Green IT Award" from the Uptime Institute for his work. Here is a short introduction to the idea:

Around the world, large data centers consume enormous amounts of power. In addition to the energy needed to spin disks and rearrange electrons, an approximately equal amount of power is needed to run the air conditioners and fans to remove that heat from the data center. In this sense, data centers are doubly inefficient, because they are using power to both heat and cool the same space. If we could put that heat to productive use, then we could save energy on cooling the data center, as well as save energy that would have otherwise been used to generate heat.


Last year. Dr. Brenner constructed a prototype of this idea at the city greenhouse in South Bend, which was struggling with enormous heating bills during the winter. He constructed a small cluster, and placed it in the Arizona Desert display in the greenhouse, where the plants need the highest temperature. Notre Dame paid the electricity bill, the greenhouse got the benefit of the heat, and the computers simply joined our campus Condor pool. Everybody wins, and nobody has to pay an air conditioning bill.



However, the first cluster was just a prototype, and couldn't generate nearly enough heat for the entire greenhouse. So, this year, Dr. Brenner is building a small data center in a modular shipping container. next to the greenhouse. With a new electricity and network hookup, the data center will run several hundred CPUs, and function as a secondary furnace for the facility, hopefully reducing the heating bill by half over the winter.



The new facility will significantly add to our campus grid, and will also give us some interesting scheduling problems to work on. The greenhouse needs heat the most during the winter, and to a lesser extent during the summer, so the computing capacity of the system will change with the seasons. Further, the price of electricity varies significantly during the day, so jobs run in the dead of night may be cheaper than those run during the day. If we can connect our "campus grid" to the "smart electric grid", we can make the system automatically schedule around these constraints.

Here are some recent articles about Grid Heating:

Friday, May 29, 2009

Dynamic Linking and Distributed Computing Don't Mix



Dynamic linking is one of the more frustrating aspects of distributed computing in the real world. It's is the sort of technology that is meant to optimize the computer's happiness at the expense of the end user's sanity. Dynamic linking should really be avoided, except in a few very specific cases outlined below.

For those of you who don't remember, here is a brief primer on linking:

Back in the good old days, programmers would group commonly used functions (like printf and strlen) into a common module, otherwise known as a library. However, managing the library was difficult. If you simply compiled your library into the program, it would work, but your program would be full of unused code. The alternative was to cut-and-paste the needed routines into your program, but this was time consuming, and led to many copies of the code that were difficult to synchronize. Frustration was the result.

The solution to this is a tool known as a link editor or just linker. A linker looks at a program and a set of libraries, figures out all the pieces that are needed, and then constructs a complete executable program with only the routines that are actually needed. In the example below, suppose that main.o needs to use the functions printf.o and baz.o. The linker figures out that those reside in libc.a and libstrange.a, and puts the whole thing together in prog.exe. This program can be copied to any other machine, and will run correctly. This is now known as static linking.

As machines grew larger, and had ever more programs and libraries installed, someone clever observed an inefficiency. Nearly every program requires printf, so a copy of the printf code was present in nearly every single program, wasting space in both the filesystem and virtual memory. Further, if someone fixed a bug or security flaw in printf, it was necessary to recompile everything.

To address these problems, dynamic linking was invented. In this model, the linker does not copy routines into the executable, it simply makes a note that the program depends upon a certainly library. When the program is actually run, the loader binds the function calls in the program to the shared libraries on disk. Often, the executable program is very small, and simply consists of a few calls to a large number of libraries.


Now enter distributed systems. Suppose that you wish to take a program that you have written on one machine, and run it on another machine? If you have employed static linking, it's easy: you simply copy the program over, and run it. If you have used dynamic linking, it's a real pain: you must identify all of the libraries that the program depends upon, copy them over, set some obscure environment variables, and then run the program.

Ironically, dynamic linking is less efficient than static linking in several ways. First, it actually ends up using more disk space, virtual memory, and network traffic, because you have to copy over the entire libraries, not just the parts that your program needs. (Of course, you can break the dynamic library up into smaller libraries, but then you are just making it harder on the programmer and user to identify the right libraries.) Second, it makes program startup very slow, especially on a distributed filesystem, because the loader must search for every single library in the search path.

For a nice example of how this can make a simple program ridiculously complicated, try the following two commands on Linux: ldd /bin/ls and strace /bin/ls . The former shows the libraries required to run the ls command, and the latter shows the hundreds of system calls needed to just start the program. Of course, a few hundred system calls isn't much by itself, but when you think of hundreds of users sharing a common file server, and ever call to exec() results in this traffic, you can start to see why this might not be a good idea.

So, to sum up:

Static LinkingDynamic Linking
On A Single
Computer
Easy to use.
Wastes space.
Easy to use.
Saves space.
In a Distributed
System

Easy to use.
Saves space.

Hard to use.
Wastes space.


My advice? Always use static linking, unless you are 100% sure that every single computer on the planet has the libraries that you need. That means, link dynamically against the standard C and math libraries, maybe against pthreads and X11, and statically against everything else.



Appendix: How to control linking with gcc.


To link everything in your program statically, use the -static flag:
gcc -static main.o -lstrange -lc -lm -o prog.exe

To link some libraries statically and some dynamically, use -Xlinker -Bdynamic and -Xlinker -Bstatic to switch between modes:
gcc main.o -Xlinker -Bstatic -lstrange -Xlinker -Bdynamic -lc -lm -o prog.exe

To see what dynamic libraries your program depends upon, use the ldd command:
% ldd /bin/ls
libc.so.6 => /lib/tls/libc.so.6 (0x00a99000)
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00cf8000)
/lib/ld-linux.so.2 (0x00a7f000)

Tuesday, April 14, 2009

Distributed Genome Assembly on 1000 Computers

Lately, my research group has been collaborating with Prof. Scott Emrich on several problems in bioinformatics. Our students Chris Moretti and Mike Olson have been building a system for carrying out whole-genome assembly problems on campus grids. They recently got it scaled up to run on nearly 1000 nodes spread across Notre Dame, Purdue, and Wisconsin, making the problem complete in a few hours instead of a few weeks. We are excited to move the system into production use to start working on some real assembly problems.

Here is what the genome assembly problem looks like from a computer science perspective. As you should remember from biology class, your entire genetic makeup is encoded into a long string of DNA, which is a chemical sequence of base pairs that we represent by the letters A, T, C, and G. A sequencing device takes a biological sample, and through some chemical manipulations can extract the DNA and produce your entire string of DNA, which is some 2 billion characters (bases) long:

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

Except that it isn't that simple. The chemical process of extracting the sequence runs out of energy after about 100-1000 characters. depending on the exact process in use. Instead what you end up with is a large set of "reads" which are random substrings from the entire genome. For example, here are three random substrings of the previous string:

1. ATCCTAGCTAGCTACGA


2. AGTCGATCGATCG

3. CGATCGATAATCGATCCTAG

Now, you have to examine all of the reads, and figure out which ones overlap. In principle, you want to compare all of them to each other with the All-Pairs framework, but that would be computationally infeasible. Instead, there are a number of heuristics that can be used to generate candidate pairs, which then can be matched in detail and then assembled. For example, the three reads from before overlap like this:

AGTCGATCGATCGATAATCGATCCTAGCTAGCTACGA

.....................................

AGTCGATCGATCG........................

.......CGATCGATAATCGATCCTAG..........

....................ATCCTAGCTAGCTACGA

There are many wide open questions of exactly what heuristics to use in selecting candidates, performing alignments, and completing the assembly. Our job is to give researchers a modular framework that allows them to try many different kinds of algorithms, using hundreds or thousands of CPUs to complete the job quickly.

We started with the work queue framework from the Wavefront abstraction. An assembly master process reads the candidates and sequences from disk, builds small units of work, and sends them out to worker processes running on various grids. No particular alignment code is baked into the system. Instead, the user provides an alignment program written in whatever language they find convenient. The system moves the executable and the necessary files out to the execution node, and puts it to work.

Here is an example of the system in action on a multi-institutional grid. The X axis shows time, and the various lines show number of tasks running (red), percent complete (blue), and cumulative speedup (green). We started by running a worker on one workstation, then another, then on a 32-node cluster, then on the Notre Dame campus grid, then on Condor pools at Purdue and Wisconsin, growing up to nearly 700 CPUs total. About halfway through, we forced a failure by unplugging the workstation running the master. Upon restarting, the master loaded the completed results, and picked up right where it left off.



I'm looking forward to putting our system into a production mode and attacking some really big problems.

Wednesday, February 25, 2009

On Parallel Programming with Processes

About once a week, a well-meaning person stops by my office to ask a question like this:

I need to run about 1000 simulations that take about an hour each. I can't wait a thousand hours for the results, so I need to parallelize my simulation. So, should I re-write my application using threads, MPI, or something else?


For some reason, they are always disappointed by my response:

Just run multiple copies of your program at once.

The reasoning is very simple. You already have a complete, debugged program. You have multiple processors, and your operating system knows how to use them. Running four processes at once on a four CPU machine will give you four times the number of results in the same amount of time. Your work will be down in 250 hours instead of 1000. In fact, you can take the same sequential program and submit it to a large batch system that can run on 100 different processors at once and complete one hundred simulations in one hour. If you only get 99 hosts, that's ok, you will still get a 99x improvement.

The alternative is almost too awful to contemplate. Those who have written multithreaded or message passing programs knows that it sounds great on the chalkboard, but the reality is much more complicated. Debugging tools are ineffective on parallel programs. Many existing libraries are not thread safe. You have to deal with synchronization problems, and an endless slew of tuning parameters. If you write a message passing program that requires eight hosts, then you need to wait until you have exactly eight hosts available for your exclusive use. It is all too likely that you will spend more time trying to correct the program than you actually will running it.

The funny part is, many people do not like this advice. But... that's not... parallel! Or, if they concede it's parallel, it's merely embarassingly parallel, or even worse, shamefully parallel. (As if using 100 CPUs simultaneously with processes was somehow less parallel than using 8 CPUs with threads.) They were hoping to doing some really difficult, macho programming, but now find there is a simple solution to the problem.

Now, I'm over-simplifying a little bit. There are certainly cases where it makes sense to take an existing program and parallelize it to use multiple processors. There are a few good reasons for doing so. First, if you really need one particular result as soon as possible, then it makes sense to parallelize. For example, if you are predicting tomorrow's weather, you need the result before tomorrow. Second, if your sequential program has fully consumed another resource on the machine, then it may make sense to parallelize. For example, if your simulation uses all available memory on a machine, then you cannot run two copies at once on the same machine. Third, if one program will run for longer than you can realistically keep a computer up without rebooting, then it may make sense to parallelize. However, none of these cases are as common as you might think, and it's usually best to avoid threads or message passing until the necessity has been proven.

A middle ground that we have been exploring in my lab is the use of abstractions to represent large parallel programs. In this approach, the user provides a sequential program that performs the key kernel of computation in which they specialize. Many invocations of the kernel are then combined together to run very large parallel programs with parallelism measured in hundreds of CPUs. You can read more about the BXGrid, Wavefront, All-Pairs, and Classify abstractions.

Saturday, February 21, 2009

Exponential Backoff in Distributed Systems

In response to my previous article, a commenter asked:

Why exponential backoff? To put a finer point on the question, How should I choose the parameters for my exponential backoff algorithm? I think many people choose parameters that back off too much, too fast.

The idea of exponential backoff in distributed systems goes back quite a few years. An early example can be found in the Ethernet network. In its original form, an Ethernet consisted of a single cable connecting all stations on the network. Unlike some other computer networks at the time, it had no direct means of controlling which station could transmit at any time. If one station transmitted while everyone else was silent, then the message would be received by all stations. But, if two (or more) transmitted at once, every station would receive a corrupted message.

Here's an analogy. Imagine a school gymnasium with people lined up along the walls. People have to shout to be heard, and there are multiple conversations going on at once. As you probably know from experience, this can only work if one person speaks at a time. So, each person waits for a quiet moment to speak. Occasionally, two people try to speak simultaneously, and then you have a silly game of each waiting a bit and then trying again until the tie is broken.

That is essentially how Ethernet works. Each party that wants to transmit waits for a quiet moment, and then sends a message. The sender also simultaneously listens to see if it can hear its own message. If the message is corrupted, it means another party transmitted at the same time, so both wait a bit and try again.

The essential question is: How long should each station wait?

It does no good to have each party wait a fixed amount of time -- say, one microsecond -- because then each will try again at the same time, and the situation repeats forever. A better idea is to choose a random time -- say, between one and ten microseconds, which will break the tie in a small number of attempts. However, if many parties are trying to talk at once, the result will still be a chaotic mess of messages, with no-one making any progress.

A more robust solution is for each party to use exponentially increasing delays. For example, delay one microsecond plus a random factor the first time, then two, then four, and so on. This solution works regardless of the number of competing parties, because it tends to thin the traffic out over time until the congestion is eased.

I wrote a paper titled The Ethernet Approach to Grid Computing on this topic a few years back, making the observation that this strategy is needed everywhere in distributed systems. Whenever you talk to a file server, a batch system, a print server, or file your taxes online, failures are possible, so you need to use Ethernet-like strategies. To encourage this, I wrote a simple language called the Fault Tolerant Shell which looks a lot like a conventional shell with exceptions. For example, here is how to reliably submit a Condor job:

try for 1 hour
condor_submit job.file
end

Or, if you have a choice of three different places to fetch a file from:

forany server in X, Y, Z
wget http://$server/myfile
end

Internally, the shell takes care of all of the error detection, retries, and so forth, so that the programmer can concentrate on the essential issues. The end result is that the system becomes much more robust to load bursts. For example, the following graph shows the performance of many clients submitting batch jobs to a queue using three methods: the Ethernet approach, the Aloha approach (an intermediate step), and a simple fixed retry:


As you can see, the fixed approach crashes to zero after about 400 clients, whereas the Ethernet approach continues to maintain a high level of throughput. It is not as high as the performance under low load, but it is relatively stable over a wide range of load.

The disadvantage to using exponential backoff is that it is going to extend the time to recovery after a failure by about a factor of two. Suppose that you are a client talking to a web server which crashes. You wait one second, try again, then two seconds, and so on. If the web server is unavailable for thirty seconds and then recovers, the client will not notice right away, because it will be in the middle of waiting for thirty seconds before trying again. Now, extending a thirty second outage to a sixty second outage is unlikely to cause any real heartache. But, what about extending a thirty minutes to sixty minutes? That could be irate customer territory.

So, you need to balance the needs of your customers against the capacity of your system. If you you want to handle 1000 clients and have a maximum recovery-after-failure time of one second, then you had better make sure that your system can handle 1000 failed requests per second at a sustained rate. That may sound easy, but if each failed request involves a database query, a write to a log file, and an email to an administrator, then you will be quickly overwhelmed.

Now let's answer the original question: How should I pick the backoff parameters?

Let's assume that they delay chosen at any point is based on an initial timeout (T), an exponential factor (F), the number of retries so far (N), a random number (R), and a maximum timeout (M). Then:

delay = MIN( R * T * F ^ N , M )

  • R should be a random number in the range [1-2], so that its effect is to spread out the load over time, but always more conservative than plain backoff.
  • T is the initial timeout, and should be set at the outer limits of expected response time for the service. For example, if your service responds in 1ms on average but in 10ms for 99% of requests, then set t=10ms.
  • F doesn't matter much, so choose 2 as a nice round number. (It's the exponential nature that counts.)
  • M should be as low as possible to keep your customers happy, but high enough that the system can definitely handle requests from all clients at that sustained rate.

Sunday, February 8, 2009

Fail Fast, Fail Often

A common misconception among programmers is that software should always attempt to hide failures in distributed systems. This idea seems sensible at first, because distributed systems are full of failures of all kinds: machines crash, software fails, and networks go down, just to name a few. If I am writing a function called transfer_file() which copies a file from one place to another, then I should try to connect multiple times and restart failed transfers for about an hour before giving up, right?

It turns out that transparent fault tolerance is exactly the wrong approach in a large, layered software system. Instead, each layer of software should carefully define consistent failure conditions, and then feel free to fail as often as it wants to.

Here is why: If someone else builds an application that calls transfer_file(), the application itself knows a whole lot more about what kind of fault tolerance is needed. It may turn out that the application knows about several file servers, and if one cannot be reached immediately, then another will do just fine. On the other hand, perhaps transfer_file will be used in some batch workload that will run for weeks, so it is vital that the transfer be retried until success.

If you want to build a controllable system, then your building blocks must have very precise failure semantics.. Unfortunately, many system calls have such vague semantics that they are nearly impossible to use correctly in the presence of failures. Consider, for example, the Unix system call connect(), which initiates a TCP connection to a remote host. Here are some possible results from connect():
  1. If the host does not respond to IP traffic, connect() will block for an undetermined amount of time configured by the kernel (anywhere from minutes to hours), and then return ETIMEDOUT.
  2. If a router or switch determines that the host is not routable, then in a few seconds connect() will return with the error EHOSTUNREACH.
  3. If the host is up, but there is no process listening on the port, then connect() will return almost immediately with ECONNREFUSED.

Depending on the precise nature of the failure, the call might return immediately, or it might return after a few hours. And, the distinction between these failure modes hardly matters to the user: in each case, the requested service is simply not available. Imagine trying to build an application that will quickly connect to the first available server, out of three. Yuck.

To get around this, all our software uses an intermediate layer that does a fair amount of work to place consistent failure semantics on system calls. For example, instead of using BSD sockets directly, we have a layer called link with operations like this:

  • link_connect( address, port, timeout );
  • link_accept( link, timeout );
  • link_read( link, buffer, length, timeout );

Inside each of these operations, the library carefully implements the desired failure semantics. If an operation fails quickly, then it is retried (with an exponential backoff) until the timeout as expired. If an operation fails slowly, then it is cleanly aborted when the timeout expires. With these in place, we can build higher level operations that rely on network communication without getting unexpectedly stuck.

Here is an example where precise failure detection really matters. In an earlier post, I wrote about the Wavefront abstraction, which is a distributed computing model with a lot of dependencies. In a Wavefront problem, we must first execute one process in the lower left hand corner. Once that is complete, we can run two adjacent functions, then three, and so on:


If we run a Wavefront on a system of hundreds of processors, then delays are inevitable. What's more, a delay in the computation of any one result slows down the whole system. To avoid this problem, we keep running statistics on the expected computation time of any node, and set timeouts appropriately. If any one computation falls more than a few standard deviations beyond the average, we abort it and try it on another processor. We call this technique "Fast Abort".

Here is the effect of this technique on a large wavefront problem. The X axis shows time, and the Y axis shows the number of tasks currently running. The bottom line shows the reliable technique of waiting and retrying tasks until they succeed. The top line shows what happens with Fast Abort. As you can see, this technique much more rapidly reaches a high degree of parallelism.

The moral of the story is: Make failure conditions an explicit part of your interface. If you make it very clear how and when a call can fail, then it is very easy for applications to implement fault tolerance appropriate to the situation at hand.