Friday, October 31, 2008

An Abstraction for Ensemble Classifiers

In the last post, I presented the idea of abstractions for distributed computing, and explained the All-Pairs abstraction, which represents a very large Cartesian product. Of course, a single abstraction is meant to address a very focused kind of workload. If you have a different category of problem, then you need another abstraction.

We discovered another abstraction while working with Nitesh Chawla's data mining group, also at Notre Dame. A common construction in data mining is the ensemble classifer. A single classifier examines the properties of a large number of objects and divides them up into groups that are roughly similar. You may be familiar with algorithms such as K-Nearest-Neighbors or Decision Trees. Improvements upon these algorithms continue to be an active area of research.

For any given classifier, you can often improve the runtime or the accuracy of the classification by breaking the data into pieces, running the classifier on each piece, and then collecting all of the results, using a majority vote to determine the final classification. For very large datasets, you may even need to use multiple processors and disks to complete the job in a reasonable amount of time.

To address this problem, our students Chris Moretti and Karsten Steinhauser created the Classify abstraction:

Classify( T, R, P, N, F ):
T - The testing set of data to classify.
R - The training set used to train each classifier.
P - The method of partitioning the data.
N - The number of classifiers to employ.
F - The classifier function to use.

Here is a schematic of the Classify abstraction:

This abstraction was also implemented on our local Condor pool and Chirp storage cluster, using one CPU and disk for each classifier function. With this implementation, Chris and Karsten were able to evaluate a number of classifier functions on multi-gigabyte datasets, using up to 128 CPUs simultaneously. In a week, they were able to accomplish what might have taken years to organize and execute by hand. You can read more about the technical details in our paper which will be presented at the International Conference on Data Mining.

The observant reader may notice that the Classify abstraction looks a lot like the Map-Reduce abstraction implemented in the Hadoop. In the next post, I'll discuss this similarity and explain the important difference between the two abstractions.

Wednesday, October 22, 2008

Abstractions for Distributed Computing

My current research revolves around the idea of abstractions for distributed computing. An abstraction is a way of simplifying a workload that runs on thousands of machines, in much the same way that a high level language simplifies the tiresome process of programming in assembly language. Let me explain a little more.

Real assembly language has operations like this:

  • MOV memory to register
  • PUSH register to stack
  • CALL procedure

As you may know, programming in assembly language stinks. The programmer has to keep track of the limited number of registers in use, the current state of the stack, and the meaning of external memory locations. If that's not enough, you have to worry about the behavior of instructions that have wildly varying runtimes, unusual exceptions, and sometimes asynchronous behavior.

We find much the same story in distributed computing, where the operations are something like:

  • TransferFile( source, destination );
  • ExecuteJob( executable, input, output );
  • AllocateVM( cpu, mem, disk, time );

If this is your instruction set, then you have many of the same problems. You have to manage a limited amount of local and remote storage, carefully cleaning up when jobs complete. If that's not enough, you have to worry about the behavior of instructions that have wildly varying runtimes, unusual exceptions, and sometimes asynchronous behavior. Sound familiar?

So, we advocate that users should employ high level abstractions that hide many of these ugly details, allowing the user to focus on the big picture. An abstraction expresses a pattern of computation that is simple, but may be very large, and thus requires significant effort to implement correctly and efficiently.

Chris Moretti is working on one abstraction called AllPairs. This abstraction crops up in a variety of domains, including biology, bioinformatics, and data mining, to name a few. It is easily stated:

AllPairs( set A, set B, function F ) returns matrix M
where M[i,j] = F( A[i],B[j] );

Of course, AllPairs is easy to execute on a small problem: just write a nested loop. But, what if sets A and B have 10,000 elements of 1MB each, and the function F takes ten seconds to run? You would be looking at 100TB of I/O and 1157 days of computation.

Instead, you need to harness a distributed system such as Condor to split up the computation across hundreds of machines. But you can't just submit 100,000,000 jobs: the queueing system would fall over. You can't just blindly have each job access the data over the network: the file server would file over. Without some careful tuning, your task will run even more slowly than the sequential version. The larger the problem gets, the more you have to worry about.

Fortunately, if your program fits in the All-Pairs abstraction, then we have solved the problem for you. If your "sets" are a bunch of files in a directory and your "function" is a Unix program that compares two objects, then you can simply invoke this at the command line:

allpairs A B Func

In the background, the All-Pairs implementation will measure the size of the objects, test the runtime of the function, choose the resources to use, distribute the data to the nodes, deal with failures, and then clean up the system. The user only has to worry about the problem to solve, not the method of achieving it.

In short, by using an abstraction, we can guide our users to appropriate ways of solving problems, and get them results in a reasonable time. You can read more about this in our paper on All-Pairs presented at IPDPS in the spring.

In future posts, I'll elaborate on other abstractions that we are designing and implementing.

Monday, October 6, 2008

Troubleshooting Distributed Systems via Data Mining

One of our students, David Cieslak, just presented this paper on troubleshooting large distributed systems at the IEEE Grid Computing Conference in Japan. Here's the situation:
  • You have a million jobs to run.
  • You submit them to a system of thousands of CPUs.
  • Half of them complete, and half of them fail.
Now what do you do? It's hopeless to debug any single failure, because you don't know if it represents the most important case. It's a thankless job to dig through all of the log files of the system to see where things went wrong. You might try re-submitting all of the jobs, but chances are that half of those will fail, and you will just be wasting more time and resources pushing them to the finish.

Typically, these sorts of errors come arise from an incompatibility of one kind or another. The many versions of Linux present the most outrageous examples. Perhaps your job assumes that the program wget can be found on any old Linux machine. Oops! Perhaps your program is dynamically linked against SSL version, but some machines only have version Oops! Perhaps your program crashes on a machine with more than 2GB of physical memory, because it performs improper pointer arithmetic. Oops!

So, to address this problem, David has constructed a nice tool that reads in some log files, and then diagnoses the properties of machines or jobs associated with failures, using techniques from the field of data mining. (We implemented this on log files from Condor, but you could apply the principle to any similar system.) Of course, the tool cannot diagnose the root cause, but it can help to narrow down the source of the problem.

For example, consider the user running several thousand jobs on our 700 CPU Condor pool. Jobs tended to fail within minutes on certain set of eleven machines. Of course, as soon as those jobs failed, the machines were free to run more jobs, which promptly failed. By applying GASP, we discovered a common property among those machines:

(TotalVirtualMemory < 1048576)

They only had one gigabyte of virtual memory! (Note: The units are KB.) Whenever a program would consume more than that, it was promptly killed by the operating system. This was simply a mistake made in configuration -- our admins fixed the setting, and the problem went away.

Here's another problem we found on the Wisconsin portion of the Open Science Grid. Processing the log data from 100,000 jobs submitted in 2007, we found that most failures were associated with this property:


It turns out that a large number of users submitted jobs assuming that the filesystem they needed would be mounted on all nodes of the grid. Not so! Since this was an historical analysis, we could not repair the system, but it did give us a rapid understanding of an important usability aspect of the system.

If you want to try this out yourself, you can visit our web page for the software.

Wednesday, October 1, 2008

Clusters, Grids, and Clouds:
It's Turtles All the Way Down

In this blog, I'll discuss open problems and new developments in the field of distributed systems.

A distributed system is a set of independent computers that accomplish a meaningful task by working together. The field covers a huge range of interesting systems, from sensor networks to multi-tier web server farms. For the most part, I will be writing about the sort of large computing systems used to attack very large problems generated by big science and big business.

These systems go by many different names that mean very similar things:

  • A cluster is a distributed system that consists of a number of identical machines owned by a single entity, usually stacked up in a closet or a machine room. Clusters became the most common form of high performance computing in the 1990s and are the type of system now dominating the Top 500 List of supercomputers.
  • A grid is a distributed system that enables people to access computing resources from different institutions over the wide area. The term grid computing was coined by Ian Foster and Carl Kesselman in the late 1990s to describe easy access to large scale computational power. Examples of grids include TeraGrid and the Open Science Grid.
  • A cloud is a distributed system where the user doesn't care exactly what resources are used to carry out an operation; this is virtualization in the most abstract sense. There exist commercial clouds such as Amazon EC2, as well proprietary clouds found in nearly any industrial scale web site.

Although people love to argue about these terms, the differences aren't terribly important. In fact, the terms often represent different facets of the same systems. When you submit a job to a cluster, you are treating it like a cloud, because you don't care on exactly which CPU the job runs. A grid is usually built up from multiple clusters and connected together by wide area networks and software. A cloud is usually contained in a single data center. If you need to access it remotely, then you need a grid.

If you want to really want to mix it up, read about Ed Walker's MyCluster, which is a system that uses a grid to allocate a personal cluster, which you can then use as a cloud!

Regardless of what we call these systems, the challenges are the same. How do we design the interactions between pieces so that the system is robust to failures, has acceptable performance, and protects the interests of all of the parties involved? Future posts in this column will focus on these fundamental problems.

So that you know where I am coming from, I am a professor at the University of Notre Dame, where I teach classes in distributed systems, operating systems, and compilers. I direct the Cooperative Computing Lab where a great group of students does the hard work to design and test new distributed systems. We develop solutions to problems in physics, chemistry, biometrics, and other fields, and them evaluate them on our shared testbed of 700 CPUs. I'll be writing more about these ideas in this column.