Prof. Douglas Thain: data mining

Friday, October 31, 2008

An Abstraction for Ensemble Classifiers

In the last post, I presented the idea of abstractions for distributed computing, and explained the All-Pairs abstraction, which represents a very large Cartesian product. Of course, a single abstraction is meant to address a very focused kind of workload. If you have a different category of problem, then you need another abstraction.

We discovered another abstraction while working with Nitesh Chawla's data mining group, also at Notre Dame. A common construction in data mining is the ensemble classifer. A single classifier examines the properties of a large number of objects and divides them up into groups that are roughly similar. You may be familiar with algorithms such as K-Nearest-Neighbors or Decision Trees. Improvements upon these algorithms continue to be an active area of research.

For any given classifier, you can often improve the runtime or the accuracy of the classification by breaking the data into pieces, running the classifier on each piece, and then collecting all of the results, using a majority vote to determine the final classification. For very large datasets, you may even need to use multiple processors and disks to complete the job in a reasonable amount of time.

To address this problem, our students Chris Moretti and Karsten Steinhauser created the Classify abstraction:

Classify( T, R, P, N, F ):
T - The testing set of data to classify.
R - The training set used to train each classifier.
P - The method of partitioning the data.
N - The number of classifiers to employ.
F - The classifier function to use.

Here is a schematic of the Classify abstraction:

This abstraction was also implemented on our local Condor pool and Chirp storage cluster, using one CPU and disk for each classifier function. With this implementation, Chris and Karsten were able to evaluate a number of classifier functions on multi-gigabyte datasets, using up to 128 CPUs simultaneously. In a week, they were able to accomplish what might have taken years to organize and execute by hand. You can read more about the technical details in our paper which will be presented at the International Conference on Data Mining.

The observant reader may notice that the Classify abstraction looks a lot like the Map-Reduce abstraction implemented in the Hadoop. In the next post, I'll discuss this similarity and explain the important difference between the two abstractions.

Monday, October 6, 2008

Troubleshooting Distributed Systems via Data Mining

One of our students, David Cieslak, just presented this paper on troubleshooting large distributed systems at the IEEE Grid Computing Conference in Japan. Here's the situation:

You have a million jobs to run.
You submit them to a system of thousands of CPUs.
Half of them complete, and half of them fail.

Now what do you do? It's hopeless to debug any single failure, because you don't know if it represents the most important case. It's a thankless job to dig through all of the log files of the system to see where things went wrong. You might try re-submitting all of the jobs, but chances are that half of those will fail, and you will just be wasting more time and resources pushing them to the finish.

Typically, these sorts of errors come arise from an incompatibility of one kind or another. The many versions of Linux present the most outrageous examples. Perhaps your job assumes that the program wget can be found on any old Linux machine. Oops! Perhaps your program is dynamically linked against SSL version 2.3.6.5.8.2.1, but some machines only have version 2.3.6.5.8.1.9. Oops! Perhaps your program crashes on a machine with more than 2GB of physical memory, because it performs improper pointer arithmetic. Oops!

So, to address this problem, David has constructed a nice tool that reads in some log files, and then diagnoses the properties of machines or jobs associated with failures, using techniques from the field of data mining. (We implemented this on log files from Condor, but you could apply the principle to any similar system.) Of course, the tool cannot diagnose the root cause, but it can help to narrow down the source of the problem.

For example, consider the user running several thousand jobs on our 700 CPU Condor pool. Jobs tended to fail within minutes on certain set of eleven machines. Of course, as soon as those jobs failed, the machines were free to run more jobs, which promptly failed. By applying GASP, we discovered a common property among those machines:

(TotalVirtualMemory < 1048576)

They only had one gigabyte of virtual memory! (Note: The units are KB.) Whenever a program would consume more than that, it was promptly killed by the operating system. This was simply a mistake made in configuration -- our admins fixed the setting, and the problem went away.

Here's another problem we found on the Wisconsin portion of the Open Science Grid. Processing the log data from 100,000 jobs submitted in 2007, we found that most failures were associated with this property:

(FilesystemDomain="hep.wisc.edu")

It turns out that a large number of users submitted jobs assuming that the filesystem they needed would be mounted on all nodes of the grid. Not so! Since this was an historical analysis, we could not repair the system, but it did give us a rapid understanding of an important usability aspect of the system.

If you want to try this out yourself, you can visit our web page for the software.

Prof. Douglas Thain

Friday, October 31, 2008

An Abstraction for Ensemble Classifiers

Monday, October 6, 2008

Troubleshooting Distributed Systems via Data Mining