In the last post, I presented the idea of abstractions for distributed computing, and explained the All-Pairs abstraction, which represents a very large Cartesian product. Of course, a single abstraction is meant to address a very focused kind of workload. If you have a different category of problem, then you need another abstraction.
We discovered another abstraction while working with Nitesh Chawla's data mining group, also at Notre Dame. A common construction in data mining is the ensemble classifer. A single classifier examines the properties of a large number of objects and divides them up into groups that are roughly similar. You may be familiar with algorithms such as K-Nearest-Neighbors or Decision Trees. Improvements upon these algorithms continue to be an active area of research.
For any given classifier, you can often improve the runtime or the accuracy of the classification by breaking the data into pieces, running the classifier on each piece, and then collecting all of the results, using a majority vote to determine the final classification. For very large datasets, you may even need to use multiple processors and disks to complete the job in a reasonable amount of time.
To address this problem, our students Chris Moretti and Karsten Steinhauser created the Classify abstraction:
Classify( T, R, P, N, F ):
T - The testing set of data to classify.
R - The training set used to train each classifier.
P - The method of partitioning the data.
N - The number of classifiers to employ.
F - The classifier function to use.
Here is a schematic of the Classify abstraction:
This abstraction was also implemented on our local Condor pool and Chirp storage cluster, using one CPU and disk for each classifier function. With this implementation, Chris and Karsten were able to evaluate a number of classifier functions on multi-gigabyte datasets, using up to 128 CPUs simultaneously. In a week, they were able to accomplish what might have taken years to organize and execute by hand. You can read more about the technical details in our paper which will be presented at the International Conference on Data Mining.
The observant reader may notice that the Classify abstraction looks a lot like the Map-Reduce abstraction implemented in the Hadoop. In the next post, I'll discuss this similarity and explain the important difference between the two abstractions.