Prof. Douglas Thain: allpairs

Showing posts with label allpairs. Show all posts

Monday, November 1, 2010

Compiling Workflows with Weaver

Over the last year, our Makeflow system has become quite popular here at Notre Dame. Briefly, Makeflow takes a workload expressed in the plain old Make format, and executes it in a distributed system, using the dependency information to set up the appropriate remote execution environment. It does not require a distributed filesystem, so it's easy to get your applications going on hundreds to thousands of processors from the cloud or the grid. Makeflow is currently the back-end engine for our science portals in bioinformatics, biometrics, and molecular dynamics.

It didn't take long before our users started writing scripts in Perl or Python in order to generate Makeflows with tens of thousands of nodes. Those scripts all did similar things (query a database, break a dataset into N pieces) but also started to get unruly and difficult to debug. It wasn't easy to look at a script generator and determine what it was trying to accomplish.

Enter Weaver, which is the creation of Peter Bui, one of our graduate students. Weaver is a high level Python framework that, in a few simple lines, can generate enormous (optimized) Makeflows. Peter presented a paper about Weaver at the workshop on Challenges of Large Applications in Distributed Environments at HPDC earlier this year.

Consider this common biometrics workload: extract all of the blue irises from our repository, then convert each iris into a 'template' data type, then compare all of them to each other. Here is how you do it in Weaver:


b = SQLDataSet (’db’,’biometrics','irises')
nefs = Query(db,db.color='Blue')

conv = SimpleFunction('convertiristotemplate',outsuffix='bit')
bits = Map(conv,nefs)

cmp = SimpleFunction('compareiristemplates')
AllPairs(cmp,bits,bits,output='matrix.txt')

In the simplest case, Weaver just emits one gigantic Makeflow that performs all of the operations. However, sometimes there are patterns that can be executed more efficiently, given some better underlying tool. AllPairs is the perfect example of this optimization -- you can do an AllPairs using Makeflow, but it won't be as efficient as our native implementation. If the various data types line up appropriately, Weaver will simply call the All-Pairs abstraction. If not, it will expand the call into Makeflow in the appropriate way.

In principle, this is a lot like a C compiler: under certain conditions, the addition of two arrays can be accomplished with a vector add instruction, otherwise it must be expanded into a larger number of conventional instructions. So, we think of Weaver as a compiler for workflows: it chooses the best implementation available to execute a complex program, leaving the programmer to worry about the higher level objectives.

Monday, August 3, 2009

REU Project: BXGrid

This post continues last week's subject of summer REU projects.

Rachel Witty and Kameron Srimoungchanh worked on BXGrid, our web portal and computing system for biometrics research. This project is a collaboration between the Cooperative Computing Lab and the Computer Vision Research Lab at Notre Dame. Hoang Bui is the lead graduate student on the project. Rachel and Kameron added a bunch of new capabilities to the system; I'll show three examples today.

The first is the ability to handle 3-D face scans taken by a specialized camera equipped with a laser rangefinder. The still picture here doesn't quite do it justice, because each white "mask" on the left is a rotating animation of the face. By integrating this data into BXGrid, the 3-D data can be validated against previous ordinary images of the face.

I previously discussed All-Pairs problems, which are common in biometrics. While we already had the ability to run very large All-Pairs, problems, we never had the capability to view the results easily. Now, with the click of a button, you can set up a small All-Pairs problem and view the results on the portal:

Currently, new data ingested into the system is validated manually by people who must visually check that a eye, face, or whatever matches existing data in the system. Although this can be divided up among a large team of people, it is still time consuming and error prone.

Kameron and Rachel built a system that does a first pass at this task automatically. Using Makeflow, they set up a system to export all newly ingested images along with five good images that should match. This results in thousands of jobs sent to our Condor pool, which transform and compare the images. When all the results come back, you get a nice web page that summarizes the images and the results:

This research was supported in part by the National Science Foundation via grant NSF-CCF-0621434.

Monday, December 29, 2008

BXGrid: The Biometrics Research Grid

One of our graduate students, Hoang Bui, presented a poster on the Biometrics Research Grid (BXGrid) at the IEEE e-Science conference in Indianapolis a few weeks ago. BXGrid is a large data repository that we have built to support both production research in biometrics and to provide a platform for research in large scale data intensive computing. It provides another example of the idea of abstractions for distributed computing. This project is a collaboration between Dr. Patrick Flynn and my research group at Notre Dame.

The Computer Vision Research Lab studies methods for identifying people via biometrics such as fingerprints, iris scans, and surveillance videos. The group collects hundreds of thousands of images and movies from hundreds of volunteers on campus, and uses them to test clever new identification algorithms. For example, here is an atlas of photos from one particular subject (our department chair):

Each image is annotated with metadata that describes who the subject is, what camera took the picture, what the conditions were, and so forth. In BXGrid, you can see all the metadata for a given image like this:

Before BXGrid, all of this data was stored in an ordinary file system as big directories of images. This worked acceptably, but required an enormous amount of error prone scripting in order to answer interesting research questions. For example, a user might want to locate all close up face images taken in low light with a given camera, using only data for subjects with more than twenty images. You can do this in a filesystem, but it isn't easy, and it certainly isn't fast.

So, we designed BXGrid to be a filesystem-database hybrid that can store large amounts of data reliably, but enable new modes of exploration. The system consists of one central database that indexes all of the metadata, and sixteen active storage servers that provide storage that scales in both capacity and performance. Each item in the system is replicated three times across the cluster for reliability, so you can continue to operate even with several storage servers offline. A nice web interface on the front makes it easy to search, download, and process data from your desktop.

What's more is that the database really simplifies tasks that were previously arduous. For example, when ingesting new data into the system, a human needs to manually verify that each image really is of the intended person. BXGrid can simply pop up a screen that shows newly images alongside a selection of known good images of the old subject, and the user can quickly scan them and press a button if there is a problem. What used to be a Sisyphean task for one poor graduate student can now be accomplished by ten people working together in a few hours. Here is what it looks like:

The next step is to automate research tasks using abstractions. Many research problems in biometrics can be answered using the following formula:

Select a number of images according to some criterion.
Transform those images by a standard function.
Compare all of those images to each other with another function.

To formalize it a little bit, we call this the BXGrid abstraction:

S = Select( R );
T = Transform( S, F(s) );
M = AllPairs( T, G(x,y) );

Although easily stated, each of these steps is computationally expensive to perform on a large amount of data. On a single machine, a workload could take months just to compute.

However, BXGrid can be used to dramatically accelerate discovery. The database facilitates fast Select operations by virtue of indexing, the active storage cluster acclerates Transform by virtue of parallel storage, and our computing grid provides the All-Pairs capability on hundreds of processors. Once results are generated, they are sent back to the database where they can be shared with other users.

Wednesday, October 22, 2008

Abstractions for Distributed Computing

My current research revolves around the idea of abstractions for distributed computing. An abstraction is a way of simplifying a workload that runs on thousands of machines, in much the same way that a high level language simplifies the tiresome process of programming in assembly language. Let me explain a little more.

Real assembly language has operations like this:

MOV memory to register
PUSH register to stack
CALL procedure

As you may know, programming in assembly language stinks. The programmer has to keep track of the limited number of registers in use, the current state of the stack, and the meaning of external memory locations. If that's not enough, you have to worry about the behavior of instructions that have wildly varying runtimes, unusual exceptions, and sometimes asynchronous behavior.

We find much the same story in distributed computing, where the operations are something like:

TransferFile( source, destination );
ExecuteJob( executable, input, output );
AllocateVM( cpu, mem, disk, time );

If this is your instruction set, then you have many of the same problems. You have to manage a limited amount of local and remote storage, carefully cleaning up when jobs complete. If that's not enough, you have to worry about the behavior of instructions that have wildly varying runtimes, unusual exceptions, and sometimes asynchronous behavior. Sound familiar?

So, we advocate that users should employ high level abstractions that hide many of these ugly details, allowing the user to focus on the big picture. An abstraction expresses a pattern of computation that is simple, but may be very large, and thus requires significant effort to implement correctly and efficiently.

Chris Moretti is working on one abstraction called AllPairs. This abstraction crops up in a variety of domains, including biology, bioinformatics, and data mining, to name a few. It is easily stated:

AllPairs( set A, set B, function F ) returns matrix M
where M[i,j] = F( A[i],B[j] );

Of course, AllPairs is easy to execute on a small problem: just write a nested loop. But, what if sets A and B have 10,000 elements of 1MB each, and the function F takes ten seconds to run? You would be looking at 100TB of I/O and 1157 days of computation.

Instead, you need to harness a distributed system such as Condor to split up the computation across hundreds of machines. But you can't just submit 100,000,000 jobs: the queueing system would fall over. You can't just blindly have each job access the data over the network: the file server would file over. Without some careful tuning, your task will run even more slowly than the sequential version. The larger the problem gets, the more you have to worry about.

Fortunately, if your program fits in the All-Pairs abstraction, then we have solved the problem for you. If your "sets" are a bunch of files in a directory and your "function" is a Unix program that compares two objects, then you can simply invoke this at the command line:

allpairs A B Func

In the background, the All-Pairs implementation will measure the size of the objects, test the runtime of the function, choose the resources to use, distribute the data to the nodes, deal with failures, and then clean up the system. The user only has to worry about the problem to solve, not the method of achieving it.

In short, by using an abstraction, we can guide our users to appropriate ways of solving problems, and get them results in a reasonable time. You can read more about this in our paper on All-Pairs presented at IPDPS in the spring.

In future posts, I'll elaborate on other abstractions that we are designing and implementing.

Prof. Douglas Thain