Prof. Douglas Thain: bxgrid

Showing posts with label bxgrid. Show all posts

Monday, November 8, 2010

Sometimes It All Comes Together

Most days, software engineering involves compromises and imperfect solutions. It's rare for two pieces of software to mesh perfectly -- you always have to work to overcome the limitations or assumptions present in different modules. But, every once in a while, the pieces just come together in a satisfying way.

A few weeks back, we ran into a problem with BXGrid, our system for managing biometric data. Our users had just ingested a whole pile of new images and videos, and were browsing and validating the data. Because data was recently ingested, no thumbnails had been generated yet, so every screenful required a hundred or so thumbnails to be created from the high resolution images. Multiple that by each active user, and you have a web site stuck in the mud.

A better solution would be to generate all of the missing thumbnails offline in an orderly way. Since many of the transcoding operations are compute intensive, it makes sense to farm them out to a distributed system.

Peter Bui -- a graduate student in our group -- solved this problem elegantly by putting together almost all of our research software simultaneously. He used Weaver as the front-end language to query BXGrid and determine what thumbnails needed to be generated. Weaver generated a Makeflow to perform all of the transcodings. Makeflow used Work Queue to execute the tasks, with the Workers submitted to our campus Condor pool.

So far, so good. But, the missing piece was that Makeflow expects data to be available as ordinary files. At the time, this would have required that we copy several terabytes out of the archive onto the local disk, which wasn't practical. So, Peter wrote a module for Parrot which enabled access to BXGrid files under paths like /bxgrid/fileid/123. Then, while attached to Parrot, Makeflow could access the files, being none the wiser that they were actually located in a distributed system.

Put it all together, and you have this:

Sometimes, it all comes together.

Wednesday, October 27, 2010

From Database to Filesystem and Back Again

Hoang Bui is leading the development of ROARS: a Rich Object Archival System, which is our generalization many of the ideas expressed in the Biometrics Research Grid. Hoang presented a paper on ROARS at the workshop on Data Intensive Distributed Computing earlier this year.

What makes ROARS particularly interesting is that it combines elements of both relational databases and file systems, and makes it possible to swap back and forth between both representations of the data.

A ROARS repository is an unordered collection of items. Each item consists of a binary file and metadata that describes the file. The metadata does not have a schema; you can attach whatever properties you like to an object. Here is an example item consisting of a iris image with five properties:

fileid = 356
subjectid = "S123"
color = "Blue"
camera = "Likon"
date = "23-Oct-2010"
type = "jpeg"

If you like to think in SQL, then you can query the system via SQL and you get back tabular data, as you might expect:

SELECT fileid, subjectid, color FROM irises WHERE color='Blue';

Of course, if you are going to actually process the files in some way, you need to put them into a filesystem where your scripts and tools can access them. For this, you use the EXPORT command, which will produce the files. EXPORT has a neat bit of syntax in which you can specify that the name of each file is generated from the metadata. For example this command:

EXPORT irises WHERE camera='Likon' AS color/subjectid.type

will dump out all of the matching files, put them into directories according to color, and name each file according to the subject and the file type. The example above would be named "Blue/S123.jpeg". (If the naming scheme doesn't result in unique filenames, then you need to adjust it to include something that is unique like fileid.)

Of course, if you are going to process a huge amount of data, then you don't actually want to copy all of it out to your local filesystem. Instead what you can do is create a "filesystem view" which is a directory tree containing pointers back to the objects in the repository. That has a very similar syntax:

VIEW irises WHERE camera='Likon' AS color/subjectid.type

Creating a filesystem view is much faster than exporting the actual data. Now, you can run your programs or scripts to iterate over the files. As they open up each file, the repository is accessed directly to open and read the necessary file data. (This is accomplished transparently by using Parrot to connect to the repository.)

The end result: a database that looks like a filesystem!

Monday, August 3, 2009

REU Project: BXGrid

This post continues last week's subject of summer REU projects.

Rachel Witty and Kameron Srimoungchanh worked on BXGrid, our web portal and computing system for biometrics research. This project is a collaboration between the Cooperative Computing Lab and the Computer Vision Research Lab at Notre Dame. Hoang Bui is the lead graduate student on the project. Rachel and Kameron added a bunch of new capabilities to the system; I'll show three examples today.

The first is the ability to handle 3-D face scans taken by a specialized camera equipped with a laser rangefinder. The still picture here doesn't quite do it justice, because each white "mask" on the left is a rotating animation of the face. By integrating this data into BXGrid, the 3-D data can be validated against previous ordinary images of the face.

I previously discussed All-Pairs problems, which are common in biometrics. While we already had the ability to run very large All-Pairs, problems, we never had the capability to view the results easily. Now, with the click of a button, you can set up a small All-Pairs problem and view the results on the portal:

Currently, new data ingested into the system is validated manually by people who must visually check that a eye, face, or whatever matches existing data in the system. Although this can be divided up among a large team of people, it is still time consuming and error prone.

Kameron and Rachel built a system that does a first pass at this task automatically. Using Makeflow, they set up a system to export all newly ingested images along with five good images that should match. This results in thousands of jobs sent to our Condor pool, which transform and compare the images. When all the results come back, you get a nice web page that summarizes the images and the results:

This research was supported in part by the National Science Foundation via grant NSF-CCF-0621434.

Tuesday, July 28, 2009

REU Project: Biocompute

This summer, we hosted four REU students who contributed to two web portals for distributed computing: Biocompute and BXGrid. I'll write about one this week and the other next week.

REU students Ryan Jansen and Joey Rich worked with recent grad Rory Carmichael on Biocompute, our web portal and computing system for bioinformatics research. Biocompute was originally created by Patrick Braga-Henebry for his B.S. honors thesis, and we are now putting it into production in collaboration with the Bioinformatics Core Facility at Notre Dame.

Biocompute allows researchers at Notre Dame to run standard bioinformatics tools like BLAST, and then share and manage the results. The new twist is that we transparently parallelize the tasks and run them on our campus Condor pool. This allows people to run tasks that were previously impossible: we routinely run workloads that would take months on a single machine, but get completed in hours on Biocompute.

The user simply fills out a form specifying the query, genomic databases, and so forth:

Biocompute transforms the request into a large Makeflow job that looks like this:

Users and administrators can view the progress of each job:

When the task is complete, you can browse the results, download them, or feed them into another tool on the web site:

This work was sponsored in part by the Bioinformatics Core Facility and the National Science Foundation under grant NSF-06-43229.

Monday, December 29, 2008

BXGrid: The Biometrics Research Grid

One of our graduate students, Hoang Bui, presented a poster on the Biometrics Research Grid (BXGrid) at the IEEE e-Science conference in Indianapolis a few weeks ago. BXGrid is a large data repository that we have built to support both production research in biometrics and to provide a platform for research in large scale data intensive computing. It provides another example of the idea of abstractions for distributed computing. This project is a collaboration between Dr. Patrick Flynn and my research group at Notre Dame.

The Computer Vision Research Lab studies methods for identifying people via biometrics such as fingerprints, iris scans, and surveillance videos. The group collects hundreds of thousands of images and movies from hundreds of volunteers on campus, and uses them to test clever new identification algorithms. For example, here is an atlas of photos from one particular subject (our department chair):

Each image is annotated with metadata that describes who the subject is, what camera took the picture, what the conditions were, and so forth. In BXGrid, you can see all the metadata for a given image like this:

Before BXGrid, all of this data was stored in an ordinary file system as big directories of images. This worked acceptably, but required an enormous amount of error prone scripting in order to answer interesting research questions. For example, a user might want to locate all close up face images taken in low light with a given camera, using only data for subjects with more than twenty images. You can do this in a filesystem, but it isn't easy, and it certainly isn't fast.

So, we designed BXGrid to be a filesystem-database hybrid that can store large amounts of data reliably, but enable new modes of exploration. The system consists of one central database that indexes all of the metadata, and sixteen active storage servers that provide storage that scales in both capacity and performance. Each item in the system is replicated three times across the cluster for reliability, so you can continue to operate even with several storage servers offline. A nice web interface on the front makes it easy to search, download, and process data from your desktop.

What's more is that the database really simplifies tasks that were previously arduous. For example, when ingesting new data into the system, a human needs to manually verify that each image really is of the intended person. BXGrid can simply pop up a screen that shows newly images alongside a selection of known good images of the old subject, and the user can quickly scan them and press a button if there is a problem. What used to be a Sisyphean task for one poor graduate student can now be accomplished by ten people working together in a few hours. Here is what it looks like:

The next step is to automate research tasks using abstractions. Many research problems in biometrics can be answered using the following formula:

Select a number of images according to some criterion.
Transform those images by a standard function.
Compare all of those images to each other with another function.

To formalize it a little bit, we call this the BXGrid abstraction:

S = Select( R );
T = Transform( S, F(s) );
M = AllPairs( T, G(x,y) );

Although easily stated, each of these steps is computationally expensive to perform on a large amount of data. On a single machine, a workload could take months just to compute.

However, BXGrid can be used to dramatically accelerate discovery. The database facilitates fast Select operations by virtue of indexing, the active storage cluster acclerates Transform by virtue of parallel storage, and our computing grid provides the All-Pairs capability on hundreds of processors. Once results are generated, they are sent back to the database where they can be shared with other users.

Prof. Douglas Thain