Tuesday, July 28, 2009

REU Project: Biocompute

This summer, we hosted four REU students who contributed to two web portals for distributed computing: Biocompute and BXGrid. I'll write about one this week and the other next week.

REU students Ryan Jansen and Joey Rich worked with recent grad Rory Carmichael on Biocompute, our web portal and computing system for bioinformatics research. Biocompute was originally created by Patrick Braga-Henebry for his B.S. honors thesis, and we are now putting it into production in collaboration with the Bioinformatics Core Facility at Notre Dame.

Biocompute allows researchers at Notre Dame to run standard bioinformatics tools like BLAST, and then share and manage the results. The new twist is that we transparently parallelize the tasks and run them on our campus Condor pool. This allows people to run tasks that were previously impossible: we routinely run workloads that would take months on a single machine but finish in hours on Biocompute.

The user simply fills out a form specifying the query, genomic databases, and so forth:

Biocompute transforms the request into a large Makeflow job that looks like this:

Users and administrators can view the progress of each job:

When the task is complete, you can browse the results, download them, or feed them into another tool on the web site:

This work was sponsored in part by the Bioinformatics Core Facility and the National Science Foundation under grant NSF-06-43229.


  1. This is an excellent example of the Makeflow system that you mentioned in your previous post.

    I am curious about the types of Condor job requirements used. Are they standard x86 and Linux, or more specific? Also, does the system make a Condor job for each BLAST job, or does it group them into a number of units?

  2. The current version of Biocompute runs on a 64-core cluster of 64-bit Linux machines. However, in the near future, we intend to generalize the system to run on any old Linux machine available in the Condor pool.

    The typical user submits a whole file of, say, 100,000 strings to query against the databases. Biocompute will break that up into, say, 100 Condor jobs of 1,000 queries each. We discovered pretty quickly that this doesn't quite work: some queries are only a few bytes long while others are thousands of bytes long, so jobs with the same number of queries can differ widely in size. In the second version, we split up jobs by the total number of bytes in the queries rather than by the number of queries.
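
To make the byte-based splitting concrete, here is a minimal sketch in Python. It is not the actual Biocompute code: the function name `split_by_bytes`, the `max_bytes` budget, and the sample query sizes are all hypothetical, and the queries are assumed to be plain strings already held in memory.

```python
def split_by_bytes(queries, max_bytes):
    """Greedily group query strings into chunks of roughly max_bytes each.

    Hypothetical helper for illustration only. Queries keep their original
    order. A single query larger than max_bytes gets a chunk of its own.
    """
    chunks = []
    current = []
    current_bytes = 0
    for query in queries:
        size = len(query.encode("utf-8"))
        # Close the current chunk if adding this query would exceed the budget.
        if current and current_bytes + size > max_bytes:
            chunks.append(current)
            current = []
            current_bytes = 0
        current.append(query)
        current_bytes += size
    if current:
        chunks.append(current)
    return chunks


# Made-up queries of very different lengths, in bytes: 4, 2000, 6, 1500, 8, 900.
queries = ["ACGT", "A" * 2000, "GGTTAA", "C" * 1500, "ACGTACGT", "T" * 900]
chunks = split_by_bytes(queries, max_bytes=2048)
chunk_sizes = [sum(len(q) for q in chunk) for chunk in chunks]
```

Each chunk would then become one Condor job. Splitting by a fixed count instead, say two queries per job, would put 2,004 bytes in the first job and only 908 in the last, which is the kind of imbalance described above.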