Tuesday, November 16, 2010

The Virtualization Theorem Ignored for Three Decades

Today, in my graduate operating systems class, we discussed what I believe is the most important result in computer science ever to be persistently ignored:

Popek and Goldberg, Formal Requirements for Virtualizible Third Generation Architectures, Communications of the ACM, Volume 17, Issue 7, July 1974.

This paper puts forth a very simple principle that must be observed in order for a CPU to be capable of running in a virtual machine. First, two definitions:
  • A sensitive instruction reads or modifies supervisor state
  • A privileged instruction traps if attempted in user mode.
And this central theorem:
  • All sensitive operations must be privileged.
Here is why this is important. A conventional operating system (OS) is in charge of the whole machine, and is free to modify the processor status, page tables, I/O devices, and other sensitive aspects of the machine in order to run normal processes.

But, if you take that OS and put it in a virtual machine (VM), it is no longer in charge of the whole machine. All of those actions on sensitive state must be translated in some way by the virtual machine monitor. The simplest way to accomplish that translation is to run the OS in user mode, allowing the VMM to execute sensitive operations on its behalf. To make sure that the VMM gets all of the sensitive operations, they must all be forced to trap.

This principle was articulated very clearly in 1974, when virtualization was already a widely applied technique in the world of mainframe computing. Unfortunately, the principle didn't make the leap into the microcomputer world. In fact, there was an enduring tradition of releasing processors that were not virtualizable, only to realize the mistake and issue a second version with a minor fix.

For example, the venerable Motorola 68000 was first released in 1978, and was heralded as a "mainframe on a chip". Except, it had one little problem: a read from the sensitive status register did not trap, preventing the construction of a virtual memory system. So, Motorola issued the 68010, which was almost identical, except that a read from the status register forced a trap, enabling correct virtualization of memory.

Unfortunately, not everybody got the memo.

For nearly three decades, the Intel x86 series of processors did not have this property. In user mode, many instructions could be used to view sensitive state, and many attempts to write sensitive state would fail silently without a trap. From the 1970s until the late 1990s, efficient virtualization was basically impossible on the most widely used processor family.

Around the year 2000, virtualization became of interest as a way to service the multi-tenancy needs of large internet services. A number of solutions were developed simultaneously to work around the limitations of the Intel chips. One approach used in VMWare was to dynamically compile assembly code at runtime to convert sensitive instructions into deliberate traps to the VMM. Another approach used in the Xen hypervisor was to modify the operating system code so that it explicitly called the VMM instead of invoking sensitive instructions.

There are many other approaches to working around the limitation. Suffice to say that they are all rather complicated, but they can be made to work.

Finally, in 2005, both Intel and AMD introduced virtualization extensions to their processors, enabling basic trap-and-execute virtualization, only 29 years after the Popek and Goldberg theorem was widely circulated.

So, what's the moral of the story?

Monday, November 8, 2010

Sometimes It All Comes Together

Most days, software engineering involves compromises and imperfect solutions. It's rare for two pieces of software to mesh perfectly -- you always have to work to overcome the limitations or assumptions present in different modules. But, every once in a while, the pieces just come together in a satisfying way.

A few weeks back, we ran into a problem with BXGrid, our system for managing biometric data. Our users had just ingested a whole pile of new images and videos, and were browsing and validating the data. Because data was recently ingested, no thumbnails had been generated yet, so every screenful required a hundred or so thumbnails to be created from the high resolution images. Multiple that by each active user, and you have a web site stuck in the mud.

A better solution would be to generate all of the missing thumbnails offline in an orderly way. Since many of the transcoding operations are compute intensive, it makes sense to farm them out to a distributed system.

Peter Bui -- a graduate student in our group -- solved this problem elegantly by putting together almost all of our research software simultaneously. He used Weaver as the front-end language to query BXGrid and determine what thumbnails needed to be generated. Weaver generated a Makeflow to perform all of the transcodings. Makeflow used Work Queue to execute the tasks, with the Workers submitted to our campus Condor pool.

So far, so good. But, the missing piece was that Makeflow expects data to be available as ordinary files. At the time, this would have required that we copy several terabytes out of the archive onto the local disk, which wasn't practical. So, Peter wrote a module for Parrot which enabled access to BXGrid files under paths like /bxgrid/fileid/123. Then, while attached to Parrot, Makeflow could access the files, being none the wiser that they were actually located in a distributed system.

Put it all together, and you have this:

Sometimes, it all comes together.

Monday, November 1, 2010

Compiling Workflows with Weaver

Over the last year, our Makeflow system has become quite popular here at Notre Dame. Briefly, Makeflow takes a workload expressed in the plain old Make format, and executes it in a distributed system, using the dependency information to set up the appropriate remote execution environment. It does not require a distributed filesystem, so it's easy to get your applications going on hundreds to thousands of processors from the cloud or the grid. Makeflow is currently the back-end engine for our science portals in bioinformatics, biometrics, and molecular dynamics.

It didn't take long before our users started writing scripts in Perl or Python in order to generate Makeflows with tens of thousands of nodes. Those scripts all did similar things (query a database, break a dataset into N pieces) but also started to get unruly and difficult to debug. It wasn't easy to look at a script generator and determine what it was trying to accomplish.

Enter Weaver, which is the creation of Peter Bui, one of our graduate students. Weaver is a high level Python framework that, in a few simple lines, can generate enormous (optimized) Makeflows. Peter presented a paper about Weaver at the workshop on Challenges of Large Applications in Distributed Environments at HPDC earlier this year.

Consider this common biometrics workload: extract all of the blue irises from our repository, then convert each iris into a 'template' data type, then compare all of them to each other. Here is how you do it in Weaver:

b = SQLDataSet (’db’,’biometrics','irises')
nefs = Query(db,db.color='Blue')

conv = SimpleFunction('convertiristotemplate',outsuffix='bit')
bits = Map(conv,nefs)

cmp = SimpleFunction('compareiristemplates')

In the simplest case, Weaver just emits one gigantic Makeflow that performs all of the operations. However, sometimes there are patterns that can be executed more efficiently, given some better underlying tool. AllPairs is the perfect example of this optimization -- you can do an AllPairs using Makeflow, but it won't be as efficient as our native implementation. If the various data types line up appropriately, Weaver will simply call the All-Pairs abstraction. If not, it will expand the call into Makeflow in the appropriate way.

In principle, this is a lot like a C compiler: under certain conditions, the addition of two arrays can be accomplished with a vector add instruction, otherwise it must be expanded into a larger number of conventional instructions. So, we think of Weaver as a compiler for workflows: it chooses the best implementation available to execute a complex program, leaving the programmer to worry about the higher level objectives.

Wednesday, October 27, 2010

From Database to Filesystem and Back Again

Hoang Bui is leading the development of ROARS: a Rich Object Archival System, which is our generalization many of the ideas expressed in the Biometrics Research Grid. Hoang presented a paper on ROARS at the workshop on Data Intensive Distributed Computing earlier this year.

What makes ROARS particularly interesting is that it combines elements of both relational databases and file systems, and makes it possible to swap back and forth between both representations of the data.

A ROARS repository is an unordered collection of items. Each item consists of a binary file and metadata that describes the file. The metadata does not have a schema; you can attach whatever properties you like to an object. Here is an example item consisting of a iris image with five properties:

fileid = 356
subjectid = "S123"
color = "Blue"
camera = "Likon"
date = "23-Oct-2010"
type = "jpeg"

If you like to think in SQL, then you can query the system via SQL and you get back tabular data, as you might expect:

SELECT fileid, subjectid, color FROM irises WHERE color='Blue';

Of course, if you are going to actually process the files in some way, you need to put them into a filesystem where your scripts and tools can access them. For this, you use the EXPORT command, which will produce the files. EXPORT has a neat bit of syntax in which you can specify that the name of each file is generated from the metadata. For example this command:

EXPORT irises WHERE camera='Likon' AS color/subjectid.type

will dump out all of the matching files, put them into directories according to color, and name each file according to the subject and the file type. The example above would be named "Blue/S123.jpeg". (If the naming scheme doesn't result in unique filenames, then you need to adjust it to include something that is unique like fileid.)

Of course, if you are going to process a huge amount of data, then you don't actually want to copy all of it out to your local filesystem. Instead what you can do is create a "filesystem view" which is a directory tree containing pointers back to the objects in the repository. That has a very similar syntax:

VIEW irises WHERE camera='Likon' AS color/subjectid.type

Creating a filesystem view is much faster than exporting the actual data. Now, you can run your programs or scripts to iterate over the files. As they open up each file, the repository is accessed directly to open and read the necessary file data. (This is accomplished transparently by using Parrot to connect to the repository.)

The end result: a database that looks like a filesystem!

Monday, October 18, 2010

Summer REU: Toward Elastic Scientific Applications

In recent months, we have been working on the problem of building elastic parallel applications that can adapt to the available resources at run-time. Much has been written about elastic internet services, but scientific applications have a ways to catch up.

Traditional parallel applications are rigid: the user chooses how many nodes (or cores or CPUs) to use when the program starts. If more resources become available, or the application needs to grow, it is stuck. Even worse, if a node is lost due to a failure or a scheduling change, the program must be aborted. Rigid parallelism has been used for many years in dedicated clusters and supercomputers in the form of libraries such as MPI. It works fine for systems of tens or hundreds of nodes, but if you try to go bigger, it gets harder and harder to find a fully reliable system.

In contrast, an elastic parallel application can be modified at run-time to use greater or fewer resources as they become available, or if the size of the problem changes. Typically, an elastic application has one central coordinating node that tracks the progress of the program, and dispatches work units to other nodes. If new nodes are added to the system, the coordinator gives it some work to do. If a node fails or is removed, the coordinator makes a note of this, and sends the work to another node.

If you have an elastic application, then it becomes much easier to harness large scale computing systems such as clouds and grids. In fact, it becomes easier to harness any kind of computer, because you don't have to worry about them being reliable or even particularly fast. It's also useful in a traditional computing center, because you don't have to sit idle waiting for your ideal number of nodes to become free -- you can start work with whatever is available now.

The only problem is, most existing applications are rigidly parallel. Is it feasible to convert them into elastic applications?

We hosted two REU students to address this question: Anthony Canino, from SUNY-Binghamtom, and Zachary Musgrave, from Clemson University. Each took an existing rigid application and converted it into an elastic parallel application using our Work Queue framework. Work Queue has a simple C API, and makes use of a universal Worker executable that can be submitted to multiple remote systems. A Work Queue application looks like this:
Anthony worked on the problem of replica exchange, which is a technique for running molecular simulations in parallel at different energy levels, in order to achieve a more rapid exploration of the energy landscape. Our friends in the Laboratory for Computational Life Sciences down the hall have developed a molecular dynamics engine known as Protomol, and then implemented replica exchange using MPI. Anthony put Protomol and Work Queue together to create an implementation of replica exchange that can run on an arbitrary number of processors, and demonstrated it running on Condor and SGE simultaneously hundreds of nodes. What's even better is that the computation kernel was simply the sequential version of Protomol, so we avoided all of the software engineering headaches that would come with changing the base software.

Zachary worked with the genome annotation tool Maker, which is used to do things like finding protein sequences within an existing genome. Maker was already parallelized using Perl-MPI, so this required Zach to do some reverse engineering to get at the basic algorithm. However, it became clear that the MPI aspect was simply doling out work units to each node, with the additional optimization of work stealing. Zach added a Perl interface to Work Queue, and converted Maker into an elastic application running on hundreds of nodes. We are currently integrating Maker into Biocompute, our local bioinformatics portal.
Speaking of Biocompute, Notre Dame student Brian Kachmarck did a nice job this summer of re-working the user interface to the web site. Not only is it faster and more visually appealing, it also does a better job of presenting the Data-Action-Queue concept described in our recent paper about the system.

Monday, April 5, 2010

The Forty Tribes of Linux

As I have noted in this column before, a perennial challenge of distributed computing in the real world is dealing with the multiplicity of operating systems and related environments. If you are dealing with an uncontrolled environment like a large university or an 'at home' computing environment, there is no telling what you are going to get. If you have a piece of software that depends exactly on the presence of Linux, it just isn't going to work.

You might think that this could be avoided by having a professionally managed environment. At Notre Dame, we have a site license for Red Hat Linux, and our staff are pretty rigorous in keeping everything up to date and on track. But even then, you can't assume everything is identical: there is no way to upgrade everyone simultaneously, and every machine operates on a different schedule (and discipline) for picking up automatic updates. For example, we are currently in the tail of of a general campus migration from Red Hat 4 to Red Hat 5.

Here is some hard evidence. We recently started using the neat 'cron' feature in Condor to make a daily observation of the operating system version, kernel version, and C library version of each machine. With a few variations on condor_status, we can see the upgrade status of the whole system:

The major release numbers (below) aren't too bad. About 3/4 of our cores are running the latest Red Hat, but another 73 machines are behind by a version or two. And, oops, looks like someone plugged in their own personal CentOS machine. Not too hard to deal with, if you are careful to put 'redhat_version' in your requirements:

% condor_status -format "%s\n" redhat_version | sort | uniq -c | sort -rn

782 Red Hat Enterprise Linux Server release 5.4 (Tikanga)
27 Red Hat Enterprise Linux AS release 4 (Nahant Update 7)
26 Red Hat Enterprise Linux Server release 5.3 (Tikanga)
10 Red Hat Enterprise Linux AS release 4 (Nahant Update 8)
10 Red Hat Enterprise Linux WS release 4 (Nahant Update 7)
4 CentOS release 5.3 (Final)

If we go a little deeper, the picture gets murkier. Below are the distribution of Linux kernel versions. Interesting to note that a few are hand-modified for some unusual hardware, and only two are Xen virtualized. Hope that you don't have any code sensitive to the kernel version.

% condor_status -format "%s\n" kernel_version | sort | uniq -c | sort -rn

342 2.6.18-164.9.1.el5
294 2.6.18-164.el5
94 2.6.18-164.10.1.el5
32 2.6.18-164.11.1.el5
32 2.6.9-78.0.13.ELsmp
14 2.6.18-128.7.1.el5
12 2.6.18-164.6.1.el5
10 2.6.18-128.2.1.el5
6 2.6.18-164.2.1.el5
5 2.6.9-78.0.17.ELsmp
4 2.6.9-89.0.20.ELsmp
2 2.6.18-128.4.1.el5
2 2.6.18-164.9.1.el5xen
2 2.6.9-78.0.5.ELsmp
2 2.6.9-89.0.16.ELsmp
2 2.6.9-89.0.9.ELsmp

For completeness, here is the distribution of glibc versions, which has much the same story:

% condor_status -format "%s\n" glibc_version | sort | uniq -c

452 glibc-2.5-42.el5_4.2
296 glibc-2.5-42
34 glibc-2.5-42.el5_4.3
24 glibc-2.3.4-2.41
16 glibc-2.5-34.el5_3.1
14 glibc-2.5-34
13 glibc-2.3.4-2.41.el4_7.1
6 glibc-2.3.4-2.43
4 glibc-2.3.4-2.43.el4_8.1

In the good old days, you could just indicate that a program required OpSys=="LINUX" and more or less expect it to run. That certainly isn't possible now. Perhaps we are misleading users by talking about this thing called Linux, which doesn't really exist in any consistent form. Instead, we should be telling our users that a new operating system gets invented every week, and is usually named after a team on Survivor.

The good folks at Sun tried to solve this problem almost 20 years ago with Java. The idea was that they would create a stable platform that could be implemented on any machine. Then, you could write programs that would be universally portable. The problem was, well...

% condor_status -format "%s " JavaVendor -format "%s\n" JavaVersion | sort | uniq -c | sort -rn
308 Sun Microsystems Inc. 1.6.0
222 Sun Microsystems Inc. 1.6.0_15
174 Sun Microsystems Inc. 1.6.0_17
52 Free Software Foundation, Inc. 1.4.2
28 Sun Microsystems Inc. 1.6.0_18
3 Sun Microsystems Inc. 1.5.0_17
2 Apple Computer, Inc. 1.5.0_19

Many people think the grand solution to this problem is virtual machines. Perhaps, but more on that next time.

Thursday, January 21, 2010

Summer REU at Notre Dame

We invite outstanding undergraduates to apply for summer research
positions in scientific and cloud computing at the University of Notre Dame.
Students will build and operate systems that harness hundreds of
machines at once to attack large problems in science and engineering.

Research topics include:
  • Green Cloud Computing
  • Portals for Scientific Research
  • Languages for Distributed Computing
More information is available here:


Applications received by March 1st will be given first consideration.

Monday, January 11, 2010

Green Cloud Online

The Green Cloud is now online!

The Green Cloud is the invention of Dr. Paul Brenner at the ND Center for Research Computing. It is a containerized data center located at the South Bend city greenhouse, stocked with used servers kindly donated by Ebay, Inc. The first batch of machines was installed in December, and will eventually reach about 400 cores once everything is turned on.

What makes the data center unique is that is has no air conditioning. Instead, the data center takes in ambient air, and then exhausts it into the greenhouse. This benefits Notre Dame, since we no longer pay the cost of cooling, but it also benefits the greenhouse, which has significant heating costs during the winter months. (We used to call this idea grid heating.)

Of course, this means the capacity of the system may change with the weather. During the winter, the system can run at full blast and deliver as much heat as possible to the greenhouse. During the spring and fall, the heat may not be needed, and can be vented outdoors. During the hottest part of the summer, we may need to shut some machines down to get the temperature under control. However, recent studies by big data center operators suggest that machine temperature could be safely increased to 80 or 90 degrees, so there may be a fair amount of headroom available. We will see.

For a normal data center that runs web servers and databases, shutting down machines is not really an option. However, the Green Cloud provides fungible computing power for large computations in science and engineering at Notre Dame. If structured correctly, these workloads can adapt to 10 or 100 or 1000 cores. So, turning machines on and off will affect performance, but not correctness.

A good example of a flexible workload is genome assembly. Two of our students, Christopher Moretti and Michael Olson presented initial results on a Scalable Genome Assembler at the MTAGS Workshop held at Supercomputing 2009. Their assembler uses our Work Queue framework to manage a variable workforce, pushing out sequence fragments to whatever machines are available. The system has scaled up to about 1000 cores, spread across the Notre Dame campus, the Green Cloud, Purdue University, and the University of Wisconsin.

We are currently working on a journal paper and an open source release of the assembler, so stay tuned for details.