Wednesday, May 20, 2015

Writing Solid Tests is (Still) Hard

We have a nice little automatic build-and-test system for the Cooperative Computing Tools that brings together the capabilities of GitHub, Condor, and Docker.

Every proposed merge to the codebase is packaged up as a build job which is dispatched to our Condor pool.  Some of those jobs run natively on bare hardware, some jobs run on virtual machines, and some are running in Docker containers, but all of them are managed by Condor, so we can have a zillion builds going simultaneously without too much conflict.

The result is that anytime someone proposes a pull request, it gets run through the system and a few minutes later we get a new row on the web display that shows whether each platform built and tested correctly.  It's very handy, and provides for objective evaluation and gentle pressure on those who break the build.

(I should point out here that Patrick Donnelly and Ben Tovar have done a bang-up job of building the system, and undergraduate student Victor Hawley added the Docker component.)

Some days the board is all green, and some days it looks more like this:



But the hardest part of this seems to be writing the tests properly.  Each test is a little structured script that sets up an environment, runs some component of our software, and then evaluates the results.  It might start up a Chirp server and run some file transfers, or run Parrot on a tricky bit of Unix code, or run Makeflow on a little workflow to see if various options work correctly.

Unfortunately, there are many ways that the tests can fail without revealing a bug in the code! We recently added several platforms to the build, resulting in a large number of test failures.  Some of these were due to differences between Unix utilities like sh, dd, and sed on the various machines. Others were more subtle, resulting from race conditions in concurrent actions.  (For example, should you start a Master in the foreground and then a Worker in the background, or vice versa?)  There is a certain art to being able to write a shell script that is portable and robust.
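
To make this concrete, here is a minimal sketch of the kind of test script I mean. It is not our actual test harness, and the server and client programs are hypothetical; the point is the shape: set up an isolated environment, start the service under test in the background, wait until it is actually ready instead of sleeping and hoping, run the component, and check the result.

    #!/bin/sh
    # Hypothetical test sketch -- not the real cctools harness.
    set -e                                # fail the test on the first unexpected error

    workdir=$(mktemp -d)                  # isolated scratch space for this test
    server_pid=""
    trap 'kill "$server_pid" 2>/dev/null; rm -rf "$workdir"' EXIT

    # 1. Set up: start the (hypothetical) service under test in the background.
    ./my_server --port 9123 --logfile "$workdir/server.log" &
    server_pid=$!

    # 2. Avoid the startup race: poll until the server reports it is ready,
    #    rather than assuming it comes up before the client needs it.
    #    (Assumes the hypothetical server logs a "listening" line once ready.)
    i=0
    while [ $i -lt 30 ]; do
        grep -q "listening" "$workdir/server.log" 2>/dev/null && break
        sleep 1
        i=$((i+1))
    done

    # 3. Run the component under test.
    ./my_client --server localhost:9123 --input test.dat --output "$workdir/out.dat"

    # 4. Evaluate the result.
    cmp "$workdir/out.dat" expected.dat || { echo "FAIL: output differs"; exit 1; }
    echo "PASS"

Even this little sketch leans on mktemp, grep, and cmp behaving the same way everywhere, which is exactly where the portability surprises tend to creep in.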

There is also a tension in the complexity of the tests.  On one hand, you want short, focused tests that exercise individual features, so that they can be completed in a few minutes and give immediate feedback.

On the other hand, you also want to run big complex applications, so as to test the system at scale and under load.  We don't really know that a given release of Parrot works at scale until it has run on 10K cores for a week for a CMS physics workload.  If each core consumes 30W of power over 7 days, that's a 50 megawatt-hour test!  Yikes!
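
For the record, the arithmetic behind that figure:

    10\,000 \text{ cores} \times 30\,\text{W} \times 168\,\text{h} = 50\,400\,\text{kWh} \approx 50\,\text{MWh}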

Better not run that one automatically.


Monday, May 19, 2014

Toward a Common Model of Highly Concurrent Programming

(This is the short version of a talk I gave at the MTAGS workshop at Supercomputing 2013.  See the slides here.)

Historically, highly concurrent programming has been closely associated with high performance computing.  Two programming models have been dominant: shared memory machines in which concurrency was expressed via multiple threads, and distributed memory machines in which concurrency was expressed via explicit message passing.  It is widely agreed that both of these programming models are very challenging, even for the veteran programmer.  In both cases, the programmer is directly responsible for designing the program from top to bottom and handling all of the issues of granularity, consistency, and locality necessary to achieve acceptable performance, with very little help from the runtime or operating systems.

However, a new approach to concurrent programming has been emerging over the last several years, in which the user programs in a much higher level language and relies upon the system to handle many of the challenging underlying details.  To achieve this, the program is successively decomposed into simpler representations, such that each layer of the system can gradually adapt it to the hardware available.

The layers can be described as follows:
  • A declarative language (DL) for compactly representing a complete program.
  • A directed acyclic graph (DAG) to represent the expanded program and its resources.
  • A bag of independent tasks (BOT) with explicit input and output dependencies.
  • A shared-nothing cluster to which data and tasks must be assigned.
Several different research communities have arrived at this computing model somewhat independently: the high performance computing community, the scientific workflow community, and the cloud computing community.  In each case, the scale and complexity of the systems in use eventually made it impossible for the programmer or the user to take responsibility for all of the challenges of parallel/distributed computing.  Although each community employs different technologies and has distinct optimization goals, the overall structure of these systems is surprisingly similar.

A (very incomplete) selection of systems that follow this model:


Layer                          Cloud Stack    Workflow Stack        HPC Stack
Declarative Language (DL)      Pig            Weaver                Swift-T
Directed Acyclic Graph (DAG)   Map-Reduce     Makeflow              -
Bag of Tasks (BOT)             JobTracker     Work Queue Master     Turbine
Distributed Data               HDFS           Work Queue Workers    MPI

Each layer of the system fulfills a distinct need.  The declarative language (DL) at the top is compact, expressive, and easy for end users, but is intractable to analyze in the general case because it may have a high order of complexity, possibly Turing-complete.  The DL can be used to generate a (large) directed acyclic graph (DAG) that represents every single task to be executed.  The DAG is not a great user-interface language, but it is much more suitable for a system to perform capacity management and optimization because it is a finite structure with discrete components.  A DAG executor then plays the graph, dispatching individual tasks as their dependencies are satisfied.  The BOT consists of all the tasks that are ready to run, and is then scheduled onto the underlying computer, using the data dependencies made available from the higher levels.
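
To make the layers concrete using the tools from the table, here is a toy example; the file names and commands are made up, but the structure is the point. The Makeflow file below is the DAG. At any moment, the rules whose inputs already exist form the bag of tasks: initially part1.out and part2.out, and merged.out only once both of those finish.

    # toy.makeflow -- a hypothetical three-rule DAG in Makeflow's make-style syntax.

    part1.out: input.dat analyze.sh
        ./analyze.sh input.dat 1 > part1.out

    part2.out: input.dat analyze.sh
        ./analyze.sh input.dat 2 > part2.out

    merged.out: part1.out part2.out merge.sh
        ./merge.sh part1.out part2.out > merged.out

A command along the lines of "makeflow -T wq toy.makeflow" (option names vary across versions) then plays this DAG against a pool of Work Queue workers on a shared-nothing cluster, dispatching each task as its inputs appear.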

Why bother with this sort of model?  It allows us to compare the fundamental capabilities and expressiveness of different kinds of systems.  For example, in the realm of compilers, everyone knows that a proper compiler consists of a scanner, a parser, an optimizer, and a code generator.  Through these stages, the input program is transformed from a series of tokens to an abstract syntax tree, an intermediate representation, and eventually to assembly code.  Not every compiler uses all of these stages, much less the same code, but a common vocabulary makes it easier to understand, compare, and design new systems.


Friday, February 14, 2014

Visualizing 10,000 Cores

Our Condor pool at the University of Notre Dame has been slowly growing, in no small part due to our collaboration with the Center for Research Computing, where it is now scavenging unused cycles from HPC clusters at the CRC.  When the dedicated batch system leaves a node unused, Condor is started on that node and keeps going until the dedicated system wants the node back.  Depending on the time of year, that leaves anywhere between 4K and 10K cores available in the Condor pool.

We have tried a number of approaches to visualizing this complex system over the years.  Our latest tool, the Condor Matrix Display, started as a summer project by Nick Jaeger, a student from the University of Wisconsin at Eau Claire.  The display shows a colored bar for each slot in the pool, where the width is proportional to the number of cores.

With a quick glance, you can see how many users are busy and whether they are running "thin" (1-core) or "fat" (many-core) jobs.  Sorting by machine name gives you a sense of how each sub-cluster in the pool is used:


While sorting by user gives you a sense of which users are dominating the pool:


The display is also a nice way of viewing the relatively new "dynamic slots" feature in Condor. A large multi-core machine is now represented as a single slot with multiple resources. For example, this bit of the display shows a cluster of 8-core machines where some of the machines are unclaimed (green), some are running 4-core jobs (blue), and some are running 1-core jobs (green):



Monday, February 6, 2012

Some Open Computer Science Problems in Workflow Systems

In the previous article, I extolled the virtues of Makeflow, which has been very effective at engaging new users and allowing them to express their workflows in a way that facilitates parallel and distributed computing. We can consistently get new users from a single laptop to 100 cores in a distributed system with very little effort.

However, as we develop experience in scaling up workflows to thousands of cores across wide area distributed systems, a number of interesting computer science challenges have emerged. These problems are not specific to Makeflow, but can be found in most workflow systems:

Capacity Management
Just because a workflow expresses thousand-way concurrency doesn't mean that it is actually a good idea to run it on one thousand nodes! The cost of moving data to and from the execution nodes may outweigh the benefit of the added computational power. If one uses fewer nodes than the available parallelism, then it may be possible to pay the data movement cost once, and then exploit it multiple times. For most workflows, there is a "sweet spot" at which performance is maximized. Of course, users don't want to discover this by experiment; they need some tool to recommend an appropriate size for the given workflow.
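
One crude way to see why a sweet spot exists (an illustrative back-of-the-envelope model, not something any current tool computes) is to suppose the workflow contains W node-seconds of computation and that every node which joins must first spend D seconds staging its input data:

    T(n) \approx D + W/n             % wall-clock time on n nodes
    C(n) = n \cdot T(n) = nD + W     % total node-seconds consumed
    E(n) = W / (nD + W)              % parallel efficiency

Runtime keeps dropping as n grows, but once W/n falls below D the extra nodes spend most of their time moving data rather than computing, so somewhere around n ≈ W/D the marginal speedup stops justifying the added transfer and occupancy cost. That crossover is the sweet spot a recommendation tool would have to find, using measured values of D and W for the workflow at hand.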

Software Engineering Tools
A workflow is just another kind of program: it has source code that must be managed, dependencies that must be gathered, and a history of modification to be tracked. In this sense, we are badly in need of tools for manipulating workflows in coherent ways. For example, we need a linker that can take a workflow, find all the dependent components, and gather them together in one package. We need a loader that can take an existing workflow, load it into a computing system, and then update file names and other links to accommodate it. We need a profiler that can report on the time spent across multiple runs of a workflow, so as to determine where the problem spots may be.

Portability and Reproducibility
Makeflow itself enables portability across execution systems. For example, you can run your application on Condor or SGE without modification. However, that doesn't mean that your applications are actually portable. If one cluster runs Blue Beanie Linux 36.5 and another runs Green Sock Linux 82.7, your chances of the same executable running on both are close to zero. Likewise, if you run a workflow one day, then set it aside for a year, it's possible that your existing machine has been updated to the point where the old workflow no longer runs.

However, if we also explicitly state the execution environment in the workflow, then this can be used to provide applications with what they need to run. The environment might be as simple as a directory structure with the applications, or as complex as an entire virtual machine. Either way, the environment becomes data that must be managed and moved along with the workflow, which affects the performance and cost issues discussed above.

Composability
Everything in computing must be composable. That is, once you get one component working, the very next step is to hook it up to another so that it runs as a subcomponent. While we can technically hook up one Makeflow to another, this doesn't currently happen in a way that results in a coherent program. For example, the execution method and resource limits don't propagate from one Makeflow to another. To truly enable large-scale structures, we need a native method of connecting workflows together that connects not only the control flow, but the resource allocation, capacity management, and everything else discussed above.

Effortless Scalability

As a rule of thumb, I tell brand new users that running a Makeflow on 10 cores simultaneously is trivial, running on 100 cores is usually easy, and getting to 1000 cores will require some planning and debugging. Going over 1000 cores is possible (our largest system is running on 5000 cores) but requires a real investment of time by the user.

Why does scale make things harder? One reason is that computer systems are full of artificial limits that are not widely known or managed effectively. On a Unix-like system, a process can have only a limited number of open file descriptors, and a directory can hold only a limited number of files. (Most people don't figure this out until they hit the limit, and then the work must be restructured to accommodate it.) A complex network with translation devices may have a limited number of simultaneous network connections. A data structure that was small enough to ignore suddenly becomes unmanageable when there are 10,000 entries.
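
A few of these limits are easy to inspect on a typical Linux machine (the exact names and defaults vary from system to system):

    # Per-process and system-wide limits that large workflows tend to trip over:
    ulimit -n                    # max open file descriptors for this shell and its children
    ulimit -u                    # max processes for this user
    cat /proc/sys/fs/file-max    # system-wide limit on open files (Linux)
    getconf ARG_MAX              # max bytes of arguments + environment passed to exec()

    # Raise the soft file-descriptor limit for one session, up to the hard limit:
    ulimit -n 4096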

To have a software system that can scale to enormous size, you need to address these known technical issues, but also have methods of accommodating limits that you didn't expect. You also need an architecture that can scale naturally and observe its own limits to understand when they are reached. An ideal implementation would know its own limits and not require additional experts in order to scale up.

---

Each of these problems, though briefly described here, is pretty hefty once you start digging into it. Some are large enough to earn a PhD. (In fact, some are already in progress!) They all have the common theme of making data-intensive workflows manageable, usable, portable, and productive across a wide variety of computing systems.

More to follow.

Wednesday, February 1, 2012

Why Makeflow Works for New Users

In past articles, I have introduced Makeflow, a large-scale workflow engine that we have created at Notre Dame.

Of course, Makeflow is certainly not the first or only workflow engine out there. But, Makeflow does have several unique properties that make it an interesting platform for bringing new people into the world of distributed computing. And, it sits at the right level of abstraction to let us address some of the fundamental computer science problems that result.

Briefly, Makeflow is a tool that lets the user express a large number of tasks by writing them down as a conventional makefile. You can see an example on our web page. A Makeflow can be just a few rules long, or it can consist of hundreds to thousands of tasks, like this EST pipeline workflow:

Once the workflow is written down, you can then run Makeflow in several different ways. You can run it entirely on your workstation, using multiple cores. You can ask Makeflow to send the jobs to your local Condor pool, PBS or SGE cluster, or other batch system. Or, you can start the (included) Work Queue system on a few machines that you happen to have handy, and Makeflow will run the jobs there.
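
For a sense of what that looks like in practice, the invocations run along these lines (the exact flags and the default Work Queue port can differ between releases, so treat this as a sketch):

    # Run on the local workstation, using multiple cores:
    makeflow example.makeflow

    # Send the same jobs to a batch system instead:
    makeflow -T condor example.makeflow
    makeflow -T sge example.makeflow

    # Or use the included Work Queue system: start the master...
    makeflow -T wq example.makeflow

    # ...and, on whatever machines you happen to have handy, start workers
    # that connect back to it (9123 is the usual default port):
    work_queue_worker master.host.edu 9123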

Over the last few years, we have had very good experience getting new users to adopt Makeflow, ranging from highly sophisticated computational scientists all the way to college sophomores learning the first principles of distributed computing. There are a couple of reasons why this is so:
  • A simple and familiar language. Makefiles are already a well known and widely used way of expressing dependency and concurrency, so it is easy to explain. Unlike more elaborate languages, it is brief and easy to read and write by hand. A text-based language can be versioned and tracked by any existing source control method.
  • A neutral interface and a portable implementation. Nothing in a Makeflow references any particular batch system or distributed computing technology, so existing workflows can be easily moved between computing systems. If I use Condor and you use SGE, there is nothing to prevent my workflow from running on your system.
  • The data needs are explicit. A subtle but significant difference between Make and Makeflow is that Makeflow treats your statement of file dependencies very seriously. That is, you must state exactly which files (or directories) your computation depends upon. This is slightly inconvenient at first, but vastly improves the ability of Makeflow to create the right execution environment, verify a correct execution, and manage storage and network resources appropriately.
  • An easy on-ramp to large resources. We have gone to great lengths to make it absolutely trivial to run Makeflow on a single machine with no distributed infrastructure. Using the same framework, you can move to harnessing a few machines in your lab (with Work Queue) and then progressively scale up to enormous scale using clusters, clouds, and grids. We have users running on 5 cores, 5000 cores, and everything in between.
Of course, our objective is not simply to build software. Makeflow is a starting point for engaging our research collaborators, which allows us to explore some hard computer science problems related to workflows. In the next article, I will discuss some of those challenges.


Tuesday, November 16, 2010

The Virtualization Theorem Ignored for Three Decades

Today, in my graduate operating systems class, we discussed what I believe is the most important result in computer science ever to be persistently ignored:

Popek and Goldberg, Formal Requirements for Virtualizable Third Generation Architectures, Communications of the ACM, Volume 17, Issue 7, July 1974.

This paper puts forth a very simple principle that must be observed in order for a CPU to be capable of hosting virtual machines. First, two definitions:
  • A sensitive instruction reads or modifies supervisor state.
  • A privileged instruction traps if attempted in user mode.
And this central theorem:
  • All sensitive instructions must be privileged.
Here is why this is important. A conventional operating system (OS) is in charge of the whole machine, and is free to modify the processor status, page tables, I/O devices, and other sensitive aspects of the machine in order to run normal processes.

But, if you take that OS and put it in a virtual machine (VM), it is no longer in charge of the whole machine. All of those actions on sensitive state must be translated in some way by the virtual machine monitor (VMM). The simplest way to accomplish that translation is to run the OS in user mode, allowing the VMM to execute sensitive operations on its behalf. To make sure that the VMM gets all of the sensitive operations, they must all be forced to trap.

This principle was articulated very clearly in 1974, when virtualization was already a widely applied technique in the world of mainframe computing. Unfortunately, the principle didn't make the leap into the microcomputer world. In fact, there was an enduring tradition of releasing processors that were not virtualizable, only to realize the mistake and issue a second version with a minor fix.

For example, the venerable Motorola 68000 was first released in 1978, and was heralded as a "mainframe on a chip".  Except, it had one little problem: a read from the sensitive status register did not trap, preventing the construction of a proper virtual machine monitor.  So, Motorola issued the 68010, which was almost identical, except that a read from the status register forced a trap, enabling correct virtualization.

Unfortunately, not everybody got the memo.

For nearly three decades, the Intel x86 series of processors did not have this property. In user mode, many instructions could be used to view sensitive state, and many attempts to write sensitive state would fail silently without a trap. From the 1970s until the late 1990s, efficient virtualization was basically impossible on the most widely used processor family.

Around the year 2000, virtualization became of interest as a way to service the multi-tenancy needs of large internet services. A number of solutions were developed simultaneously to work around the limitations of the Intel chips. One approach, used in VMware, was to translate the guest's machine code at runtime, converting sensitive instructions into deliberate traps to the VMM. Another approach, used in the Xen hypervisor, was to modify the operating system code so that it explicitly called the VMM instead of invoking sensitive instructions.

There are many other approaches to working around the limitation. Suffice it to say that they are all rather complicated, but they can be made to work.

Finally, in 2005, both Intel and AMD introduced virtualization extensions to their processors, enabling basic trap-and-emulate virtualization, only 31 years after the Popek and Goldberg theorem was widely circulated.

So, what's the moral of the story?

Monday, November 8, 2010

Sometimes It All Comes Together

Most days, software engineering involves compromises and imperfect solutions. It's rare for two pieces of software to mesh perfectly -- you always have to work to overcome the limitations or assumptions present in different modules. But, every once in a while, the pieces just come together in a satisfying way.

A few weeks back, we ran into a problem with BXGrid, our system for managing biometric data. Our users had just ingested a whole pile of new images and videos, and were browsing and validating the data. Because the data was recently ingested, no thumbnails had been generated yet, so every screenful required a hundred or so thumbnails to be created from the high-resolution images. Multiply that by each active user, and you have a web site stuck in the mud.

A better solution would be to generate all of the missing thumbnails offline in an orderly way. Since many of the transcoding operations are compute intensive, it makes sense to farm them out to a distributed system.

Peter Bui -- a graduate student in our group -- solved this problem elegantly by putting together almost all of our research software simultaneously. He used Weaver as the front-end language to query BXGrid and determine what thumbnails needed to be generated. Weaver generated a Makeflow to perform all of the transcodings. Makeflow used Work Queue to execute the tasks, with the Workers submitted to our campus Condor pool.

So far, so good. But, the missing piece was that Makeflow expects data to be available as ordinary files. At the time, this would have required that we copy several terabytes out of the archive onto the local disk, which wasn't practical. So, Peter wrote a module for Parrot which enabled access to BXGrid files under paths like /bxgrid/fileid/123. Then, while attached to Parrot, Makeflow could access the files, being none the wiser that they were actually located in a distributed system.
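
Schematically, that last step looks something like this (a sketch, assuming a Parrot build that includes the custom bxgrid driver; the makeflow file name is made up, and the file id comes from the example above):

    # Any program run under Parrot sees BXGrid files as ordinary paths,
    # without first copying terabytes out of the archive:
    parrot_run cp /bxgrid/fileid/123 local_copy.jpg

    # The same trick lets Makeflow itself open the files directly:
    parrot_run makeflow thumbnails.makeflow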

Put it all together, and you have this:

Sometimes, it all comes together.