Monday, May 19, 2014

Toward a Common Model of Highly Concurrent Programming

(This is the short version of a talk I gave at the MTAGS workshop at Supercomputing 2013.  See the slides here.)

Historically, highly concurrent programming has been closely associated with high performance computing.  Two programming models have been dominant: shared memory machines in which concurrency was expressed via multiple threads, and distributed memory machines in which concurrency was expressed via explicit message passing.  It is widely agreed that both of these programming models are very challenging, even for the veteran programmer.  In both cases, the programmer is directly responsible for designing the program from top to bottom and handling all of the issues of granularity, consistency, and locality necessary to achieve acceptable performance, with very little help from the runtime or operating systems.

However, a new approach to concurrent programming has been emerging over the last several years, in which the user programs in a much higher level language and relies upon the system to handle many of the challenging underlying details.  To achieve this, the program is successively decomposed into simpler representations, such that each layer of the system can gradually adapt it to the hardware available.

The layers can be described as follows:
  • declarative language (DL) for compactly representing a complete program.
  • directed graph (DAG) to represent the expanded program and its resources.
  • bag of independent tasks (BOT) with explicit input and output dependencies.
  • A shared-nothing cluster to which data and tasks must be assigned.
Several different research communities have arrived at this computing model somewhat independently: the high performance computing community, the scientific workflow community, and the cloud computing community.  In each case, the scale and complexity of the systems in use eventually made it impossible for the programmer or the user to take responsibility for all of the challenges of parallel/distributed computing.  Although each community employs different technologies and has distinct optimization goals, the overall structure of these systems is surprisingly similar.

A (very incomplete) selection of systems that follow this model:

Layer Cloud Stack Workflow Stack HPC Stack
Declarative Language (DL)    Pig Weaver Swift-T
Directed Acyclic Graph (DAG)Map-Reduce    Makeflow -
Bag of Tasks (BOT)JobTracker Work Queue Master Turbine
Distributed Data HDFS Work Queue Workers    MPI

Each layer of the system fulfills a distinct need.  The declarative language (DL) at the top is compact, expressive, and easy for end users, but is intractable to analyze in the general case because it may have a high order of complexity, possibly Turing-complete.  The DL can be used to generate a (large) directed acyclic graph (DAG) that represents every single task to be executed.  The DAG is not a great user-interface language, but it is much more suitable for a system to perform capacity management and optimization because it is a finite structure with discrete components.  A DAG executor then plays the graph, dispatching individual tasks as their dependencies are satisfied.  The BOT consists of all the tasks that are ready to run, and is then scheduled onto the underlying computer, using the data dependencies made available from the higher levels.

Why bother with this sort of model?  It allows us to compare the fundamental capabilities and expressiveness of different kinds of systems.  For example, in the realm of compilers, everyone knows that a proper compiler consists of a scanner, a parser, an optimizer, and a code generator.  Through these stages, the input program is transformed from a series of tokens to an abstract syntax tree, an intermediate representation, and eventually to assembly code.  Not every compiler uses all of these stages, much less the same code, but by using a common language, it is helpful to understand, compare, and design new systems.

Friday, February 14, 2014

Visualizing 10,000 Cores

Our Condor pool at the University of Notre Dame has been slowly growing, in no small part due to our collaboration with the Center for Research Computing, where it is now scavenging unused cycles from HPC clusters at the CRC.  When the dedicated batch system leaves a node unused, Condor is started on that node and keeps going until the dedicated system wants the node back.  Depending on the time of year, that leaves anywhere between 4K and 10K nodes available in the Condor pool.

We have tried a number of approaches at visualizing this complex system over the years.  Our latest tool, the Condor Matrix Display started as a summer project by Nick Jaeger, a student from the University of Wisconsin at Eau Claire.  The display shows a colored bar for each slot in the pool, where the width is proportional to the number of cores.

With a quick glance, you can see how many users are busy and whether they are running "thin" (1 core) or "fat" (many core) jobs.  Sorting by the machine name gives you sense of how each sub-cluster in the pool is used:

While sorting by users gives you a sense of what users are dominating the pool:

The display is always a nice way of viewing the relatively new feature of "dynamic slot" in Condor.  A large multi-core machine is now represented as a single slot with multiple resources.  For example, this bit of the display shows a cluster of 8-core machines where some of the machines are unclaimed (green), some are running 4-core jobs (blue), and some are running 1-core jobs (green):