Friday, February 14, 2014
Visualizing 10,000 Cores
We have tried a number of approaches at visualizing this complex system over the years. Our latest tool, the Condor Matrix Display started as a summer project by Nick Jaeger, a student from the University of Wisconsin at Eau Claire. The display shows a colored bar for each slot in the pool, where the width is proportional to the number of cores.
With a quick glance, you can see how many users are busy and whether they are running "thin" (1 core) or "fat" (many core) jobs. Sorting by the machine name gives you sense of how each sub-cluster in the pool is used:
While sorting by users gives you a sense of what users are dominating the pool:
The display is always a nice way of viewing the relatively new feature of "dynamic slot" in Condor. A large multi-core machine is now represented as a single slot with multiple resources. For example, this bit of the display shows a cluster of 8-core machines where some of the machines are unclaimed (green), some are running 4-core jobs (blue), and some are running 1-core jobs (green):
Tuesday, December 9, 2008
Visualizing Clusters in Real Time


This display shows the total CPU time consumed by different users over the last year. A few students have really racked up a lot of computation!

All of these displays rotate automatically on our "scoreboard", which is a monitor in a public hallway of the Engineering building at Notre Dame. It's not unusual for me to step out of my office only to find a few students checking to see who scored the highest over the last week! We ought to give a yearly award for the students who makes the best use of the system.
On a more philosophical note, the displays really help to make our work more concrete to outsiders. Because we work with intangible software instead of test tubes or fissile material, we don't have a "lab" to show visitors or prospective students. A live picture draws people in: random people from other departments stop in the hallway to look at the scoreboard and ask what is going on. A little effort put into "advertising" goes a long way.
Sunday, November 30, 2008
Visualizing a Large Distributed System with Enavis
The picture doesn't really do it justice: you can grab, twist, and scroll the view, and the graph reacts in real-time. It's really quite fun to play around with. You can use it to debug performance problems, chase down intruders, or just observe system behavior over time.
The challenge with any visualization is deciding what small part of the available data to display. Lockdown collects an enormous amount of data: anytime a program makes a network connection, we record the host, user, program, and port numbers. This data has been recorded continuously across hundreds of machines for about a year now. Even if you pick one moment in time, you cannot possible display all of the active data in any reasonable way.
Instead, you begin by a known starting location and a point in time, say user 33 last Thursday. What you get is a graph with user 33 at the center, out to a radius of one. If you want to see more, increase the radius, and the view expands:
The meta-visualization represents hosts (H), users (U), and applications (A). You simply click on the graph to add or remove edges and modify the main display. For example, if the user adds an edge between H and U, then the main graph will show the relationship between hosts and users. If H has a circular link, then the main graph will show which hosts are talking to each other. The meta-visualization is a nice compact way of representing all 63 possible slices of the data.
For more information, you can read the paper about Enavis or visit the Lockdown website.