Prof. Douglas Thain: visualizing

Showing posts with label visualizing. Show all posts

Friday, February 14, 2014

Visualizing 10,000 Cores

Our Condor pool at the University of Notre Dame has been slowly growing, in no small part due to our collaboration with the Center for Research Computing, where it is now scavenging unused cycles from HPC clusters at the CRC. When the dedicated batch system leaves a node unused, Condor is started on that node and keeps going until the dedicated system wants the node back. Depending on the time of year, that leaves anywhere between 4K and 10K nodes available in the Condor pool.

We have tried a number of approaches at visualizing this complex system over the years. Our latest tool, the Condor Matrix Display started as a summer project by Nick Jaeger, a student from the University of Wisconsin at Eau Claire. The display shows a colored bar for each slot in the pool, where the width is proportional to the number of cores.

With a quick glance, you can see how many users are busy and whether they are running "thin" (1 core) or "fat" (many core) jobs. Sorting by the machine name gives you sense of how each sub-cluster in the pool is used:

While sorting by users gives you a sense of what users are dominating the pool:

The display is always a nice way of viewing the relatively new feature of "dynamic slot" in Condor. A large multi-core machine is now represented as a single slot with multiple resources. For example, this bit of the display shows a cluster of 8-core machines where some of the machines are unclaimed (green), some are running 4-core jobs (blue), and some are running 1-core jobs (green):

Tuesday, December 9, 2008

Visualizing Clusters in Real Time

The end of the semester is nearing, so activity in our distributed system really shoots up as undergraduates finish their semester projects and graduate students hurry to generate those last few research results. You can see this activity reflected in a number of visual displays that we have created to track activity in the system.

For example, the following is a snapshot of an applet that displays the current state of all machines in our Chirp storage cluster. Each machine is represented by a box, where colors indicate resources on each machine (memory, disk, cpu), and arrows indicate active network transfers. You can click on the following snapshot, or view the current live display if you like.

With a quick view, you can see four different things going on. Two students are working on their semester project in distributed systems, which is to create a dynamic cloud of web servers. This is reflected in the spread of network transfers going from two submit machines near the top, transmitting web server data to the hosts where they happen to run. One student is running a large All-Pairs job: this is reflected in the small vertical arrows about one third of the way down, indicating heavy disk traffic local to each machine. Closer to the bottom, someone is keeping the CPUs busy on the cluster named sc0-xx: this cluster runs Hadoop, so it is probably a Map-Reduce job.

Of course, this kind of display is only good for the high level immediate picture. It only tells you what is going on in the last minute. For a more historical view, we track and publish data from our Condor distributed batch system. For example, this shows the CPU utilization of our system over the last week. The red part shows the number of CPUs in use by the person at the keyboard, the green part shows the number of CPUs idle, and the blue part shows the idle CPUs harnessed by Condor. As you can tell, things picked up in the middle of the week:

This display shows the total CPU time consumed by different users over the last year. A few students have really racked up a lot of computation!

All of these displays rotate automatically on our "scoreboard", which is a monitor in a public hallway of the Engineering building at Notre Dame. It's not unusual for me to step out of my office only to find a few students checking to see who scored the highest over the last week! We ought to give a yearly award for the students who makes the best use of the system.

On a more philosophical note, the displays really help to make our work more concrete to outsiders. Because we work with intangible software instead of test tubes or fissile material, we don't have a "lab" to show visitors or prospective students. A live picture draws people in: random people from other departments stop in the hallway to look at the scoreboard and ask what is going on. A little effort put into "advertising" goes a long way.

Sunday, November 30, 2008

Visualizing a Large Distributed System with Enavis

Two students at Notre Dame, Qi Liao and Andrew Blaich, recently received the Best Paper award at USENIX LISA for their work on Enavis, a tool that gives a visual display of network traffic collected by the Lockdown network administration tool. Enavis gives the administrator of a large network a way to browse all of the users, programs, hosts, and network connections in a system of hundreds or thousands of machines. Here is what it looks like:

The picture doesn't really do it justice: you can grab, twist, and scroll the view, and the graph reacts in real-time. It's really quite fun to play around with. You can use it to debug performance problems, chase down intruders, or just observe system behavior over time.

The challenge with any visualization is deciding what small part of the available data to display. Lockdown collects an enormous amount of data: anytime a program makes a network connection, we record the host, user, program, and port numbers. This data has been recorded continuously across hundreds of machines for about a year now. Even if you pick one moment in time, you cannot possible display all of the active data in any reasonable way.

Instead, you begin by a known starting location and a point in time, say user 33 last Thursday. What you get is a graph with user 33 at the center, out to a radius of one. If you want to see more, increase the radius, and the view expands:

There are many different ways to slice and filter the data. In the simplest case, you might be interested in known which hosts are talking to each other, or which programs or talking to each other, or which users are talking to each other. Or, you might want a mix: show what users are talking to each other, via which programs. To control all of these possibilities, Enavas has a meta-visualization: a graph that controls which data to display:

The meta-visualization represents hosts (H), users (U), and applications (A). You simply click on the graph to add or remove edges and modify the main display. For example, if the user adds an edge between H and U, then the main graph will show the relationship between hosts and users. If H has a circular link, then the main graph will show which hosts are talking to each other. The meta-visualization is a nice compact way of representing all 63 possible slices of the data.

For more information, you can read the paper about Enavis or visit the Lockdown website.