Prof. Douglas Thain: grid computing

Showing posts with label grid computing. Show all posts

Thursday, October 1, 2009

Partly Cloudy with a Chance of Condor

We have been thinking about cloud computing quite a bit over the last month. As I noted earlier, cloud computing is hardly a new idea, but it does add a few new twists on some old concepts in distributed systems. So, we are spending some time to understand how we can take our existing big applications and make them work with cloud systems and software. It should come as no surprise that there are a number of ways to use Condor to harness clouds for big applications.

Two weeks ago, I gave a talk titled Science in the Clouds at an NSF workshop on Cloud Computing and the Geosciences. One of the points that I made was that although clouds make it easy to allocate new machines that have exactly the environment you want, they don't solve the problem of work management. That is, if you have one million tasks to do, how do you reliably distribute them between your workstation, your campus computer center, and your cloud workforce? For this, you need some kind of job execution system, which is largely what grid computing has focused on:

As it stands, Condor is pretty good at managing work across multiple different kinds of systems. In fact, today you can go to a commercial service like Cycle Computing, who can build an on-demand Condor pool by allocating machines from Amazon:

Just today, we hosted Dhruba Borthakur at Notre Dame. Dhruba is the project lead for the open source Apache Hadoop system. We are cooking up some neat ways for Condor and Hadoop to play together. As a first step, one of my students Peter Bui has cooked up a module for Parrot that talks to HDFS, the Hadoop file system. This allows any Unix program -- not just Java -- talk to HDFS, without requiring the kernel configuration and other headaches of using FUSE. Then, you can submit your jobs into a Condor pool and allow them to access data in HDFS as if it were a local file system. The next step is to co-locate the Condor jobs with the Hadoop data that they want to access.

Finally, if you are interested in cloud computing, you should attend CCA09 - Cloud Computing and Applications - to be held in Chicago on October 20th. This will be a focused, one day meeting with speakers from industry, academia who are both building and using cloud computers.

Tuesday, June 2, 2009

Grid Heating: Putting Data Center Heat to Productive Use

Dr. Paul Brenner, a research scientist in the Computing Research Center at the University Notre Dame, has been advocating a novel idea called grid heating. He recently won a "Green IT Award" from the Uptime Institute for his work. Here is a short introduction to the idea:

Around the world, large data centers consume enormous amounts of power. In addition to the energy needed to spin disks and rearrange electrons, an approximately equal amount of power is needed to run the air conditioners and fans to remove that heat from the data center. In this sense, data centers are doubly inefficient, because they are using power to both heat and cool the same space. If we could put that heat to productive use, then we could save energy on cooling the data center, as well as save energy that would have otherwise been used to generate heat.

Last year. Dr. Brenner constructed a prototype of this idea at the city greenhouse in South Bend, which was struggling with enormous heating bills during the winter. He constructed a small cluster, and placed it in the Arizona Desert display in the greenhouse, where the plants need the highest temperature. Notre Dame paid the electricity bill, the greenhouse got the benefit of the heat, and the computers simply joined our campus Condor pool. Everybody wins, and nobody has to pay an air conditioning bill.

However, the first cluster was just a prototype, and couldn't generate nearly enough heat for the entire greenhouse. So, this year, Dr. Brenner is building a small data center in a modular shipping container. next to the greenhouse. With a new electricity and network hookup, the data center will run several hundred CPUs, and function as a secondary furnace for the facility, hopefully reducing the heating bill by half over the winter.

The new facility will significantly add to our campus grid, and will also give us some interesting scheduling problems to work on. The greenhouse needs heat the most during the winter, and to a lesser extent during the summer, so the computing capacity of the system will change with the seasons. Further, the price of electricity varies significantly during the day, so jobs run in the dead of night may be cheaper than those run during the day. If we can connect our "campus grid" to the "smart electric grid", we can make the system automatically schedule around these constraints.

Here are some recent articles about Grid Heating:

Thursday, December 11, 2008

Abstractions, Grids, and Clouds at IEEE e-Science 2008

I just attended the IEEE conference on e-Science in Indianapolis, and gave this talk on harnessing distributed systems with high level abstractions.

Another highlight of the conference was Rich Wolski's talk on Eucalyptus, an open source toolkit for cloud computing. Like Nimbus, it is API compatible with Amazon's EC2. That is, if your code works with Amazon, you can install Nimbus or Eucalyptus on your on cluster and run your own cloud.

Rich also spoke on the distinction between clouds and grids, which to many are the same idea. However, he pointed out a very important distinction:

- Clouds provide a resource allocation service. You ask for a certain number of machines, you get them for a certain amount of time, and you can choose to use them however you like. This gives you guaranteed service, which is great if you are running a web server, but can lead to underutilization if your goal is to run a large number of simulations.

- Grids provide a task execution service. You submit work to be executed, and the system decides where and when to execute the tasks, perhaps preempting or migrating them along the way. You have no guarantees of service, but the system will get very high utilization, making it good for high throughput computing.

As such, you can combine the two ideas together. For example, Cycle Computing provides a value added web service that employs both Amazon and Condor. You request a certain number of CPUs to execute a certain number of tasks. Cycle allocates the CPUs using Amazon, installs Condor on the nodes, and then runs the jobs. The result is a grid running on a cloud.

Monday, October 6, 2008

Troubleshooting Distributed Systems via Data Mining

One of our students, David Cieslak, just presented this paper on troubleshooting large distributed systems at the IEEE Grid Computing Conference in Japan. Here's the situation:

You have a million jobs to run.
You submit them to a system of thousands of CPUs.
Half of them complete, and half of them fail.

Now what do you do? It's hopeless to debug any single failure, because you don't know if it represents the most important case. It's a thankless job to dig through all of the log files of the system to see where things went wrong. You might try re-submitting all of the jobs, but chances are that half of those will fail, and you will just be wasting more time and resources pushing them to the finish.

Typically, these sorts of errors come arise from an incompatibility of one kind or another. The many versions of Linux present the most outrageous examples. Perhaps your job assumes that the program wget can be found on any old Linux machine. Oops! Perhaps your program is dynamically linked against SSL version 2.3.6.5.8.2.1, but some machines only have version 2.3.6.5.8.1.9. Oops! Perhaps your program crashes on a machine with more than 2GB of physical memory, because it performs improper pointer arithmetic. Oops!

So, to address this problem, David has constructed a nice tool that reads in some log files, and then diagnoses the properties of machines or jobs associated with failures, using techniques from the field of data mining. (We implemented this on log files from Condor, but you could apply the principle to any similar system.) Of course, the tool cannot diagnose the root cause, but it can help to narrow down the source of the problem.

For example, consider the user running several thousand jobs on our 700 CPU Condor pool. Jobs tended to fail within minutes on certain set of eleven machines. Of course, as soon as those jobs failed, the machines were free to run more jobs, which promptly failed. By applying GASP, we discovered a common property among those machines:

(TotalVirtualMemory < 1048576)

They only had one gigabyte of virtual memory! (Note: The units are KB.) Whenever a program would consume more than that, it was promptly killed by the operating system. This was simply a mistake made in configuration -- our admins fixed the setting, and the problem went away.

Here's another problem we found on the Wisconsin portion of the Open Science Grid. Processing the log data from 100,000 jobs submitted in 2007, we found that most failures were associated with this property:

(FilesystemDomain="hep.wisc.edu")

It turns out that a large number of users submitted jobs assuming that the filesystem they needed would be mounted on all nodes of the grid. Not so! Since this was an historical analysis, we could not repair the system, but it did give us a rapid understanding of an important usability aspect of the system.

If you want to try this out yourself, you can visit our web page for the software.

Wednesday, October 1, 2008

Clusters, Grids, and Clouds:
It's Turtles All the Way Down

In this blog, I'll discuss open problems and new developments in the field of distributed systems.

A distributed system is a set of independent computers that accomplish a meaningful task by working together. The field covers a huge range of interesting systems, from sensor networks to multi-tier web server farms. For the most part, I will be writing about the sort of large computing systems used to attack very large problems generated by big science and big business.

These systems go by many different names that mean very similar things:

A cluster is a distributed system that consists of a number of identical machines owned by a single entity, usually stacked up in a closet or a machine room. Clusters became the most common form of high performance computing in the 1990s and are the type of system now dominating the Top 500 List of supercomputers.
A grid is a distributed system that enables people to access computing resources from different institutions over the wide area. The term grid computing was coined by Ian Foster and Carl Kesselman in the late 1990s to describe easy access to large scale computational power. Examples of grids include TeraGrid and the Open Science Grid.
A cloud is a distributed system where the user doesn't care exactly what resources are used to carry out an operation; this is virtualization in the most abstract sense. There exist commercial clouds such as Amazon EC2, as well proprietary clouds found in nearly any industrial scale web site.

Although people love to argue about these terms, the differences aren't terribly important. In fact, the terms often represent different facets of the same systems. When you submit a job to a cluster, you are treating it like a cloud, because you don't care on exactly which CPU the job runs. A grid is usually built up from multiple clusters and connected together by wide area networks and software. A cloud is usually contained in a single data center. If you need to access it remotely, then you need a grid.

If you want to really want to mix it up, read about Ed Walker's MyCluster, which is a system that uses a grid to allocate a personal cluster, which you can then use as a cloud!

Regardless of what we call these systems, the challenges are the same. How do we design the interactions between pieces so that the system is robust to failures, has acceptable performance, and protects the interests of all of the parties involved? Future posts in this column will focus on these fundamental problems.

So that you know where I am coming from, I am a professor at the University of Notre Dame, where I teach classes in distributed systems, operating systems, and compilers. I direct the Cooperative Computing Lab where a great group of students does the hard work to design and test new distributed systems. We develop solutions to problems in physics, chemistry, biometrics, and other fields, and them evaluate them on our shared testbed of 700 CPUs. I'll be writing more about these ideas in this column.

Prof. Douglas Thain