Prof. Douglas Thain: cloud computing

Showing posts with label cloud computing. Show all posts

Thursday, October 1, 2009

Partly Cloudy with a Chance of Condor

We have been thinking about cloud computing quite a bit over the last month. As I noted earlier, cloud computing is hardly a new idea, but it does add a few new twists on some old concepts in distributed systems. So, we are spending some time to understand how we can take our existing big applications and make them work with cloud systems and software. It should come as no surprise that there are a number of ways to use Condor to harness clouds for big applications.

Two weeks ago, I gave a talk titled Science in the Clouds at an NSF workshop on Cloud Computing and the Geosciences. One of the points that I made was that although clouds make it easy to allocate new machines that have exactly the environment you want, they don't solve the problem of work management. That is, if you have one million tasks to do, how do you reliably distribute them between your workstation, your campus computer center, and your cloud workforce? For this, you need some kind of job execution system, which is largely what grid computing has focused on:

As it stands, Condor is pretty good at managing work across multiple different kinds of systems. In fact, today you can go to a commercial service like Cycle Computing, who can build an on-demand Condor pool by allocating machines from Amazon:

Just today, we hosted Dhruba Borthakur at Notre Dame. Dhruba is the project lead for the open source Apache Hadoop system. We are cooking up some neat ways for Condor and Hadoop to play together. As a first step, one of my students Peter Bui has cooked up a module for Parrot that talks to HDFS, the Hadoop file system. This allows any Unix program -- not just Java -- talk to HDFS, without requiring the kernel configuration and other headaches of using FUSE. Then, you can submit your jobs into a Condor pool and allow them to access data in HDFS as if it were a local file system. The next step is to co-locate the Condor jobs with the Hadoop data that they want to access.

Finally, if you are interested in cloud computing, you should attend CCA09 - Cloud Computing and Applications - to be held in Chicago on October 20th. This will be a focused, one day meeting with speakers from industry, academia who are both building and using cloud computers.

Thursday, December 11, 2008

Abstractions, Grids, and Clouds at IEEE e-Science 2008

I just attended the IEEE conference on e-Science in Indianapolis, and gave this talk on harnessing distributed systems with high level abstractions.

Another highlight of the conference was Rich Wolski's talk on Eucalyptus, an open source toolkit for cloud computing. Like Nimbus, it is API compatible with Amazon's EC2. That is, if your code works with Amazon, you can install Nimbus or Eucalyptus on your on cluster and run your own cloud.

Rich also spoke on the distinction between clouds and grids, which to many are the same idea. However, he pointed out a very important distinction:

- Clouds provide a resource allocation service. You ask for a certain number of machines, you get them for a certain amount of time, and you can choose to use them however you like. This gives you guaranteed service, which is great if you are running a web server, but can lead to underutilization if your goal is to run a large number of simulations.

- Grids provide a task execution service. You submit work to be executed, and the system decides where and when to execute the tasks, perhaps preempting or migrating them along the way. You have no guarantees of service, but the system will get very high utilization, making it good for high throughput computing.

As such, you can combine the two ideas together. For example, Cycle Computing provides a value added web service that employs both Amazon and Condor. You request a certain number of CPUs to execute a certain number of tasks. Cycle allocates the CPUs using Amazon, installs Condor on the nodes, and then runs the jobs. The result is a grid running on a cloud.

Wednesday, October 1, 2008

Clusters, Grids, and Clouds:
It's Turtles All the Way Down

In this blog, I'll discuss open problems and new developments in the field of distributed systems.

A distributed system is a set of independent computers that accomplish a meaningful task by working together. The field covers a huge range of interesting systems, from sensor networks to multi-tier web server farms. For the most part, I will be writing about the sort of large computing systems used to attack very large problems generated by big science and big business.

These systems go by many different names that mean very similar things:

A cluster is a distributed system that consists of a number of identical machines owned by a single entity, usually stacked up in a closet or a machine room. Clusters became the most common form of high performance computing in the 1990s and are the type of system now dominating the Top 500 List of supercomputers.
A grid is a distributed system that enables people to access computing resources from different institutions over the wide area. The term grid computing was coined by Ian Foster and Carl Kesselman in the late 1990s to describe easy access to large scale computational power. Examples of grids include TeraGrid and the Open Science Grid.
A cloud is a distributed system where the user doesn't care exactly what resources are used to carry out an operation; this is virtualization in the most abstract sense. There exist commercial clouds such as Amazon EC2, as well proprietary clouds found in nearly any industrial scale web site.

Although people love to argue about these terms, the differences aren't terribly important. In fact, the terms often represent different facets of the same systems. When you submit a job to a cluster, you are treating it like a cloud, because you don't care on exactly which CPU the job runs. A grid is usually built up from multiple clusters and connected together by wide area networks and software. A cloud is usually contained in a single data center. If you need to access it remotely, then you need a grid.

If you want to really want to mix it up, read about Ed Walker's MyCluster, which is a system that uses a grid to allocate a personal cluster, which you can then use as a cloud!

Regardless of what we call these systems, the challenges are the same. How do we design the interactions between pieces so that the system is robust to failures, has acceptable performance, and protects the interests of all of the parties involved? Future posts in this column will focus on these fundamental problems.

So that you know where I am coming from, I am a professor at the University of Notre Dame, where I teach classes in distributed systems, operating systems, and compilers. I direct the Cooperative Computing Lab where a great group of students does the hard work to design and test new distributed systems. We develop solutions to problems in physics, chemistry, biometrics, and other fields, and them evaluate them on our shared testbed of 700 CPUs. I'll be writing more about these ideas in this column.