linuxwebcluster.com

Amazon EC2 Cloud Render Farm Integration

On request from a prospective customer, I looked into the idea of integrating RenderFarmer with cloud computing technology. Of course we are already compatible with a PXE-booted VMware virtual machine image, but cloud computing is all about external resources that you can offload work to. Amazon EC2 is the leader in the "cloud for rent" category, so I did some homework to figure out how this might work. The idea is simple: if you want on-demand capacity in your render farm, just boot up some cloud-based "virtual render nodes" and they'll join your farm and go to work alongside whatever local nodes you might have.

At first, the integration does not look straightforward. In RenderFarmer, everything gets tied in to a cluster shared root filesystem by PXE booting into an image that goes out over the local network and links to whatever it needs. The nodes are diskless. But the latency between the cloud and our office is far too high for that kind of live network interaction to work. So no PXE. However, the whole point of PXE / shared root is that it scales fast: you can add nodes in a few minutes with no sysadmin skills or effort. Amazon's EC2 scales fast too. They have their own methods for doing this, so let's take advantage of them and not worry about PXE!

A "server" in EC2 is stored as an image, called an AMI (Amazon Machine Image). There are lots of basic images publicly available. You can choose one and then launch as many instances of it as you like. So our first challenge was to build our own AMI that had whatever cluster magic we would have gotten via PXE boot. After a while, a 32-bit Fedora Core 8 render node emerged that had all the required ingredients.
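For readers who would rather script the "launch as many instances as you like" step than click through a console, here is a minimal sketch. It is not what we used for this post. It assumes boto3 (the current AWS SDK for Python) and working AWS credentials, the AMI ID is a made-up placeholder, and "m1.small" is my assumption for the API name of the "small" type. The helper that builds the request arguments runs on its own, while the actual launch call only works with boto3 installed.

```python
def launch_params(ami_id: str, instance_type: str, count: int) -> dict:
    """Build keyword arguments for an EC2 RunInstances request."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        # Asking for the same minimum and maximum means "exactly this many".
        "MinCount": count,
        "MaxCount": count,
    }


def launch_render_nodes(ami_id: str, instance_type: str, count: int) -> list:
    """Start `count` instances of the AMI and return their instance IDs.

    Needs boto3 plus configured AWS credentials and region.
    """
    import boto3  # imported here so launch_params() works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(**launch_params(ami_id, instance_type, count))
    return [instance["InstanceId"] for instance in response["Instances"]]


# Hypothetical values: the AMI ID is a placeholder, not a real image.
params = launch_params("ami-00000000", "m1.small", 10)
print(params)
```

Calling launch_render_nodes("ami-00000000", "m1.small", 10) with a real AMI ID would start ten render nodes; once booted, it is up to the image itself to find and join the farm.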

Now the basic AMI instance was running, but rendering on it was incredibly slow. It turned out that caching was not enough: all the render engine files had to be local to the instance. I rebuilt the AMI to include the Maya render engine and tested again. Rendering was much better!

But we still needed tighter integration with the render farm here in the office. For instance, the output directory for the images would ideally be the same for all nodes, whether local or remote. And there were some other remote directories that the DrQueue render queue manager needed in order for jobs to propagate information. There are several ways to do this. We could take the easy way and use NFS or even rsync at certain intervals, but this would take away a big advantage of RenderFarmer. By using the cluster filesystem, we retain the ability to add more servers later for read and write striping and fault tolerance. This gives us the power to increase our I/O to almost any level that we need, and to do so transparently. When we add servers in the back end, "underneath" the clients, all the clients get the I/O benefits without any configuration changes. Trying to add this later to an NFS / rsync system would be painful or impossible. The cluster filesystem is more difficult to use, but it's worth it.

After many hours of script fu, we had a cloud-based render node that processed jobs in real time. The question now is how fast the different types of instances you can deploy from the cloud really are. There are three types of instances we are interested in. The first is a "small" that has one CPU. Amazon says it is equivalent to a 1 GHz Xeon or Opteron, and it costs 10 cents ($0.10) per hour to run. The second is a "medium" instance that has dual Xeons and costs 20 cents per hour. The third is the Extra Large High CPU instance, which needs a 64-bit image and so has not been tested yet.

The first test was to render 100 frames of GiantStormAnim.mb, a sample scene that comes with Maya. I clicked two or three buttons to deploy 10 "small" nodes, and started the job when they were done booting. 41 minutes and 20 seconds later, it was done! Next I deployed 5 dual-CPU "medium" instances and ran it again. That run took about 31% less time, completing in just 28 minutes and 25 seconds. Good to know that the second setup has more CPU horsepower, despite having the same number of cores and the same $1 per hour cost (10 cents * 10 nodes or 20 cents * 5 nodes).
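If you want to check the arithmetic on those two runs, here is a short sketch that does it from the numbers above. The Run class and its field names are just for illustration, and the prorated cost assumes billing by the second, which is a simplification on my part; actual billing may round up to a larger unit.

```python
from dataclasses import dataclass


@dataclass
class Run:
    label: str
    nodes: int
    cores_per_node: int
    usd_per_node_hour: float
    minutes: int
    seconds: int

    @property
    def elapsed_s(self) -> int:
        return self.minutes * 60 + self.seconds

    @property
    def total_cores(self) -> int:
        return self.nodes * self.cores_per_node

    @property
    def fleet_usd_per_hour(self) -> float:
        return self.nodes * self.usd_per_node_hour

    @property
    def prorated_cost_usd(self) -> float:
        # Assumption: cost prorated by the second; real billing may round up.
        return self.fleet_usd_per_hour * self.elapsed_s / 3600


small = Run("10 x small", 10, 1, 0.10, 41, 20)
medium = Run("5 x medium", 5, 2, 0.20, 28, 25)

# Fraction of wall-clock time saved by the medium fleet, and the speedup ratio.
time_saved = 1 - medium.elapsed_s / small.elapsed_s
speedup = small.elapsed_s / medium.elapsed_s

for run in (small, medium):
    print(f"{run.label}: {run.total_cores} cores, "
          f"${run.fleet_usd_per_hour:.2f}/hour, {run.elapsed_s} s, "
          f"about ${run.prorated_cost_usd:.2f} prorated")
print(f"time saved: {time_saved:.1%}, speedup: {speedup:.2f}x")
```

Both fleets come out at 10 cores and $1.00 per hour, so the difference in elapsed time is the whole story: the medium fleet gets the same 100 frames done for fewer node-hours.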

Next up is the Extra Large High CPU instance, which has 8 dual Xeons for a whopping 16 cores per instance. But I can't try it until I create a 64-bit OS image. That's OK, we have come a long way and I'm very happy with the results. In effect, it is possible to deploy a temporary virtual render farm OF ANY SIZE that can be integrated 100% with RenderFarmer, just as if the nodes were local. The nodes can be all virtual, or the virtual nodes can just supplement any existing local nodes. From a user perspective (when launching or checking on jobs, say), one cannot tell that some nodes are local and some aren't. And when you aren't using your virtual nodes, you aren't paying for them! Also, just like with local PXE-booting nodes, it only takes moments to add nodes on the fly and have them join running jobs.

We are ready to help our customers get their own AMIs ready and working, and anyone interested in beta testing this exciting new technology should contact us as soon as possible. Any customers with a support agreement will receive free assistance and integration! Of course, if you are not yet a customer but have questions about this technology, we are happy to help you determine whether Amazon EC2 cloud integration will work for you.

 


Copyright 2010    RapidScale Clusters, LLC