A while back I vowed to dip my toes into the world of cloud computing without really understanding what exactly the cloud was. In the simplest terms I can put it, the cloud provides virtual building blocks to build a data center on the cheap. The key word that makes this all possible is “virtual”. Real servers running all sorts of virtualization hardware/software significantly increase the number of services a single server can offer, which drives down the cost. Since these services are virtual, software can then offer a level of speed and flexibility that’s simply not possible when dealing with real hardware. This is the primordial soup that enables the internet we know today. It’s the reason I can host my blog to the world without worrying about the cost: I share this IP address with at least 210 other domains. It’s also how Netflix can adapt the number of servers in its cluster throughout the day to match the diurnal patterns of its human end users. Cloud computing is what’s driving the current revolution on the internet, and I’m glad I was able to take on this project to understand how it works on the backend.
In this project my colleague and I presented an experimental method for analyzing scheduling policies for batch computing clusters. Basically this meant implementing benchmarks that stress test a scheduler and produce a metric for comparison. These benchmarks amounted to running a bunch of jobs that each reserve a number of nodes for a certain amount of time. The paper we pulled the parameters from called it the “Effective System Performance” (ESP) benchmark. My implementation of this was a Python script that submitted sleep jobs to the cluster with different reservation parameters. It also scaled the benchmark to the size of the cluster being tested so its performance was agnostic of the underlying hardware. We also implemented a few other benchmarks, but they did not end up producing any interesting data.
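To give a flavor of how a script like this works, here's a minimal sketch. The job mix below is illustrative and is not the actual ESP parameter set, and the `sbatch` flags assume a stock SLURM installation:

```python
import math

# Illustrative (fraction_of_cluster, runtime_seconds) job classes.
# The real ESP benchmark defines its own specific mix of job sizes
# and target runtimes; these values are placeholders.
JOB_MIX = [
    (0.125, 30),
    (0.25, 60),
    (0.5, 90),
    (1.0, 120),
]

def build_sbatch_commands(total_nodes):
    """Scale each job's node reservation to the cluster size and
    wrap a sleep of the target length in an sbatch submission."""
    commands = []
    for fraction, runtime in JOB_MIX:
        # Reserve at least one node, rounding the fraction up.
        nodes = max(1, math.ceil(fraction * total_nodes))
        commands.append(f"sbatch --nodes={nodes} --wrap='sleep {runtime}'")
    return commands
```

Because the node counts are derived from the cluster size rather than hardcoded, the same script stresses an 8-node test cluster and a much larger one in proportion, which is what makes the benchmark hardware-agnostic.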
The fun part was setting up a cloud to test our methods. Through the university we had access to the Massachusetts Open Cloud (MOC), which was running OpenStack to provide a public cloud. On this cloud we spun up 8 servers to create a small cluster. We then set up SLURM on the cluster, which is a batch job scheduler. In other words, with SLURM you submit work to a control daemon and it finds a suitable set of compute nodes to do that work. People who have worked in HPC or any kind of scientific computing should have experience with a system like this. Overall, it wasn’t too difficult to set up, but it did test my competency in navigating and administrating Linux servers. By the end we had the cluster under control with near-complete startup scripts that could immediately bring new nodes online. We also had systems in place to roll out changes to the entire cluster, which was essential for testing and development. My only wish would have been to have more time to play with it and make it closer to production ready.
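The heart of a setup like this is the `slurm.conf` shared by the controller and the compute nodes. A minimal sketch for an 8-server cluster might look like the following (the hostnames, CPU counts, and partition name are placeholders, not our actual configuration):

```
# One controller plus seven compute nodes; every machine reads
# an identical copy of this file.
ClusterName=testcluster
SlurmctldHost=controller
NodeName=node[1-7] CPUs=2 State=UNKNOWN
PartitionName=main Nodes=node[1-7] Default=YES MaxTime=INFINITE State=UP
```

Keeping this file identical across all nodes is exactly why having a way to roll out changes to the whole cluster at once mattered so much.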
I’ve included the full writeup and source code at the bottom of this post. It provides all the information you need to set up your own SLURM cluster. The paper gives a more in-depth view of the technologies involved, what we looked into, our methods, and our results. This was the first time I’ve used the IEEE LaTeX template for a paper, and the resulting output was just beautiful.