Adding Xen VTPM Support to OpenStack

Previously, I wrote about how to survive OpenStack development, so now I'll write about how I applied that knowledge. The goal of this project was to build infrastructure that allows an untrusting cloud tenant to verify for themselves the integrity of their VM and the hypervisor it runs on. If you've done any work in cyber security, you'll understand how valuable that kind of verifiable trust is, so this is a big step toward letting security-sensitive entities make use of cloud computing infrastructure.

Building on that motivation, there's this idea of Trusted Computing, which is a set of technologies developed to make sure that a computer "behaves in an expected way". For example, if there were a rootkit in the BIOS or kernel, we don't necessarily want to patch it up and carry on; we instead want to recognize that something unexpected happened and remove the system from the ring of trust before it can do any harm. In other words, we want to prevent this machine from accessing any higher-level security services.

In order to do this we make use of a piece of hardware known as a TPM, which has a set of registers known as PCRs and a hardware-bound endorsement key (EK) for signing the PCRs. This configuration allows us to trust that the PCR values are coming from a particular TPM and that they haven't been tampered with. To make this useful, we measure (hash) any critical piece of software before running it and "extend" it into a PCR, creating a hash chain. What you end up with is a set of registers that represent all the software that got you into user space. If you have a whitelist of PCR values, you can then immediately detect any anomalies. The caveat here is that this is only as good as your measurement and attestation infrastructure.
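To make the "extend" operation concrete, here's a minimal Python sketch of how a PCR accumulates a hash chain. It assumes a TPM 1.2-style 20-byte SHA-1 register; the exact measurement inputs below are illustrative, not the real boot chain:

```python
import hashlib

def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM-style extend: new PCR = H(old PCR || H(measurement))."""
    digest = hashlib.sha1(measurement).digest()
    return hashlib.sha1(pcr + digest).digest()

# Start from a zeroed 20-byte PCR and extend two fake "measurements".
pcr = b"\x00" * 20
for blob in (b"bootloader image", b"kernel image"):
    pcr = extend(pcr, blob)

# The final register value depends on every measurement and their order,
# so changing anything anywhere in the chain changes the PCR.
print(pcr.hex())
```

Because the old value is folded into each new hash, the register can only be appended to, never rewritten, which is what makes the whitelist comparison meaningful.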

What we did was extend the measurement infrastructure beyond the physical hardware and into VMs. What's tricky is that you can no longer rely on a single physical TPM when you have multiple operating systems trying to measure things concurrently. To get around this, hypervisors have implemented virtual TPMs (VTPMs), which both provide a unique TPM to each VM and expose the physical TPM's PCR values. That way, a tenant running in the VM's user space can first attest the integrity of their hypervisor with the physical TPM and then extend that root of trust up to their OS with the VTPM. Currently, this is a fairly cumbersome process, so we sought to automate all of it by allowing OpenStack to provision VTPM resources and integrating it with attestation infrastructure (Keylime) developed by this project's mentor.

The diagram above illustrates our stack which consists of Xen, OpenStack, and Keylime. In Xen, you have domain0 which you can think of as the “root” user for the hypervisor. It exposes hypervisor management through a native library known as LibXL. In order to support multiple virtualization layers, OpenStack uses LibVirt as a common abstraction layer. What’s missing is support for VTPMs in everything above the LibXL layer.

Starting with the LibVirt layer, we needed a way to declare our intent to spawn a VTPM, so I wrote a specification for a new device in the domain configuration file. This was a matter of parsing parameters out of the XML file into internal data structures. On the other end, I translated those internal configuration structures into the native Xen structures used to spawn the VM. Overall, it was a very straightforward patch to carry out, and my hope is to get it pushed upstream at some point.
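As a rough illustration of what that device specification looks like, here's a sketch of a domain XML fragment. The element and attribute names here are illustrative only; they are not necessarily what my patch or upstream LibVirt uses:

```xml
<domain type='xen'>
  <!-- ...name, memory, disks, etc... -->
  <devices>
    <!-- Illustrative only: attach a vTPM backed by the host's vtpmmgr -->
    <tpm model='xen-vtpm'>
      <backend uuid='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'
               path='/var/lib/xen/vtpm.img'/>
    </tpm>
  </devices>
</domain>
```

The parsing side turns elements like these into LibVirt's internal device structs, and the driver side maps those structs onto the native Xen (LibXL) configuration.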

The tricky bit was dealing with the OpenStack layer and its immensity. On top of spawning the VM, it also had to provision the VTPM resources beforehand. As far as Xen was concerned, you needed a UUID from the VTPMMGR and a small backing image to create a VTPM. To get the UUID, we exposed a REST API on the Keylime VM which acted as a proxy to the VTPMMGR. This was necessary because Domain0's kernel yields the physical TPM to the hypervisor by removing support for TPMs entirely. For the image, we simply had Nova create a zero-filled file out of /dev/zero. Putting this all together, Nova provisions these resources and generates an XML file for the VTPM, which goes into the patched LibVirt. After that, we generate the last XML file for the VM, which connects it to the VTPM.
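The two provisioning steps can be sketched in Python. This is a simplified stand-in for what Nova does, not the actual patch; in particular the Keylime endpoint path is hypothetical:

```python
import os
import urllib.request

def fetch_vtpm_uuid(keylime_url: str) -> str:
    """Ask the Keylime REST proxy for a fresh UUID from the vtpmmgr.
    (The /vtpm/uuid endpoint name is a guess, not Keylime's real API.)"""
    with urllib.request.urlopen(f"{keylime_url}/vtpm/uuid") as resp:
        return resp.read().decode().strip()

def create_backing_image(path: str, size_mb: int = 2) -> int:
    """Create the small zero-filled backing file, like dd if=/dev/zero."""
    with open(path, "wb") as f:
        f.write(b"\x00" * (size_mb * 1024 * 1024))
    return os.path.getsize(path)
```

With the UUID and the image path in hand, Nova can fill in the VTPM's domain XML and hand it to the patched LibVirt.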

I learned a lot from this project, and in some ways it may have changed my career trajectory. For one, I realized that I still love building backendy, infrastructure-type things. Unexpectedly, I also developed more of an appreciation for FOSS and its community. I'm now much more comfortable diving into these projects, reaching out for help, and contributing patches than I ever was before. Bugs that I would quietly complain about to myself are now submitted as bug reports, and I make an honest effort to patch them myself. Most importantly though, this project and the class that came with it opened my eyes to the exciting work going on in cloud computing. While I missed being on the ground floor of this work, I believe we're on the cusp of a Cambrian explosion of sorts in this field.

README and Nova Patch

LibVirt Patch

Setting up a Cloud on the Cloud with SLURM

A while back I vowed to dip my feet into the world of cloud computing without really understanding what exactly the cloud was. In the simplest terms I can put it, the cloud provides virtual building blocks to build a data center on the cheap. The key word that makes this all possible is "virtual". Real servers running all sorts of virtualization hardware/software can significantly increase the number of services a single server offers, which drives down the cost. Since these services are virtual, software can offer a level of speed and flexibility that's simply not possible when dealing with real hardware. This is the primordial soup that enables the internet we know today. It's the reason I can host my blog to the world without worrying about the cost: I share this IP address with at least 210 other domains. It's also how Netflix can adapt the number of servers in its cluster throughout the day to match the diurnal patterns of its human end users. Cloud computing is what's driving the current revolution on the internet, and I'm glad I was able to take on this project to understand how it works on the backend.

In this project my colleague and I presented an experimental method for analyzing scheduling policies for batch computing clusters. Basically, this meant implementing benchmarks that stress test a scheduler and produce a metric for comparison. These benchmarks amounted to running a bunch of jobs, each reserving a number of nodes for a certain amount of time. The paper we pulled the parameters from called it the "Effective System Performance" (ESP) benchmark. My implementation was a Python script that submitted sleep jobs to the cluster with different reservation parameters. It also scaled the benchmark to the size of the cluster being tested, so its performance was agnostic of the underlying hardware. We also implemented a few other benchmarks, but they did not end up producing any interesting data.
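The core of that script can be sketched like this. The job mix fractions below are illustrative placeholders, not the actual ESP parameters from the paper:

```python
import random
import subprocess

# Each entry is (fraction of cluster a job reserves, how many such jobs).
# These numbers are stand-ins for the ESP paper's fixed job mix.
JOB_MIX = [(0.03125, 8), (0.0625, 4), (0.125, 2), (0.25, 1)]

def build_jobs(total_nodes: int, runtime_s: int = 60):
    """Scale the job mix to the cluster size so the benchmark is
    agnostic of the underlying hardware, then randomize submit order."""
    sizes = []
    for frac, count in JOB_MIX:
        nodes = max(1, int(frac * total_nodes))
        sizes += [nodes] * count
    random.shuffle(sizes)
    return [
        ["sbatch", f"--nodes={n}", f"--wrap=sleep {runtime_s}"]
        for n in sizes
    ]

def submit(jobs):
    """Hand each sleep job to SLURM via sbatch."""
    for cmd in jobs:
        subprocess.run(cmd, check=True)
```

The metric of interest is then how long the scheduler takes to drain the whole workload compared to the theoretical minimum.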

The fun part was setting up a cloud to test our methods. Through the university we had access to the Massachusetts Open Cloud (MOC), which was running OpenStack to provide a public cloud. On this cloud we spun up 8 servers to create a small cluster. We then set up SLURM, a batch job scheduler, on the cluster. In other words, with SLURM you can submit work to a control daemon and it finds a suitable set of worker nodes to do that work. People who have worked in HPC or any kind of scientific computing will have experience with a system like this. Overall, it wasn't too difficult to set up, but it did test my competency in navigating and administering Linux servers. By the end we had the cluster under control, with near-complete startup scripts that could immediately bring new nodes online. We also had systems in place to roll out changes to the entire cluster, which was essential for test and development. My only wish would have been more time to play with it and make it closer to production ready.

I've included the full writeup and source code at the bottom of this post. It provides all the information you need to set up your own SLURM cluster. The paper gives a more in-depth view of the technologies involved, what we looked into, our methods, and our results. This was the first time I used the IEEE LaTeX template for a paper, and the resulting output was just beautiful.

presentation slides

slurm writeup


Buffet Tracker and Winning My First Hackathon

A few weeks ago I was skimming one of the bulletin boards at the university for any interesting lectures, and a poster for "Hacking Eating Tracking" caught my attention. The idea of attending a hackathon has always intrigued me, but I was always too intimidated to go. As an introvert, it was the anxieties of forming a good team, not being able to do anything of value, and not having the requisite skills that built up the barrier to entry. Still, after reading through the website and mulling it over for a few days, I decided to apply. My motivation came from my recent attentiveness to healthy living, the emphasis on hardware skills for this hackathon, and some wish to combine the two. Having now made it through the hackathon, I can say that my earlier anxieties just didn't make sense. In a way, my experience has been about making those kinds of anxieties easy to conquer. Think of a hackathon as a place to meet interesting people, build things that aren't perfect, and learn new skills. Just don't be afraid to go to a hackathon for the wrong reasons.

Prior to the hackathon, there was a Slack set up for people to meet and discuss projects and logistics. I commented on a few ideas and proposed my own for tracking grocery ingress/egress by scanning bar codes, but nothing really came of it. It wasn't until I read Anandh and Arjun's proposal to continuously track the weight of serving vessels in a buffet-style restaurant that I saw an interesting project to work on. It was possible to carry out, the solution was non-obtrusive to the participants being tracked, and the data had a lot of potential. After making contact with them, all that was left was to wait until September 18th rolled around.

The event started with a kickoff talk where I first met all the other hackers. What surprised me was the breadth of backgrounds everyone came from. In terms of age, I found everyone from undergrad underclassmen up to working professionals, though admittedly the skew was towards university students. More unexpected was everyone's point of origin. There was an entire bus of hackers from McMaster in Canada, a handful of people flew in from western Europe, and others came from all around the northeastern states. Technical backgrounds included health, computer hardware/software, and data science, but the overwhelming majority were computer science students. I honestly felt like the minority, being a computer generalist working in Boston.

Team forming took place the following morning, where I had my first glimpse at all the projects being attempted. We weren't able to permanently recruit anyone to join our team, which was a bit disappointing at first, but it worked in our favor to have much lower synchronization costs. After the teams coalesced, we started to get some work done and fully develop our idea for "Buffet Tracker". The premise was to track eating behavior in a buffet-style restaurant because it's a case where people are financially incentivized to overconsume. To do this, we wanted to link a change in weight of a serving vessel to a specific plate when a person goes to take food. The result would be a dataset of all the food procured from the restaurant's serving area and the proportions of food on plates throughout the day. With that kind of data you can start looking into research questions such as "What foods occur together with the highest probability?", "What restaurant layouts result in the least (or most) consumption?", "What is the nutritional content of the average plate?", or "What time of day are people most likely to eat a certain type of food?". Too many of the other projects focused on tracking at the individual level, which tends to require some voluntary action or increased overhead/complexity. Our project's strength was avoiding that completely by trading off granularity for a completely invisible solution to the problem.

The Buffet Tracker Proof of Concept

In order to get this all to actually work, we settled on tracking plates via RFID tags hot-glued to them. At each serving station there would be a microcontroller responsible for measuring a change in weight and an RFID reader for detecting the plate. When it detects a procurement event, it compiles the plate ID and the change in weight and reports it to a local computer, which parses the data and adds it to the database. A web server then presents a front end for querying the database and displaying graphs or tables. My part of this project was to design the hardware to do all this. All we got from the hackathon was an Arduino, a force-sensitive resistor, and a USB cable, so I had to rely heavily on my stash of parts and tools, which wasn't too bad. We would've been doomed, though, had I not had a bunch of RFID tags and a reader. Since we only had a day to do this, I settled on using the Arduino and the force-sensitive resistor to measure the weight. What I didn't know was how bad it was at doing that: the granularity was basically enough to tell whether there was an object on top of it or not. It also required that force be applied evenly over the surface area of a small disk. To solve this, we wrapped it in a rubber band and hot-glued a plate over it to serve as a demo serving vessel. The food also needed to keep its centroid about the middle of the plate. Since the sensor was a passive resistive type, I created a voltage divider to sample the weight and calibrated it using some melon and a food scale I had lying around. Unfortunately, I found that the voltage/weight relationship was also nonlinear and very imprecise, but I went ahead and made a linear approximation of it anyway just to have something to present. To clean up the noise from the data, I took the mean of eight samples (a power of 2 for quick division, hehe) and reported that.
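The sampling and calibration logic is simple enough to sketch. The firmware itself was Arduino C, but the math is the same in Python; the calibration coefficients below are invented, not the ones from the melon-and-food-scale run:

```python
def average8(samples):
    """Mean of eight ADC readings; a right shift by 3 is the
    power-of-two shortcut for dividing the sum by 8."""
    assert len(samples) == 8
    return sum(samples) >> 3

# Linear approximation weight = m * adc + b from a two-point calibration.
# These coefficients are placeholders, not the actual calibration values.
M, B = 0.85, -40.0

def adc_to_grams(adc_value: int) -> float:
    """Map an averaged ADC reading to an approximate weight in grams."""
    return M * adc_value + B
```

Given how nonlinear the sensor really was, the linear fit was only good enough to show a plausible delta during the demo, which was all we needed.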
For communications, I used the onboard USB UART chip to talk to the host computer, which is not the most elegant or scalable solution, but it's what I had available. This created a problem for the ID-12 RFID reader I had, which depended on a UART of its own to report the tag ID, but I found a nice Arduino library that bit-banged a UART at the correct baud rate. For simplicity, I used the RFID tag to frame the sample event: I sampled the weight when the plate was first detected and again when the plate was removed. As a hack to figure out when the customer takes away their plate, I periodically toggled the reset pin on the ID-12 reader to trigger a new read; the moment it failed to read, the plate had been removed. With that, I had all the necessary pieces to link a plate to a change in weight. All that was left was to craft a packet and dump it over the bus, where a Python script picks it up and pushes it to the database on the cloud.
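On the host side, the parsing step looks roughly like this. The packet layout here is a guess at the original format (tag ID plus the two framed weight samples), not the exact wire protocol we used:

```python
def parse_event(line: str) -> dict:
    """Parse one procurement event framed as 'TAG_ID,WEIGHT_BEFORE,WEIGHT_AFTER'
    (comma-separated layout is assumed for illustration)."""
    tag, before, after = line.strip().split(",")
    return {
        "plate": tag,
        # How much food left the serving vessel while this plate was present.
        "delta_g": float(before) - float(after),
    }
```

Each parsed record then gets inserted into the cloud database, which is what the web front end queries for its graphs and tables.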

I managed to finish up the hardware sometime around 2AM and decided to go home for some sleep. The next morning we did integration and documentation, which was admittedly very rushed, but I guess that's the spirit of the hackathon. To my surprise, everything just worked after integration thanks to our well-defined hardware/software interface. We replaced their test unit with the unit I had completed the night before, and the entire stack worked perfectly. It was incredibly satisfying to see the tables and graphs update immediately after we transferred food from the serving vessel to our demo plate. After that, we developed our presentation materials and pitch. Of course, when it came time to demo the product we had technical issues moving from the development environment to production: the wi-fi in the room failed and we couldn't connect to the database on the cloud. In hindsight we probably should have just run it all locally, but oh well; the judges ended up loving the idea. It met the requirements of the hackathon really well in that our solution provided clean quantitative data, was representative of a population, and was completely non-obtrusive.

By the time we'd presented to the audience and it was time to announce the winners, we were teeming with anticipation as they kept announcing teams that weren't ours. I really didn't expect us to take first place at all, given that I had no idea what I was doing at a hackathon, but I was obviously thrilled about the outcome. We ended up with $1500 split three ways, which was a nice little bonus. Would I do it again? Probably not. It's pretty exhausting and stressful, and the per-hour value of the prize money was pretty low. I'm just glad I can finally cross this off the bucket list and that I ended up with an overall positive experience. I met some great people (big thanks to Anandh and Arjun!), developed a novel idea, won some cash, and got a neat story out of it. Hackathons are pretty cool, but not for me at this point in my life.