Rsync in a Week

This past semester I had the luxury of taking a course in Advanced Data Structures part time as I slowly work on my master’s. It was probably one of the most difficult classes I’ve taken so far. Given my background focusing on the lower half of the technical stack and having never taken a course in algorithms, it was a rough start. We just covered so much material and I spent a lot of time just trying to reabsorb it all. For those in the know, we went through CLRS cover to cover in a single semester. Most of all though, it was a test of my discipline as it was the first course I’ve taken outside of undergrad while working full time. I learned just as much about time management as I did about data structures. Overall, it was a rewarding experience and I feel that I learned many invaluable skills from this course and I’m happy to have taken it.

My favorite part of the course was the final project. We were tasked with implementing what’s known as a File Reconciler which takes a file on one computer and recreates its changes onto a similar file stored elsewhere. The idea is that this should take a minimal amount of bandwidth compared to simply sending over the file in its entirety by only sending differences. At first it was a daunting challenge, but with a firm understanding of running times and data structures the problem suddenly became very tractable. I took on the project lead by delegating work, architecting the software, and most importantly implementing the core algorithm behind our program. The latter being self-imposed after I discovered a fundamental flaw in my peers’ approach during a code review.

I modeled our program against Rsync which was a utility written in the dial-up days to handle synchronization of large code bases over high latency low bandwidth networks. It basically works by splitting a reference file into blocks that are then matched on the file we are trying to reconcile. Then we come up with a list of steps that either provide literal data or a reference to a matched block. The brilliant part of the algorithm is how it matches blocks. For each block, a pair of independent hashes(MD5 and Adler-32) are generated that uniquely describe it. The probability of two random blocks having matching signatures is on the order of 10^(-100). MD5 was chosen because it’s a strong hash that generates a unique value that isn’t likely to collide in the same file. More importantly, Adler-32 was chosen for its rolling checksum variation. It has the wonderful property of constant time computation on single byte offsets and incorporates byte order into its entropy. This allows us to slide through a file and compute each and every possible block and search for a match. On a match of the weak, we can then compute the strong and verify. With the two hashes, the algorithm is both computationally cheap while being robust against hash collisions.

By far, the most interesting exercise of the project for me was the chance to take a concept from an academic paper and turn it into code that actually works. Further, due to time constraints, I had to do this in about a week using only my few hours after work. Usually I read papers for the high level theoretical understanding, but rarely is it ever the even deeper understanding of translating academic jargon into a usable object. After understanding how it all worked, I then had to begin the greater engineering challenge of deciding what parts of the algorithm to implement. This was a question of what is an essential component, what gave us the most performance for the effort, and what I could conceivably implement given my intellectual resources. Most notably, the thesis provided many suggestions for using parallelism to improve performance, I felt that it would add too many risk factors and development effort that I could not spare. It’s also important to realize that while performance is great, the project requirements (50MB of bandwidth) were trivially met by even the most naive implementation of Rsync.

To present my work I prepared a few slides to illustrate what Rsync does visually, I am by no means a graphic artist but it does the job. My goal was to highlight the data structures which data structures were used, how they were used, and why.  These were included in a video created by my group for the class’s critique. I apologize in advance for the awkward choppy audio. This is due to me having to heavily edit my fellow group members’ audio tracks to fit within the 10 minute presentation window.

Github Project Link

This slideshow requires JavaScript.


Leave a Reply

Your email address will not be published.