High Performance Bitcoin Mining with CUDA

As I wrote about previously, I created a GPU Bitcoin miner for my high performance computing(HPC) project. Having now completed it, I can safely say that I couldn’t have picked a better project for this class. It’s definitely been one of my favorites because of how “hot” Bitcoin is in the development community. With so much interesting content being generated around Bitcoin and after working with it on this level I can appreciate how revolutionary it is to the field of computer science. To me, the revolution is not in “digital currency” since Visa and Mastercard pretty much figured that out a long time ago. The revolutionary idea in Bitcoin is a decentralized protocol for consensus. Few things on the internet really run in a decentralized way like Bitcoin does and does well. I don’t think it will last in its current form, but rather its ideas will live on and be reborn in some novel application.

Coming into HPC, I thought it was about doing things in parallel on GPUs. In reality, HPC is really the study of minimizing those constant factors that magically disappear in the asymptotic notation of algorithms. It’s studying the features of the hardware and optimizing the code to maximize utilization of every bit of hardware available to you. The amount of engineering trade offs involved with writing high performance code is an interesting topic by itself. It’s an iterative process of optimizing and seeing what does and does not work. There’s no magic bullet that can generate code optimized for this level of performance, to get it requires thoughtful coding with respect to the underlying hardware.

Back to Bitcoin though. On a high level, Bitcoin mining is simply brute forcing the SHA-256d hash function of the block header such that the resulting hash is less than a certain number known as the difficulty. As a miner, you change a 4 byte nonce value as well as other fields to generate work for brute forcing. What’s interesting to note is that this problem is trivially parallelizable in that each unique work item (A SHA-256d hash computation of a given nonce) is independent of all other work items. The memory storage and bandwidth requirements are also negligible in that the block header is a paltry 80 bytes and the output is simply a 4 byte nonce value. This puts the problem squarely in the realm of computationally bound problems. The computation itself is also just a series of bit manipulations and integer arithmetic. This is also why I hate how the media refers to mining as “Solving a difficult math problem” because it really isn’t. The difficult part is that there are so many other people on the network with much better resources competing for a new block every 10 minutes.

To map this problem to a GPU then, I created a new thread for the 2^32 nonce values and have each one compute the SHA-256d hash. To make a long story short I started by writing a sequential miner for a CPU with a hash rate of 418Kh/s to a GPU miner that mined at 46.9Mh/s. That is an over 100x improvement just from being able to use the parallel hardware of a GPU. The details on this can be found in my Github repository which includes a detailed write up of the various optimizations performed.