Bitcoin had a problem. A problem so big that the community came together to issue a bounty of 10BTC and 200.2LTC, or an excess of $10,000 at today's exchange rate. In this article, we'll look at the problem they encountered and how we fixed it.
On Mac OS X, LevelDB can become corrupt and fail to re-open the database. This issue can prevent users from storing the block chain and participating in the Bitcoin network. Users affected by the bug will typically see an error to the effect of Corruption: block checksum mismatch or missing start of fragmented record(2). These log messages both stem from the same underlying problem within LevelDB.
For efficiency, LevelDB always performs I/O in a sequential fashion: every write generated by LevelDB appends data to a disk. Consequently, data is never overwritten, only recreated in another file. In a straightforward implementation, this mechanism could be accomplished using the write system call sequentially.
But the write call can be costly when under contention by multiple threads, so the LevelDB developers implemented a user-space buffering mechanism using memory-mapped I/O. This buffering mechanism maps regions of fixed size such that the number of pages mapped in any one file will never exceed a particular constant. At any given time, the writing process is filling whichever region is mapped. It then unmaps the region before moving onto the next.
Of course, it's possible for a user's write to straddle the boundary between the current region and the next region to be mapped. When such an overlap occurs, the writer places whatever it can from the user's write into the current region, unmaps the current region, maps the next region and finishes placing the data.
On Linux, this sequence of operations is safe because the virtual memory subsystem will keep these pages around until they can be written to disk. Programmers refer to these pages with unwritten data as being "dirty". As far as we are aware, POSIX imposes no requirements on the handling of dirty pages after an unmap. It's entirely possible for a POSIX-conformant system to simply discard any dirty pages that the user unmapped.
And therein lies the problem. On Mac OS X, the dirty pages seem to be discarded without being flushed to disk.
If the munmap call discards the data at the tail of a memory-mapped region, there will be zero-filled holes in the output file. For SST files, these zero-filled holes will (likely) cause block checksums to not match, triggering the first error message.
Manifest and log files, to which the second error above refers, use a slightly different on-disk format to aid in error recovery. Each write is represented by one or more "fragments" such that no fragment extends across a 32KB boundary. Whenever a write would extend across a 32KB boundary, it is broken up into at least two consecutive fragments, one on each side of the boundary. Any data discarded is most likely to be immediately prior to the 32KB boundary. When reading the log or manifest, the reader will miss the first fragment (it's all zeroes), but will see the second fragment, and complain that the first fragment is missing.
The short term fix is to always sync data to disk before calling munmap. Our patch does this for OS X, but the fix could easily be applied on other systems, too. We're already in talks with the upstream LevelDB authors for a more portable solution to avoid further errors of this kind.
Since the bounty was posted, people have come out of the woodwork to propose various non-fixes to this problem. In this section, we explain why these proposals do not explain the behavior and why they do not fix it.
That may be, but it doesn't explain the problematic behavior. Even an fsync call that truly does write outstanding data on a file descriptor to disk cannot fix this problem as the interaction between mmap and fsync is not well defined. Using an ioctl will not guarantee that dirty, unmapped pages will be written to disk.
Some people conjecture that the order in which memory barriers are defined can impact the bug. These memory barriers have absolutely no bearing on how sequential files are written because LevelDB uses the platform threading primitives for synchronization. Changing the definition order of these memory barriers will not affect the platform's primitives.
Some people conjecture that Time Machine is responsible for these hundreds of NULL bytes at the end of page-aligned blocks. This explanation is too outlandish to consider -- this kind of an error with the Time Machine would have manifested itself in other programs, not just with LevelDB.
Some people suggest using the sync=true option for every write. Because fsync is called after the unmap, this cannot fix the issue.
I first learned of this bug many months back in a manner unrelated to Bitcoin. Within HyperDex, we use LevelDB as a storage backend, and a few users have reported to us that they were experiencing corruption within LevelDB on OS X machines. I've been trying to find a live sample of this corruption for a few months. Because OS X is a development, but not production, platform for our users, it was hard to isolate a live test case.
When the Bitcoin bounty turned up, I moved into bounty-hunter mode, partially because of the bounty, but mostly because I was hoping we could fix the issue for HyperDex users as well.
Early this morning, I contacted Gavin Andresen of the Bitcoin project to ask for a live sample and went to bed. When I woke, he had provided a sample and within three hours I sent the fix off to the LevelDB mailing list.
Our motivation for finding this bug stemmed primarily from our desire to fix LevelDB for HyperDex. Fixing problems affecting the Bitcoin and Node.js communities is just a bonus.
We've identified the problem, and have proposed a fix that will immediately address the needs of the HyperDex and Bitcoin communities. So where are we going from here?
We've got a fix for Google's LevelDB, which the Bitcoin community can immediately merge into their LevelDB code.
We'll be publishing an updated HyperLevelDB containing a modification of our proposed fix.
We look forward to working with the Bitcoin and LevelDB communities to integrate our proposed patch. In the future, we're available to address future issues reported within LevelDB.
There are Bitcoin and Litecoin test builds available. We're looking for people who were affected by the bug to confirm that our applied patch does indeed fix the corruption problems. Please follow the instructions in that thread to submit feedback.