LevelDB is a very popular data store that has been adopted in many domains far beyond its original use in Google Chrome. It's no surprise that many startups and large companies are using LevelDB: LevelDB internally stores data in contiguous chunks, leverages the performance of sequential disk I/O, and takes advantage of high performance buffer management in modern operating systems. This organization plays well with modern memory hierarchies and avoids conflicting with operating system decisions to yield high performance.
While stock LevelDB is an excellent foundation, our experience with HyperDex identified several opportunities for further performance improvements. This article describes the changes we've made to LevelDB meet HyperDex clients' demanding needs.
Specifically, these changes improve LevelDB in two ways:
Increased parallelism through careful lock management
Higher throughput through an improved compaction process
This article describes these changes in detail to give an intuition behind why the resulting fork, called HyperLevelDB, is so much faster.
Stock LevelDB internally protects all state with a single mutex, ensuring that multiple threads may safely write to the database simultaneously. Analogous to the global interpreter lock in Python, this mutex simplifies single-threaded execution at the expense of performance in a concurrent environment. On a modern processor with multiple cores, achieving high performance requires taking advantage of all cores. To this end, our first change was to crack this lock.
Of course, this has to be done carefully to avoid potential data races. Let's briefly examine the concurrency requirements of stock LevelDB before describing the new organization under HyperLevelDB.
Stock LevelDB provides internal synchronization to allow multiple threads to access the data store in parallel. Internally, writes proceed in FIFO order: the writer at the head of the FIFO is the only thread that may write into the database, and all others that are enqueued must wait to write. Every write inserts data into a disk-backed log and memory-backed "memtable". The log is an append-only file that provides persistence for crash recovery, while the memtable is a sorted skip list that enables reads to efficiently lookup data stored in the log. Writers must write data to both the log and the memtable before signalling the next writer in the FIFO.
To overcome the queuing effect, stock LevelDB batches writes that encompass a prefix of the FIFO, writing the data for many writers in one batch. This batching improves the performance, especially for synchronous writes, but does not fundamentally alter the sequential nature of writes. It also incurs a cost, as the thread doing the batching must necessarily read, and possibly copy, the data being written by the other threads. Only the head of the FIFO is doing any real work, and the other writers are blocked by the kernel waiting for a signal from the head writer.
HyperLevelDB improves parallelism between writers by allowing all writer threads to independently insert their own writes into the log and memtable, using synchronization solely for maintaining the order of writes. To understand how HyperLevelDB can do this, it's important to understand the way reads and writes interact within LevelDB.
Internally, every write within LevelDB is assigned a unique sequence number, generated by an ever-increasing counter. These sequence numbers enable readers to select data from a specific point in time, denoted by a sequence number for that point in time, and ignore any writes that happened logically after the sequence number. A LevelDB Snapshot captures the latest sequence number at the time the snapshot was created, enabling the snapshot to iterate or read the data without seeing any subsequent writes. Any read not performed on a snapshot will implicitly use the latest sequence number. These sequence numbers are the key to HyperLevelDB's parallel optimization, because they provide a way to control the data that is visible to readers.
HyperLevelDB maintains the invariant that the sequence number chosen for a read will never rely upon writes that have yet to complete. The writers update the log/memtable in any order, without regard for their respective sequence numbers. After writing, each writer uses synchronization to update the global sequence number such that the writes appear to have been performed in FIFO order, which maintains the required ordering invariant. Each writer thread is responsible for independently updating the log and memtable -- which constitutes the bulk of the work -- and only uses synchronization to increment the sequence number used for future reads.
To make the most of this new-found parallelism, HyperLevelDB modifies LevelDB's log and memtable implementation to safely permit concurrent insertions.
Thread-safe memtable: The memtable is implemented as a concurrent skip list. Google LevelDB's skip list allows lock-free reads, and uses external synchronization for writes. Reads and writes share the same first step: traverse the skiplist to find the desired key in the list. Our thread-safe implementation does this traversal without holding any locks, and then validates the traversal, possibly traversing further, after acquiring the lock. This moves the most time-consuming part of inserting into the skip list outside of the lock, which allows all threads to perform it in parallel without sacrificing correctness. In our measurement using four writer threads, our improved skip list has a 99th percentile insert latency of 1.7us, while the default skip list has a 99th percentile insert latency of 2.6us.
Thread-safe log: The append-only log is a file with a user-space buffer. Google's implementation always appends to the end of the file, maintaining a sliding window with the user-space buffer, using mmap and munmap on the file as necessary. HyperLevelDB's implementation also maintains the user-space buffer with mmap and munmap, but allows concurrent threads to atomically append to the file. The log synchronizes access to the underlying buffers, and assigns non-overlapping regions to different writers which allows the writers to copy their data in parallel.
LevelDB takes its name from the tiered storage structure it employs. Data is stored within levels, where each level contains more data than those below it. Within each level, data is sorted and stored in files called a sorted-string tables, or SSTs, where every SST is sorted, and SSTs are non-overlapping, so each key will reside in at most one SST. Writes are merged into the bottom of the tree, and a process called compaction moves data upwards through the levels, maintaining the sorting invariants within each level.
The incremental compaction used within LevelDB represents a trade-off between two extremes. Without any compaction, data written to the system remains unsorted, requiring linear access to all data to perform a key lookup. Alternatively, compaction could be taken to the extreme wherein it sorts everything, generating ever-larger SSTs, and rewriting the entire data on disk with every compaction. Compaction in LevelDB ensures that data remains sorted, increasing read efficiency, while performing incremental maintenance to limit the amount of data that a compaction will write.
The downside to any form of compaction is that it inherently suffers from a form of write amplification where the amount of disk I/O is a multiple of the amount of data written to LevelDB because data will be rewritten at each level of the tree. The exponential difference in size between levels all but ensures that an SST at Level-X will overlap with multiple SSTs at Level-X+1. In our experience, the default LevelDB implementation will rewrite three bytes from Level-X+1 for every byte written from Level-X. Only one quarter of the I/O generated by LevelDB is doing productive work; the rest is wasted I/O and a target for optimization.
The many approaches one could take to reducing this wasted I/O include:
Tuning the compaction-related constants using empirical analysis to reduce write amplification.
Leave data unsorted in lower levels to avoid linear merge costs at the expense of read latency.
Pick compaction targets (SSTs) that explicitly minimize write amplification.
Of these three approaches, only one directly tackles the problem of write amplification. The first two techniques require manual, hands-on tuning to find a performance sweet spot for a given workload. We found such tuning to be fragile and unsatisfying because it was sensitive to the workload. HyperLevelDB follows the third approach and provides a principled redesign of the compaction process.
HyperLevelDB's altered compaction algorithm explicitly chooses a set of SSTs which result in minimal write amplification between two levels. A background compaction thread monitors the size of each level, and compacts using the optimal set of SSTs for compaction. A second background thread, which triggers when the first thread becomes overwhelmed, performs global optimization to select compaction targets with low write amplification, without regard for the size of the individual levels. Typically, this second thread will move files from the lower levels without any write amplification. For the upper levels, the second thread will preemptively compact data often enough to keep them under the size-limit heuristics used by the first thread, ensuring that the first background thread will not be tied up in compaction at the upper levels.
HyperLevelDB improves LevelDB's performance through two significant redesigns of its architecture. The resulting fork is a factor of four faster than stock LevelDB.
If you're using LevelDB in production, feel free to check out HyperLevelDB (Get the code!). It's licensed under the same terms as LevelDB, maintains API compatibility, and offers improved performance.
If you've rolled a custom distributed system on top of LevelDB, you' should check out HyperDex, which provides the benefits of LevelDB in a consistent, fault-tolerant distributed data store.
Log Structured Merge Trees: LevelDB is a form of a Log Structured Merge Tree.
HyperDex: Our work on HyperLevelDB was motivated by performance requirements passed to us by several customers and users of HyperDex.