New open source key-value store: sparkey
The project is called sparkey and it is a fairly simple key-value store, like many others out there. It’s inspired by CDB and accidentally similar to bitcask. Sparkey is implemented in C, but we also have python bindings for it as well as a port in Java. Currently only the C implementation is released but we hope to release the others as well in the near future.
Visit our github page for Sparkey at https://github.com/spotify/sparkey for more information.
Sparkey has been used internally for over a year on one of our core services – the content metadata lookup service – which gets heavy traffic both from our end users and from our other services and recently it’s also begun being used by some of our newer services.
First, a bit of history
For some of our services we’ve had use cases where we needed a storage system for mostly static data (periodically updated in bulk) with fast random access lookups. We relied a lot on CDB and Tokyo Cabinet depending on which fit best for the use case.
CDB is really nice for many reasons. It has very fast lookups, a small file size overhead and low creation time. The only reason we moved away from using it was the 4 GB file size limit which we grew out of for most of our use cases.
Tokyo Cabinet is also nice for many reasons. It has almost as fast lookups, supports both reads and writes, has no real file limit and supports compression of values. On the downside it only supports one reader instance per process, does a lot of locking and synchronization, has a fairly high overhead, and creation time is high once the file doesn’t fit in memory anymore.
As a fun experiment, we started hacking on something new that would be more suitable for our usecases. We liked the simplicity of CDB so that was our inspirational starting point. The first thing we needed beyond CDB was obviously a higher file size limit, or preferably no strong limit at all. We also wanted a small overhead, since we knew some our usecases had hundreds of millions of very small key-value pairs. Another problem we were facing was slow creation times for our Tokyo cabinet files, so we decided to make sure we could create these files quickly, by using a very simple append-only log format. Additionally it would be nice to support some kind of compression. Since our initial usecase had a lot of small objects, per-object compression would not be optimal so we decided on doing block-level compression, which means compressing multiple values together. The block size is tuneable, and lets us make the trade-off between compression efficiency and random access lookup speed. We also decided to support appending to an existing file unlike CDB, which in hindsight turned out to be something we used for a while but stopped using due to complexity. Immutable files are much easier to reason about and debug.
Shortly after having completed the basic implementation of these requirements, we started building a new fast metadata lookup service and decided to let it use sparkey.
How it works
A Sparkey key-value store consists of two files, a log file and an index file which is just a hash table on disk. You start by filling the log file with content, and since that is just file appends, it’s pretty fast even with a somewhat large data size. When the log file is done, you create the index file, which just means iterating over the log file again and populating the hash table in memory. When the hash table is complete it gets flushed to disk. This limits the hash table creation to your available RAM if you want it to be fast enough. This memory usage however unrelated to the size of the keys and values, only the number of objects matter. Expect up to 20.8 bytes per object in the hash table which means you can create about 400 million objects with 8 GB of RAM.
Once the hash table is completed, you can start doing random lookups. Each lookup on a cold page cache results in two disk seeks, once in the hash table and once in the log. Locking the hash table in memory reduces this to one disk seek, and depending on the distribution of lookups, you would likely keep the most important parts of the log in page cache which would help avoiding disk seeks for most lookups.
For more details on how it works, see https://github.com/spotify/sparkey#file-format-details
Performance is always a sensitive subject. There are usually issues with keeping the test unbiased, deciding what to measure and how to measure it, and what machine to test it. With that in mind, please take the values presented here with a grain of salt, or better yet, look at the benchmark code and run it yourself. Absolute values have very little meaning outside of a context where it can be compared with other data.
In uncompressed mode (which is the fastest mode) with 100 million entries we benchmarked it to having an insertion throughput of 1.2 million inserts per second and 500 thousand random lookups per second. The size of this data set was 4 GB. For more complete benchmark data, please visit https://github.com/spotify/sparkey#performance
- CDB: http://cr.yp.to/cdb.html
- Tokyo Cabinet: http://fallabs.com/tokyocabinet/
- Bitcask: https://github.com/basho/bitcask