I keep a little black book with ideas for businesses or projects, and sometimes also technology I want to learn more about. One of those things was the inner workings of a key-value store. I wanted to know how to allow for virtually infinite growth of such a store without sacrificing read or write speed, and how best to organize the data on disk.

I’ve read up on this and have used techniques such as SSTables in some solutions (at work) before, but I wanted to take a more generic approach to get a better picture of the other aspects too. So about two years ago I started working on an embedded key-value store for .NET, which I named TeaKV. Fun fact: I initially called it just TKV (the key-value store), and only later expanded the T to Tea - not that it matters though.

At a high level, my key-value store is embedded, i.e. hosted entirely inside the process that uses it. Of course, that process could also be part of a cluster of peer processes using consensus to offer a highly available and redundant key-value store service. Sure, such services already exist out there, but the same applies to embedded key-value stores.

TeaKV keeps an in-memory key-value store where all writes are applied first, and periodically flushes the in-memory store to files on disk. These files are organized as SSTables (“Sorted String Tables”): the data inside is sorted by key, which makes looking up a value very efficient.
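To make the write-then-flush idea concrete, here is a minimal sketch in Python. It is not TeaKV's actual .NET API: the names `MemTable`, `SSTable` and `flush`, and the one-JSON-entry-per-line file format, are assumptions made purely for illustration. For brevity, the reader loads the whole file into memory, which a real store would presumably avoid by keeping only an index.

```python
import bisect
import json
import os
import tempfile


class MemTable:
    """In-memory store: every write lands here first (hypothetical name)."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def flush(self, path):
        """Write all entries to `path` sorted by key, then clear the table."""
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self._data):
                f.write(json.dumps([key, self._data[key]]) + "\n")
        self._data.clear()


class SSTable:
    """Read-only view over a flushed file (hypothetical name and format)."""

    def __init__(self, path):
        # Simplification: read everything up front instead of using an index.
        with open(path, encoding="utf-8") as f:
            entries = [json.loads(line) for line in f]
        self._keys = [key for key, _ in entries]
        self._values = [value for _, value in entries]

    def get(self, key):
        # The keys are sorted, so a binary search finds the entry.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return None


# Usage: write in arbitrary order, flush once, then read from the file.
path = os.path.join(tempfile.mkdtemp(), "0001.sst")
mem = MemTable()
mem.put("pear", "3")
mem.put("apple", "1")
mem.put("fig", "2")
mem.flush(path)

table = SSTable(path)
print(table.get("apple"), table.get("kiwi"))
```

Because the file is sorted, a lookup only needs a binary search over the keys rather than a scan of every entry.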

SSTables can be merged, which is useful, for example, to compact data by removing deleted entries. In fact, once the in-memory data has been flushed to an SSTable, the corresponding files are never written to again; they are only ever read from, or deleted after a successful merge. This means that the data inside any single file is never moved around for any reason whatsoever, which in turn makes distributing data (or snapshots of data) very easy.
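A merge of this kind can be sketched as follows, again in Python rather than in TeaKV's own code. The sketch assumes each table is a list of `(key, value)` pairs already sorted by key, that tables are passed oldest first, and that a deleted key is recorded with a `None` marker (`TOMBSTONE`). All of these are illustrative assumptions, not TeaKV's actual format.

```python
import heapq

TOMBSTONE = None  # hypothetical marker meaning "this key was deleted"


def merge_sstables(tables):
    """Merge sorted (key, value) lists given oldest to newest.

    The newest value for each key wins, and keys whose newest entry is a
    tombstone are dropped. The inputs are left untouched.
    """
    # Tag entries with the negated table position so that, for equal keys,
    # the entry from the newer table sorts first.
    tagged = (
        [(key, -age, value) for key, value in table]
        for age, table in enumerate(tables)
    )
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*tagged):
        if key == last_key:
            continue  # an older entry for a key we have already handled
        last_key = key
        if value is not TOMBSTONE:
            merged.append((key, value))
    return merged


older = [("a", "1"), ("b", "2"), ("c", "3")]
newer = [("b", TOMBSTONE), ("c", "30"), ("d", "4")]
print(merge_sstables([older, newer]))
# [('a', '1'), ('c', '30'), ('d', '4')]
```

The inputs are only read and the result is a new table, which matches the idea that existing files are never rewritten. Note that dropping tombstones like this is only safe when the merge covers every older table that might still hold the deleted key.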

One of the advantages of an embedded key-value store like this is that you can quite easily tune it to exactly your needs, something that is typically a lot harder with out-of-the-box systems built as generic solutions for all sorts of problems.

So take a look at the code and examples for TeaKV on GitHub.