Having a look at LevelDB-laoliulaoliu-ChinaUnix博客

miraclemiracle.blog.chinaunix.net

首页　| 　博文目录　| 　关于我

laoliulaoliu

博客访问： 4663362
博文数量： 1214
博客积分： 13195
博客等级：上将
技术积分： 9105
用户组：普通用户
注册时间： 2007-01-19 14:41

个人简介

C++,python,热爱算法和机器学习

文章分类

全部博文（1214）

cloud（3）
operation（9）
tornado（4）
mac_os（1）
golang（4）
架构（13）
git（4）
security（29）
shell（1）
macbook（1）
ruby（13）
javascript（15）
design（3）
testing（1）
mac（1）
bigdata（69）
nosql（46）
R（9）
gcj/acm（6）
NLP（10）
小说（3）
matlab（4）
web（44）
java（66）
product（7）
c#（1）
language（4）
machine learning（76）
science（4）
opencourse（2）
windows（3）
search（33）
algorithm（65）
database（51）
compiler（11）
ACE（5）
poem（1）
programming（29）
python（140）
assembly（1）
linux（49）
C++（16）
book（2）
cate（1）
phliosophy（3）
mental（30）
Science fiction（1）
Software（5）
c（23）
network（65）
CS（15）
thinking（10）
BSD（13）
solaris10（2）
life（57）
Debian（16）
economy（7）
Mathematics（57）
OS（8）
ibm（2）
gentoo（32）
未分配的博文（8）

文章存档

2021年（13）

2020年（49）

2019年（14）

2018年（27）

2017年（69）

2016年（100）

2015年（106）

2014年（240）

2013年（5）

2012年（193）

2011年（155）

2010年（93）

2009年（62）

2008年（51）

2007年（37）

我的朋友

Going off on a tangent: Automake

There are two code-trees that essentially make up LevelDB: LevelDB and Snappy (compression library used by LevelDB). It seems to me that Google are using autoconf/automake, which I’m not so familiar with. So if I’m to build leveldb and snappy myself, I should first get up-to-speed with that way of building (I’m a ./configure && make && make install kind of guy). Someone wrote a tutorial. The tools can be downloaded here:

: download + extract + ./configure && make && make install
: download + extract + ./configure && make && make install
: download + extract + ./configure && make && make install

What LevelDB is, the API

The API: Put, Get, Delete, CreateSnapshot, WriteBatch, and RangeIterationwith options for CustomComparators, Sync/Async.

Before going crazy with hello worlds and such, I wanted to briefly get an idea about LevelDB. There are two sides to that coin. One is understanding the API, and another is understanding the implementation (that comes later).

The best explanation of the LevelDB interface is perhaps to read the C++ examples in the LevelDB documentation. It highlights the best features of LevelDB (apart from Put,Get,Delete):

Making multiple operations atomic
Creating consistent snapshots
Iterating over key ranges (something you can’t do in most key-value stores)
Implementing alternative sorting orders for keys using custom Comparators

Even though I don’t normally code in C++, I found these easy enough to understand.

In the following I’ll recreate the examples in Python, and show you how to quickly install the Python binding (including the leveldb library).

Trying LevelDB in Python

Note: The Python bindings don’t support the full feature set of LevelDB. Sadly it seems that it is not possible to provide a custom Comparator. This means that implementing alternative orderings like e.g. a space-filling curve is not as easy as it could have been with arbitrary ordering of keys.

Note: In the C++ examples, a call to e.g. Get returns a Status: Did the read go well? In C++ the value to read is passed by reference to Get, so that the status can be returned I guess. I’m not sure how access the Status of an operation when using Python. Well, just ignore that…

Note: This is the extremely lazy way of getting leveldb up and running in Python. I’m not convinced this is how you should really do it in the “proper” way. However, it works. This also installs the C++ leveldb library, as it comes bundled with .

pip install leveldb

Making a clean slate:

import leveldb
 
leveldb.DestroyDB('./db')

db = leveldb.LevelDB('./db')

Hello world:

db.Put('hello', 'world') # persisted to disc

print db.Get('hello')

Atomically moving a value from one key to another:

# read value to move

value = db.Get('hello')

# create a batch object

batch = leveldb.WriteBatch()

batch.Put('hello2',value)

batch.Delete('hello')

db.Write(batch, sync=True)

# The extra cost of the synchronous write will be amortized across all of the writes in the batch.

Iterate over a key-range:

# put some values

db.Put('aaa','1')

db.Put('bbb','2')

db.Put('ccc', '3')

db.Put('ddd', '4')

# iterate over a range of keys

for k,v in db.RangeIter('b','d'):

 print k,v

# Prints:

# bbb, 2

# ccc, 3

Create a snapshot:

# put a value

db.Put('a','xxx')

# create a read-only snapshot

snapshot = db.CreateSnapshot()

# overwrite value in database

db.Put('a', 'yyy')

# read old value from snapshot

print snapshot.Get('a')

# prints 'xxx'

Note: I’m not really sure how to persist the snapshop to disc using Python.

Going off on a tangent: LevelDB and spatial data

Keys and spatial data

The most obvious way to use keys in LevelDB to store spatial data, is to store keys corresponding to values on a space-filling curve (e.g. Z-order or Hilbert curve). For the Python binding the trouble is that custom comparators are not possible, so the only ordering is alphabetical. I have to think of how this can be mapped to a space-filling curve (maybe it is totally obvious!).

I find that Z-orders are the easiest to understand and remember, since they correspond to QuadTrees.

Values and spatial data

LevelDB uses Snappy to compress the string keys and value, if it is available. Spatial data contains a lot of coordiantes, i.e. floats, so it is interesting how Snappy handles things like integers and floats.

Reading the it seems to me that both integers and floats are stored just as strings. Compression in Snappy comes from repeated byte-sequences in the stream (run-length coding and back-references). So maybe not so hot for spatial data that usually contains a large number of floats, i.e. coordinates. Maybe I missed something in the Snappy spec.

This is just assuming that spatial data is something like GeoJSON, in which case nothing magical will happen One could of course be a little less lazy and do the compression in the application, using e.g. Protocol Buffers or something even more efficient for spatial data.

Deeper into LevelDB

This is the section that talks about how LevelDB is implemented. This section will gradually get better.

Comparison to bitcask

talked on YCombinator/HackerNews about the :

LevelDB is a persistent ordered map; bitcask is a persistent hash table (no ordered iteration). – Sanjay

Also

Bitcask stores a fixed size record in memory for every key. So for databases with large number of keys, it may use too much memory for some applications. bitcask can guarantee at most one disk seek per lookup I think. leveldb may have to do a small handful of disk seeks. – Sanjay

Also

To clarify, leveldb stores data in a sequence of levels. Each level stores approximately ten times as much data as the level before it. A read needs one disk seek per level. So if 10% of the db fits in memory, leveldb will need to do one seek (for the last level since all of the earlier levels should end up cached in the OS buffer cache). If 1% fits in memory, leveldb will need two seeks.. – Sanjay

Comparison to TokyoCabinet

The short story. LevelDB is optimized for writes, and TokyoCabinet for reads. See the following :

TokyoCabinet is something we seriously considered using instead of writing leveldb. TokyoCabinet has great performance usually. I haven’t done a careful head-to-head comparison, but it wouldn’t surprise me if it was somewhat faster than leveldb for many workloads. Plus TokyoCabinet is more mature, has matching server code etc. and may therefore be a better fit for many projects. – Sanjay

and

However because of a fundamental difference in data structures (TokyoCabinet uses btrees for ordered storage; leveldb uses log structured merge trees), random write performance (which is important for our needs) is significantly better in leveldb. This part we did measure. IIRC, we could fill TokyoCabinet with a million 100-byte writes in less than two seconds if writing sequentially, but the time ballooned to ~2000 seconds if we wrote randomly. The corresponding slowdown for leveldb is from ~1.5 seconds (sequential) to ~2.5 seconds (random). – Sanjay

Durability

on HN about the durability of LevelDB. Sanjay also commented on what exactly it means that (for asynchronous writes), namely that the database is not corrupted following a crash, but merely result in incomplete data files.

Bloom-filters

Because of the way leveldb data is organized on disk, a single Get() call may involve multiple reads from disk. The optional FilterPolicy mechanism can be used to reduce the number of disk reads substantially. – LevelDB documentation

Bloom filter based filtering relies on keeping some number of bits of data in memory per key (in this case 10 bits per key since that is the argument we passed to NewBloomFilterPolicy). This filter will reduce the number of unnecessary disk reads needed for Get() calls by a factor of approximately a 100. Increasing the bits per key will lead to a larger reduction at the cost of more memory usage.

LSM-Tree

An is used to increase write-throughput. This makes LevelDB more suitable for random writes-heavy workloads. In contrast KyotoCabinet implements ordered keys using a B-Tree, which is more suitable for read-heavy workloads.

Memtables and SSTables

LevelDB might be similar to Cassandra and other inspired databases when it comes to writes. In Cassandra this is how a write is processed:

Cassandra writes are first written to the CommitLog, and then to a per-ColumnFamily structure called a Memtable. When a Memtable is full, it is written to disk as an SSTable. –