Embedded K/V Data Processing Library - Lightning MDB
2013-12-16 13:41:50 阿炯

LMDB is a fast, compact key-value data store developed by Symas for the OpenLDAP Project. Because it uses memory-mapped files, its read performance matches that of an in-memory database, with its maximum size limited by the virtual address space. It is written in C and released under the OpenLDAP (BSD-style) license.


Symas Lightning Memory-Mapped Database: Lightning MDB (LMDB)


Lightning Memory-Mapped Database Manager (LMDB), the Lightning Memory-mapped Database: an extraordinarily fast, memory-efficient database we developed for the Symas OpenLDAP Project.


LMDB is a Btree-based database management library modeled loosely on the BerkeleyDB API, but much simplified. The entire database is exposed in a memory map, and all data fetches return data directly from the mapped memory, so no malloc's or memcpy's occur during data fetches. As such, the library is extremely simple because it requires no page caching layer of its own, and it is extremely high performance and memory-efficient. It is also fully transactional with full ACID semantics, and when the memory map is read-only, the database integrity cannot be corrupted by stray pointer writes from application code.


The library is fully thread-aware and supports concurrent read/write access from multiple processes and threads. Data pages use a copy-on-write strategy so no active data pages are ever overwritten, which also provides resistance to corruption and eliminates the need for any special recovery procedures after a system crash. Writes are fully serialized; only one write transaction may be active at a time, which guarantees that writers can never deadlock. The database structure is multi-versioned so readers run with no locks; writers cannot block readers, and readers don't block writers.


Unlike other well-known database mechanisms which use either write-ahead transaction logs or append-only data writes, LMDB requires no maintenance during operation. Both write-ahead loggers and append-only databases require periodic checkpointing and/or compaction of their log or database files, otherwise they grow without bound. LMDB tracks free pages within the database and re-uses them for new write operations, so the database size does not grow without bound in normal use.


Features

Ordered-map interface
keys are always sorted; range lookups are supported (see the cursor sketch after this list)

Fully-transactional
full ACID semantics with MVCC

Reader/writer transactions
readers don't block writers; writers don't block readers

Fully serialized writers
writes are always deadlock-free

Extremely cheap read transactions
can be performed using no mallocs or any other blocking calls

Multi-thread and multi-process concurrency supported
Environments may be opened by multiple processes on the same host

Multiple sub-databases may be created
transactions cover all sub-databases

Memory-mapped
allows for zero-copy lookup and iteration

Maintenance-free
no external process or background cleanup or compaction required

Crash-proof
no logs or crash recovery procedures required

No application-level caching
LMDB fully exploits the operating system's buffer cache

32KB of object code and 6KLOC of C
fits in CPU L1 cache for maximum performance
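Because keys are kept sorted, range lookups fall out of the cursor API naturally. Here is a minimal sketch, assuming an already-open MDB_env; the "user:" prefix key is a made-up example. It positions a cursor at the first key greater than or equal to the starting point and walks forward:

#include <lmdb.h>
#include <stdio.h>
#include <string.h>

/* Scan all keys >= "user:" in sorted order (env assumed open). */
void range_scan(MDB_env *env) {
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_cursor *cur;
    MDB_val key, data;

    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    mdb_cursor_open(txn, dbi, &cur);

    key.mv_size = strlen("user:");
    key.mv_data = "user:";
    /* MDB_SET_RANGE positions at the first key >= the given key. */
    if (mdb_cursor_get(cur, &key, &data, MDB_SET_RANGE) == MDB_SUCCESS) {
        do {
            /* key/data point directly into the map: zero-copy,
               valid until the transaction ends. */
            printf("%.*s\n", (int)key.mv_size, (char *)key.mv_data);
        } while (mdb_cursor_get(cur, &key, &data, MDB_NEXT) == MDB_SUCCESS);
    }

    mdb_cursor_close(cur);
    mdb_txn_abort(txn);
}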


Technical description

Internally LMDB uses B+ tree data structures. The efficiency of its design and its small footprint had the unintended side-effect of providing good write performance as well. LMDB has an API similar to Berkeley DB and dbm. LMDB treats the computer's memory as a single address space, shared across multiple processes or threads using shared memory with copy-on-write semantics (known historically as a single-level store). Because most earlier computing architectures had 32-bit memory address spaces, which impose a hard limit of 4 GB on the size of any database using such techniques, the technique of directly mapping a database into a single-level store was of strictly limited usefulness. However, today's 64-bit processors mostly implement 48-bit address spaces, giving access to 47-bit addresses or 128 terabytes of database size, making databases using shared memory useful once again in real-world applications.

Specific noteworthy technical features of LMDB are:

Its use of B+ tree. With an LMDB instance being in shared memory and the B+ tree block size being set to the OS page size, access to an LMDB store is extremely memory efficient.

New data is written without overwriting or moving existing data. This results in guaranteed data integrity and reliability without requiring transaction logs or cleanup services.

The provision of a unique append-write mode (MDB_APPEND) which is implemented by allowing the new record to be added directly to the end of the B+ tree. This reduces the number of reads and write page operations, resulting in greatly-increased performance but requiring that the programmer is responsible for ensuring keys are already in sorted order when storing into the DB.
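A sketch of that bulk-load path follows; bulk_load is a hypothetical helper, and the keys are assumed to arrive already sorted (LMDB rejects out-of-order keys under MDB_APPEND with MDB_KEYEXIST):

#include <lmdb.h>
#include <stddef.h>

/* Bulk-load n pre-sorted key/value pairs in one write transaction.
   MDB_APPEND skips the usual B+tree search and page splits, turning
   the load into sequential writes of new sibling pages. */
int bulk_load(MDB_env *env, MDB_dbi dbi,
              MDB_val *keys, MDB_val *vals, size_t n) {
    MDB_txn *txn;
    int rc = mdb_txn_begin(env, NULL, 0, &txn);
    if (rc) return rc;
    for (size_t i = 0; i < n; i++) {
        rc = mdb_put(txn, dbi, &keys[i], &vals[i], MDB_APPEND);
        if (rc) { mdb_txn_abort(txn); return rc; }
    }
    return mdb_txn_commit(txn);
}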

Copy-on-write semantics help ensure data integrity as well as providing transactional guarantees and simultaneous access by readers without requiring any locking, even by the current writer. New memory pages required internally during data modifications are allocated through copy-on-write semantics by the underlying OS: the LMDB library itself never actually modifies older data being accessed by readers because it simply cannot do so: any shared-memory updates automatically create a completely independent copy of the memory-page being written to.

As LMDB is memory-mapped, it can return direct pointers to memory addresses of keys and values through its API, thereby avoiding unnecessary and expensive copying of memory. This results in greatly-increased performance (especially when the values stored are extremely large), and expands the potential use cases for LMDB.

LMDB also tracks unused memory pages, using a B+ tree to keep track of pages freed (no longer needed) during transactions. By tracking unused pages the need for garbage-collection (and a garbage collection phase which would consume CPU cycles) is completely avoided. Transactions which need new pages are first given pages from this unused free pages tree; only after these are used up will it expand into formerly unused areas of the underlying memory-mapped file. On a modern filesystem with sparse file support this helps minimise actual disk usage.

Its main features are:
file-mapped I/O (mmap)
a B+ tree based key-value interface
transaction handling based on MVCC (Multi-Version Concurrency Control)
a very simple BerkeleyDB (bdb)-like API


The basic idea behind LMDB is to access storage through mmap, regardless of whether that storage lives in memory or on a persistent medium.

All read operations in LMDB map the file to be accessed read-only into the host process's address space via mmap and access the corresponding addresses directly. This eliminates copies between disk, kernel address space, and user address space, and it simplifies the implementation of a flat "index space". Because the mmap is read-only, it also removes the risk of the host program corrupting the storage structures through stray writes. I/O scheduling is handled by the operating system's page-management machinery.

Writes, on the other hand, go through the write() system call, chiefly to take advantage of the operating system's file-system consistency and to avoid synchronization on the addresses being read. But if reads go straight through the mmap'ed memory, wouldn't a concurrent modification of the content being read produce inconsistent results? In fact, no content in LMDB is ever modified in place.
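As a concrete illustration of that read path, here is a minimal zero-copy read using the public LMDB C API; the environment path "./testdb" and the key "hello" are placeholders:

#include <lmdb.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", MDB_RDONLY, 0664); /* placeholder path */

    /* A read transaction pins a consistent snapshot of the map. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    key.mv_size = strlen("hello");
    key.mv_data = "hello"; /* placeholder key */

    if (mdb_get(txn, dbi, &key, &data) == MDB_SUCCESS) {
        /* data.mv_data points directly into the read-only memory map:
           no malloc, no memcpy. Valid until the transaction ends. */
        printf("%.*s\n", (int)data.mv_size, (char *)data.mv_data);
    }

    mdb_txn_abort(txn); /* read transactions are simply released */
    mdb_env_close(env);
    return 0;
}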

LMDB uses MVCC to handle concurrent read and write access. The key points are:
every change corresponds to a version
a change never modifies the existing version in place
a reader obtains a version on entry and reads only that version's content


For a tree data structure, when one of its nodes is changed, a new node is created to hold the change. Since the node's parent must also change (its pointer now refers to the new node rather than the old one), the process repeats: every node on the path from the changed node up to the root must be re-created. When the change is complete, the new version is committed with a single atomic operation. Each new version thus produces a new root node, and the store ends up retaining every historical version, including whichever versions current readers are reading. A change therefore never affects readers at all, so writes are never blocked by reads.

That covers reads: the scheme above promises every reader a consistent version, namely the one it obtained on entry, but it does not promise the latest version. Now consider a transaction that updates one value based on another. Clearly, by the time we want to commit, the version we entered with may no longer be the latest; that is, another commit may have landed between our entry and our commit. Committing the change in that situation would produce an inconsistency: a monotonically increasing counter, for example, could "swallow" several increments. To solve this, we simply check at commit time whether the version we entered with is still the latest, which can usually be done with a single CAS atomic operation. If that operation fails, we re-enter the store and redo the whole transaction. In this way reads are also never blocked by possible writes: this is classic optimistic locking.
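A minimal sketch of that optimistic-commit loop, using C11 atomics; this illustrates the generic MVCC argument only, not LMDB's actual implementation (which, as explained below, serializes writers with a lock), and apply_changes is a hypothetical helper that builds a new version from a snapshot:

#include <stdatomic.h>
#include <stdint.h>

extern uint64_t apply_changes(uint64_t snapshot_root); /* hypothetical */

_Atomic uint64_t current_root; /* refers to the latest committed version */

void commit_with_retry(void) {
    for (;;) {
        uint64_t snapshot = atomic_load(&current_root);
        uint64_t new_root = apply_changes(snapshot); /* copy-on-write work */
        /* Publish only if no other commit intervened; otherwise redo
           the whole transaction against the newer version. */
        if (atomic_compare_exchange_strong(&current_root, &snapshot, new_root))
            break;
    }
}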

As described above, the store retains every historical version. Is that necessary? In fact we keep old versions only because readers may still be reading them. New readers always read the newest version, so old versions become useless: once a version has no readers and no writers, there is no reason for it to exist. Old-version reclamation can be implemented directly from this principle, but LMDB makes some refinements:

A basic fact is that new readers always go to the newest version, so keeping the root nodes of every version is unnecessary; we only need the root node of the latest version plus one node for committing the next change.

Every reader, on entry, copies a snapshot of the current root node (because it may well change immediately afterwards).

When a change is committed, it is known exactly which nodes this commit will sooner or later render useless: every node modified by this change becomes garbage once the readers of the corresponding version have exited, and can then be reclaimed.

For the same reason, the latest version can collect all nodes awaiting reclamation together with the versions they belong to.

A table of reader slots is maintained; from it the smallest live version can be found, and reclaimable nodes belonging to versions older than that can be reclaimed.



To summarize: there are now only two root-node slots, and every change must ultimately update the root, so all writes are effectively serialized. This does not reduce performance, for the following reason. As described above, when two changes proceed concurrently (more precisely, both enter the same version, make changes based on it, and then attempt to commit), one of the two transactions must be redone, because another commit landed between its entry and its commit; the conclusion generalizes to any number of concurrent writers. In other words, changes are serialized in practice anyway; an MVCC scheme with no blocking between writers simply consumes more computing resources, since every failed commit must be redone. LMDB therefore uses a lock to serialize all write operations.


Q&A on some key questions

Can LMDB support concurrent writes?

According to the LMDB documentation, LMDB handles concurrent writes on its own. When multiple read-write transactions are opened at once, LMDB makes every write transaction other than the active one wait until the currently active write transaction commits. In that sense it handles concurrent writes.
In other words, as the discussion above showed, writes are serialized rather than truly concurrent; this is arguably the more efficient design, since the cost of genuinely concurrent writes is high.
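A minimal write-transaction sketch (env assumed open; keys and values are placeholders). If another thread or process holds the write lock, mdb_txn_begin simply blocks until this caller can become the single active writer:

#include <lmdb.h>
#include <string.h>

int put_one(MDB_env *env, const char *k, const char *v) {
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;
    int rc;

    /* Only one write transaction may be active at a time; this call
       waits on the global write lock. */
    rc = mdb_txn_begin(env, NULL, 0, &txn);
    if (rc) return rc;
    rc = mdb_dbi_open(txn, NULL, 0, &dbi);
    if (rc) { mdb_txn_abort(txn); return rc; }

    key.mv_size  = strlen(k); key.mv_data  = (void *)k;
    data.mv_size = strlen(v); data.mv_data = (void *)v;

    rc = mdb_put(txn, dbi, &key, &data, 0);
    if (rc) { mdb_txn_abort(txn); return rc; }
    return mdb_txn_commit(txn); /* durable once this returns 0 */
}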

Doesn't LMDB have a single, global write lock?

It does, yes. The best architecture when using LMDB is to have a single writer thread, and many independent reader threads.

That said, it is nice to be able to run the occasional separate process to do some maintenance task on the database (e.g. to do some writes), and the global lock comes in handy then.

The main reason a single write thread design isn't a problem is you can do a million or more separate, fully transactional and isolated write transactions per second on a single core with LMDB. Writes are simply not the bottleneck.

You can't get anywhere near that kind of write performance with a fully transactional and concurrent write system like CockroachDB, not even with 100x the hardware. However, some of the eventually consistent databases, like ScyllaDB, can scale up to more total writes per second (but of course you give up the transactions, and the consistency).


Below is a technical interview with principal developer Howard Chu from the project's official page, along with the related Q&A.

LMDB: Q&A with Symas Corporation's Howard Chu About Symas's Lightning Memory-Mapped Database


Howard Chu, the Chief Architect for the OpenLDAP Project and CTO of Symas Corporation, discusses Symas's Lightning Memory-Mapped Database (LMDB), the memory-mapped database that was developed and contributed to the OpenLDAP Project by Symas. In this interview we discuss the nitty gritty of the database and why it's "not just another new database".

Q: What exactly is LMDB?
A: LMDB is a B+tree database that exploits the memory-mapped file capabilities available in modern operating systems. It is fully transactional, ACID compliant, and uses multi-version concurrency control (MVCC) to avoid the need for most database locks. Its compiled size is small enough to fit completely inside the L1 cache of most modern CPUs, making it extremely fast.

Q: Is this a completely new design?
A: Not really. LMDB builds on several pieces of work done since the 1960s. The notion of a memory-mapped database came from work done on the Multics operating system in the 1960s, based on a concept called "Single-Level Store". It was mostly abandoned by the mid-1990s when available disk space outstripped the capacity of 32-bit address spaces (4 gigabytes). This has changed with the widespread availability of systems with 64-bit addressing, which puts the upper bound of database size at 8 exabytes. B+trees were first described in 1972 and are commonly used in databases and some file systems. The code implementing LMDB's database structure has its origins in the append-only B-tree code written in 2010 by Martin Hedenfalk for OpenBSD's ldapd implementation. I also added features available in Berkeley DB (BDB), which we used in OpenLDAP's back-bdb and back-hdb backends, to simplify writing a slapd backend with LMDB. So the basics have been around for many years. I've just brought them together and made selective improvements on them to create LMDB.

Q: Why did you choose to base LMDB on an append-only design?
A: The motivation behind an append-only design is that existing data never gets overwritten, making it impossible for a database to be corrupted by an interrupted operation (e.g. due to system crash). As such, these database designs are crash-proof and need no intensive recovery process when restarting after a crash. LMDB achieves the same goal (instantly available, crash-proof) but uses disk space more efficiently. Not everyone is like Google with endless strings of disks at their disposal.

Q: What do you think are the biggest contributors to LMDB's performance?
A: Database reads can be satisfied with no memory allocations (mallocs) or memory-to-memory copies (memcpy). The other major contributor is that no locks are required to satisfy a read operation. All of those can be significant performance bottlenecks in typical code. This is part of the reason that LMDB reads are so much faster than every other database, and why reads scale perfectly linearly across multiple CPUs - there are no bottlenecks in the read path.

Q: Databases that provide MVCC are notorious for their need of excess storage space, typically double that of the maximum expected database size. You also mentioned that the B+tree code is append-only. Doesn't that make LMDB very wasteful in its use of disk space?
A: No. LMDB maintains a second B+tree that tracks pages that have been freed. Updaters will re-use pages from this free list whenever possible and the database size will remain fairly static. This is a key advantage of LMDB over other MVCC databases like CouchDB.

Q: Another problem with MVCC databases is that updates may be denied while garbage collection takes place. Is that the case with LMDB?
A: No. By using the second B+tree to manage free space, we avoid the need to perform garbage collection in the first place.

Q: Is the size of the database restricted to the amount of physical memory in the system?
A: No. The maximum database size is constrained only by the amount of disk space available and by the size of the address space on the machine. For 32-bit implementations it's restricted to approximately 2^31 bytes (2 GB), and for 64-bit implementations, which typically bring 48 address bits out of the CPU, it's restricted to 2^47 bytes (128 TB). The operating system takes care of moving data in and out of available memory as needed. This means that database sizes can be many multiples of available physical memory.
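That upper bound is set when the environment is created. A sketch, assuming a 64-bit build; the 1 TB figure is an arbitrary example and "./bigdb" is a placeholder path:

#include <lmdb.h>

MDB_env *env;
mdb_env_create(&env);
/* Reserve 1 TB of address space for the map (example value).
   This is an address-space reservation, not an allocation: with
   sparse files, disk use grows only as pages are actually written. */
mdb_env_set_mapsize(env, 1UL << 40);
mdb_env_open(env, "./bigdb", 0, 0664);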

Q: How does that work?
A: If a portion of the database that is not currently in memory is accessed, the OS suspends the database task, reads the needed data from disk into memory, and then resumes the database task. If there's no memory available at the moment, another page from the memory map is reused to hold the data. An interesting point here is that since most of the pages in LMDB's map are never written/dirty (since LMDB does copy-on-write), they don't need to be flushed to disk. They can be reused immediately by the kernel, the old contents are discarded with zero page-out overhead.

Q: How does available memory relate to database performance then?
A: Moving data between disk and memory takes time, so having more memory available for the database to use reduces the number of times that movement takes place and improves performance. The best case is having enough memory to hold the entire database in memory, but that's not always practical. As a general rule you want to have enough memory to accommodate as much of your "working set", the data that's commonly accessed, as possible. That will insure that most database operations use data that's already in memory.

Q: What does the API look like?
A: The API was deliberately modeled on that of Oracle's Berkeley DB to ease porting applications that use that database.

Q: Does LMDB have the same access methods as Berkeley DB?
A: No. While LMDB is basically modeled after BDB, it is overall much simpler. BDB provides extensible hashing as well as B-tree based methods. LMDB only implements B+trees. Hashing is cache-unfriendly by design, and doesn't perform well as data volumes grow. B+trees leverage locality of reference thus maximizing the efficiency of system caches.

Q: Does LMDB have caches or need special tuning?
A: No. The SLS design means that database reads are satisfied directly from the file system cache without the need to copy data into separate database buffers first. This is such a simple design that there is nothing to tune. The choice of underlying file system can have a strong impact on database performance, though, and there are several file system tuning parameters that can be tweaked to improve performance in certain applications.

Q: How do you back up an LMDB instance?
A: MDB includes an mdb_copy utility that can make a complete copy of an existing LMDB instance. The new copy of the database will itself be a proper LMDB database, so there's no need of a separate mdb_load utility.
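For a live backup from inside a program, the library's mdb_env_copy call does the same job as the mdb_copy tool. A minimal sketch; the destination directory is a placeholder and must already exist:

#include <lmdb.h>

/* Write a consistent, self-contained copy of an open environment.
   The copy runs inside a read transaction internally, so writers
   continue undisturbed. */
int backup(MDB_env *env) {
    return mdb_env_copy(env, "/backup/lmdb-copy"); /* placeholder dir */
}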

Q: Can you back up an LMDB instance while it is actively updating, or do you need to stop it?
A: There is no need to stop the database while performing a backup, but… LMDB uses MVCC, so once a read transaction starts, it is guaranteed a self-consistent view of the database until the transaction ends. Keeping a read transaction open for a very long time will prevent the dirty page reclaimer from re-using the pages that are referenced by that read transaction. As such, any ongoing write activity may be forced to use new pages. The database file size can grow very rapidly in these situations until the read transaction concludes.

Q: One of the issues with B-tree databases like Berkeley DB is that they periodically experience significant slowdowns for certain requests when B-tree re-balancing takes place. What is the typical response-time distribution for inserts with LMDB?
A: LMDB is also a B+tree; at a high level it will have characteristics similar to those of Berkeley DB. It's always possible for an insert that results in a page split to have a cascade effect, causing page splits to propagate all the way back up to the root of the tree. But in general, LMDB is still more efficient than BDB so the actual variance in performance caused by re-balancing will be much smaller.

Also note that if you're bulk-loading records in sorted order, using the available MDB_APPEND option basically transforms the adds into a series of sequential write operations - when one page fills, instead of splitting it in half as usual, we just allocate a new sibling page and continue filling it. Page "splits" of this sort can still ripple upward, but they're so cheap as to be unmeasurable.

The test results that I posted for memcachedb bear this out. In a test using four client threads, the average latency for a BDB write operation was 655 microseconds and the maximum was 336 milliseconds. In the same test, the average latency for LMDB was 127 microseconds and the maximum latency was 517 microseconds. So while on average LMDB performed 5 times faster than BDB, the difference in maximums was three orders of magnitude. So LMDB has a much more predictable response profile than BDB.

Q: Does an LMDB instance exploit multiple cores? If so, what is the structure of this usage?
A: Within an LMDB environment, multiple readers can run concurrently with a single writer. Readers are never blocked by anything and do not themselves block writers.

Q: How does LMDB compare to other databases such as SAP HANA, Hadoop, and Cassandra?
A: SAP HANA is an in-memory database and has all the same drawbacks as every other in-memory DB - it is limited to the size of physical RAM, which makes it useless for anything other than a constrained cache. It probably also suffers from excessive use of malloc and memcpy, like most other in-memory databases.

Hadoop and Cassandra are both nice computer science projects, but they're also both written in Java. They work, but we can do better with LMDB.

Q: We've talked a lot about design; what about performance? Can you give me some round numbers for how LMDB stacks up to other databases?
A: Sure. We started with a fairly small system by today's standards: 32GB of RAM, a 512 GB SSD, and 4 processors running at 2.2 GHz. We compared LMDB to the Berkeley DB in the OpenLDAP use cases using a 10-million entry database. We found that with LMDB the time to bulk-load the database was one third of that required for BDB, with the load time going from 113 minutes to 20 minutes. Read performance was where we saw the greatest improvement: LMDB posted search rate results of 31,674 searches per second, more than twice that of BDB's 14,567 searches per second. OpenLDAP's memory footprint after running that test under LMDB was just about a quarter of what BDB required, so it's clear that memory efficiency is also much better with LMDB.

We also ran a set of concurrency tests on a larger, faster machine. In those tests performance improvements ranged from about a 2x increase in a 2-processor configuration to a 15x increase in a 16-processor configuration. This is due to LMDB's lockless read design. With randomly-generated searches across the entire database, we reached 119,000 searches/second vs 67,000 searches/second for BDB. These numbers are extremely good; no other directory server we tested has ever reached even half of what BDB can do. Update performance is excellent and continues to improve. In these tests a mixed search-update job pegged at 40,600 ops/second for HDB, while LMDB performed about 86,000 ops/second. We are continuing to work on write performance, and we expect it to continue to improve in later releases. [Note: Additional benchmarking information can be found at https://symas.com/mdb/#bench]

Q: Where do you see LMDB having the most impact?
A: It can replace pretty much any piece of code where Berkeley DB is being used, and the product will be many times faster with a smaller memory footprint. Obviously it's a game changer for OpenLDAP and it's seeing tremendous acceptance. Where I really see benefit, though, is in mobile devices as a replacement for SQLite, which is used pervasively in mobile OSes for storing settings, configurations, and data. SQLite has a much larger footprint and is quite a bit less efficient than LMDB, so mobile devices that use LMDB instead of SQLite will be more responsive and will be able to do more. Other areas where I think this would have a big impact include session management in web applications, replacing things like memcache and memcacheDB.

Q: I understand you've run benchmarks with several different types of file systems to see how LMDB behaves with them. What can you tell me about that work?
A: Yes, and we've also run against other databases that are either established or now coming into vogue. The results of that work have been collected at https://symas.com/mdb/#bench.

As a general rule, all of the journaling systems perform worse than plain old Linux ext2. However, that's just in their default configuration with their journals embedded in the same partition as their filesystem. I also set them up using an external journal. In this case, since I had no other hard drive available, I created a 1GB file on tmpfs (temporary file system) and used a loopback device to get mkfs to use it.

With an external journal the synchronous write results on JFS are quite fast, faster than even ext2. If you have an application where fully synchronous transactions are required, JFS looks to be the way to go.

I have not seen anyone else doing filesystem benchmarks with external journals. It seems to me most people are using these filesystems in quite sub-optimal configurations.

Q: What about clustering? Can LMDB be used in a clustered environment?
A: We've looked at using LMDB as the backend for several existing distributed database systems, such as Hadoop, Cassandra, MongoDB, Riak, and HyperDex, among others. Many of these are poor candidates because their data stores are integral parts of their software architectures and it would take a great deal of effort to work LMDB in as a backend. Some, like Riak and HyperDex, have a very well-defined modular backend architecture, which makes them much better targets for adapting to LMDB. We've completed a port to HyperDex and the Riak folks report that they have an experimental backend.

Q: Are you encouraging other Open Source projects to adopt LMDB?
A: Yes. Several projects have already integrated LMDB with their code, and we urge anyone interested in using LMDB to contact us if they need assistance.

Q: What, if any, other projects are adopting LMDB for use as their database?
A: LMDB was integrated with OpenDKIM several months ago and is now available in public releases. Symas did the work to integrate LMDB with Heimdal Kerberos, Postfix, and Cyrus SASL, and contributed the results to each of the projects. We expect LMDB support to appear in public releases of these packages soon. We've also developed versions of MemcacheDB and SQLite3 that use LMDB, and that code was posted to Gitorious. As an interesting side-note, according to the memcache testing utility, MemcacheDB with LMDB is faster than the memory-only version of Memcache. LMDB is also integrated with Riak and we are working on an interface to SQLite4. Many other projects are using LMDB as well- the complete list can be found at https://symas.com/mdb/#projects.

Q: Can LMDB be used with programming languages?
A: Yes. The major languages (C, C++, Perl, Python, PHP, etc.) are all supported now, and many more are coming. The complete list of supported languages is at https://symas.com/mdb/#wrappers.

Q: What comes with LMDB?
A: Besides the LMDB library itself, we offer mdb_stat, which provides information about the database and its state, and mdb_copy, which creates a complete backup of the database. We also provide the complete API documentation.
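The same information mdb_stat prints is also available programmatically. A minimal sketch using mdb_env_stat on an open environment:

#include <lmdb.h>
#include <stdio.h>

/* Print a few statistics for the main database of an open env. */
void print_stats(MDB_env *env) {
    MDB_stat st;
    mdb_env_stat(env, &st);
    printf("page size:  %u\n", st.ms_psize);
    printf("tree depth: %u\n", st.ms_depth);
    printf("entries:    %zu\n", st.ms_entries);
}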

Q: How is LMDB licensed?
A: It's part of OpenLDAP, so it's under the OpenLDAP license- basically a BSD-style license.

Q: What about support?
A: My company, Symas, offers commercial-grade support programs for LMDB to OEMs. We'll also be offering supported versions of memcacheDB and other NoSQL-type databases to end users.

About Howard Chu:
Howard Chu is a co-founder of Symas and serves as its CTO. He is also Chief Architect for the OpenLDAP Project and plays a mean Irish fiddle. You can find out more at highlandsun.com.


Sources:
Lightning Memory-Mapped Database

lmdb docs for python


Latest version: 0.9
This release zero-initializes heap-allocated memory before use by default and adds the MDB_NOMEMINIT option to disable that initialization; it brings some performance improvements and fixes several 64-bit Windows build problems as well as bugs in mdb_page_split() and mdb_cursor_del().

项目主页:https://symas.com/lmdb/