[Buildroot] [PATCH v5 0/1] cover letter for package/libmdbx: new package (library/database)

Leonid Yuriev leo at yuriev.ru
Fri Nov 27 13:47:11 UTC 2020


This patch adds libmdbx v0.9.2. A brief overview of libmdbx follows below.
Please merge.

Regards,
Leonid.

--

libmdbx is an extremely fast, compact, powerful, embedded, transactional
key-value database with a permissive license. libmdbx has a specific set
of properties and capabilities, focused on creating unique lightweight
solutions.

Historically, libmdbx (MDBX) is a deeply revised and extended descendant
of the legendary LMDB (Lightning Memory-Mapped Database). libmdbx
inherits all benefits from LMDB, but resolves some issues and adds a set
of improvements.

According to its developers, libmdbx currently surpasses LMDB in terms
of reliability, features and performance.


The most important differences between MDBX and LMDB:
=====================================================

1. More attention is paid to the quality of the code, to the
"unbreakability" of the API, and to testing and automatic checks (i.e.
sanitizers, etc). As a result there are:
 - more control during operation;
 - more checking of parameters and internal audit of database structures;
 - no warnings from the compiler;
 - no issues from ASAN, UBSAN, Valgrind, Coverity;
 - etc.

2. Keys can be more than twice as long as in LMDB.

3. Up to 20% faster than LMDB in CRUD benchmarks.

4. Automatic on-the-fly database size adjustment,
both increment and reduction.

5. Automatic continuous zero-overhead database compactification.

6. The same database format for 32- and 64-bit builds.

7. LIFO policy for Garbage Collection recycling (thanks to the
write-back disk cache, this can increase write performance by up to
several times in a best case scenario).

8. Range query estimation.

9. Utility for checking the integrity of the database structure with
some recovery capabilities.

For more information, please refer to:
 - https://github.com/erthink/libmdbx for source code and README.
 - https://erthink.github.io/libmdbx for API description.

--

MDBX is a Btree-based database management library modeled loosely on the
BerkeleyDB API, but much simplified. The entire database (aka
"environment") is exposed in a memory map, and all data fetches return
data directly from the mapped memory, so no malloc's or memcpy's occur
during data fetches. As such, the library is extremely simple because it
requires no page caching layer of its own, and it is extremely high
performance and memory-efficient. It is also fully transactional with
full ACID semantics, and when the memory map is read-only, the database
integrity cannot be corrupted by stray pointer writes from application
code.

The library is fully thread-aware and supports concurrent read/write
access from multiple processes and threads. Data pages use a
copy-on-write strategy so no active data pages are ever overwritten,
which also provides resistance to corruption and eliminates the need for
any special recovery procedures after a system crash. Writes are fully
serialized; only one write transaction may be active at a time, which
guarantees that writers can never deadlock. The database structure is
multi-versioned so readers run with no locks; writers cannot block
readers, and readers don't block writers.
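
The reader/writer behaviour described above can be illustrated with a
small toy model. This is only a sketch under simplifying assumptions,
not how libmdbx is implemented: the class and method names are
hypothetical, the whole map is copied on each commit instead of
individual pages, and a Python lock stands in for the single-writer
rule.

```python
import threading


class ToyStore:
    """Toy multi-versioned key-value store (illustrative, not libmdbx)."""

    def __init__(self):
        self._root = {}                       # committed snapshot, never mutated
        self._writer_lock = threading.Lock()  # serializes write transactions

    def begin_read(self):
        # A reader only takes a reference to the committed snapshot,
        # so it needs no lock and cannot be blocked by a writer.
        return self._root

    def write(self, updates):
        # Only one write transaction may be active at a time.
        with self._writer_lock:
            new_root = dict(self._root)       # copy instead of overwriting
            new_root.update(updates)
            self._root = new_root             # commit: publish the new version


store = ToyStore()
store.write({"a": 1})
snapshot = store.begin_read()    # reader pins the version where a == 1
store.write({"a": 2, "b": 3})    # a writer commits a newer version meanwhile
print(snapshot)                  # the pinned reader still sees its own version
print(store.begin_read())        # a new reader sees the latest commit
```

The point of the sketch is that a commit replaces a reference rather
than modifying data in place, which is why a reader holding an older
version is never disturbed.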

Unlike other well-known database mechanisms which use either write-ahead
transaction logs or append-only data writes, MDBX requires no
maintenance during operation. Both write-ahead loggers and append-only
databases require periodic checkpointing and/or compaction of their log
or database files; otherwise they grow without bound. MDBX tracks
retired/freed pages within the database and re-uses them for new write
operations, so the database size does not grow without bound in normal
use.

The memory map can be used as a read-only or read-write map. It is
read-only by default as this provides total immunity to corruption.
Using read-write mode offers much higher write performance, but adds the
possibility for stray application writes through pointers to silently
corrupt the database.


Features
========

- Key-value data model, keys are always sorted.

- Fully ACID-compliant, through MVCC and CoW.

- Multiple key-value sub-databases within a single datafile.

- Range lookups, including range query estimation.

- Efficient support for short fixed length keys, including native
  32/64-bit integers.

- Ultra-efficient support for multimaps. Multi-values are sorted,
  searchable and iterable. Keys are stored without duplication.

- Data is memory-mapped and accessible directly/zero-copy. Traversal of
  database records is extremely fast.

- Transactions for readers and writers; neither blocks the other.

- Writes are strongly serialized. No transaction conflicts nor
  deadlocks.

- Readers are non-blocking, notwithstanding snapshot isolation.

- Nested write transactions.

- Reads scale linearly across CPUs.

- Continuous zero-overhead database compactification.

- Automatic on-the-fly database size adjustment.

- Customizable database page size.

- O(log N) cost of lookup, insert, update, and delete operations by
  virtue of B+ tree characteristics.

- Online hot backup.

- Append operation for efficient bulk insertion of pre-sorted data.

- No WAL nor any transaction journal. No crash recovery needed. No
  maintenance is required.

- No internal cache and/or memory management, all done by basic OS
  services.


Limitations
===========

- Page size: a power of 2, maximum 65536 bytes, default 4096 bytes.

- Key size: minimum 0, maximum ≈¼ pagesize (1300 bytes for default 4K
  pagesize, 21780 bytes for 64K pagesize).

- Value size: minimum 0, maximum 2146435072 (0x7FF00000) bytes for maps,
  ≈¼ pagesize for multimaps (1348 bytes for default 4K pagesize, 21828
  bytes for 64K pagesize).

- Write transaction size: up to 4194301 (0x3FFFFD) pages (16 GiB for
  default 4K pagesize, 256 GiB for 64K pagesize).

- Database size: up to 2147483648 pages (8 TiB for default 4K pagesize,
  128 TiB for 64K pagesize).

- Maximum sub-databases: 32765.
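
The byte sizes quoted in the limits above follow from multiplying the
page counts by the page size. The short check below reproduces that
arithmetic. The constant and function names are made up for this
illustration and are not libmdbx identifiers; only the numbers come from
the list above.

```python
# Page counts and byte limits quoted in the list above.
MAX_DB_PAGES = 2147483648      # database size limit, in pages
MAX_TXN_PAGES = 4194301        # write transaction limit, in pages (0x3FFFFD)
MAX_VALUE_BYTES = 2146435072   # value size limit for maps (0x7FF00000)

GIB = 1024 ** 3
TIB = 1024 ** 4


def limit_bytes(pages: int, page_size: int) -> int:
    """Return the size in bytes of `pages` pages of `page_size` bytes each."""
    # The page size must be a power of 2 and at most 65536 bytes.
    if page_size <= 0 or page_size & (page_size - 1) or page_size > 65536:
        raise ValueError("page size must be a power of 2, at most 65536")
    return pages * page_size


for page_size in (4096, 65536):
    db = limit_bytes(MAX_DB_PAGES, page_size)
    txn = limit_bytes(MAX_TXN_PAGES, page_size)
    print(f"{page_size:>5}-byte pages: database {db / TIB:.0f} TiB, "
          f"write transaction {txn / GIB:.2f} GiB")
```

Note that the write transaction limit is three pages short of a round
power of two, so the "16 GiB" and "256 GiB" figures are rounded.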


Gotchas
=======

- There cannot be more than one writer at a time, i.e. no more than one
  write transaction at a time.

- libmdbx is based on a B+ tree, so access to database pages is mostly
  random. Thus SSDs provide a significant performance boost over
  spinning disks for large databases.

- libmdbx uses shadow paging instead of WAL. Thus syncing data to disk
  might be a bottleneck for write-intensive workloads.

- libmdbx uses copy-on-write for snapshot isolation during updates, but
  read transactions prevent recycling of old retired/freed pages, since
  they are still reading them. Thus altering data during a parallel
  long-lived read operation will increase the process working set, may
  exhaust the entire free database space, can make the database grow
  quickly, and results in performance degradation. Try to avoid
  long-running read transactions.

- libmdbx is extraordinarily fast and provides minimal overhead for data
  access, so it is worth reconsidering brute-force techniques and
  double-checking your code. On the one hand, with libmdbx a simple
  linear search may be more profitable than complex indexes. On the
  other hand, if you do something suboptimally, you may notice the
  detrimental effect only on sufficiently large data.
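
The long-lived reader gotcha above can be made concrete with a toy
page allocator. This is a simplified sketch, not libmdbx's garbage
collector: all names are hypothetical, and each write transaction simply
rewrites one data page copy-on-write. A retired page is reused only when
no active reader holds a snapshot older than the transaction that
retired it.

```python
class ToyPageRecycler:
    """Toy model of retired-page reuse (illustrative, not libmdbx)."""

    def __init__(self):
        self.txn_id = 0      # id of the last committed write transaction
        self.retired = []    # (id of the txn that retired it, page number)
        self.readers = []    # snapshot ids pinned by active readers
        self.next_page = 0   # next never-used page, i.e. the file size

    def begin_read(self):
        self.readers.append(self.txn_id)
        return self.txn_id

    def allocate(self):
        # Oldest snapshot still in use; with no readers, any page that
        # was retired by a committed transaction is free to reuse.
        oldest = min(self.readers, default=self.txn_id)
        for i, (retired_by, page) in enumerate(self.retired):
            if retired_by <= oldest:   # no reader can still see this page
                del self.retired[i]
                return page
        page = self.next_page          # nothing reusable: grow the file
        self.next_page += 1
        return page

    def update(self, old_pages):
        """One write transaction rewriting `old_pages` copy-on-write."""
        new_txn = self.txn_id + 1
        new_pages = [self.allocate() for _ in old_pages]
        self.retired += [(new_txn, p) for p in old_pages]
        self.txn_id = new_txn          # commit
        return new_pages


def file_pages_after(updates, long_reader):
    """Return the file size in pages after `updates` write transactions."""
    db = ToyPageRecycler()
    pages = [db.allocate()]            # one initial data page
    if long_reader:
        db.begin_read()                # reader that never finishes
    for _ in range(updates):
        pages = db.update(pages)
    return db.next_page


print(file_pages_after(100, long_reader=False))
print(file_pages_after(100, long_reader=True))
```

In this model the file stays at two pages when no reader is active,
while a single stalled reader forces every update to take a fresh page.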

--
Leonid Yuriev (1):
  package/libmdbx: new package (library/database).

 DEVELOPERS                   |  3 +++
 package/Config.in            |  1 +
 package/libmdbx/Config.in    | 45 ++++++++++++++++++++++++++++++++++++
 package/libmdbx/libmdbx.hash |  5 ++++
 package/libmdbx/libmdbx.mk   | 33 ++++++++++++++++++++++++++
 5 files changed, 87 insertions(+)
 create mode 100644 package/libmdbx/Config.in
 create mode 100644 package/libmdbx/libmdbx.hash
 create mode 100644 package/libmdbx/libmdbx.mk

-- 
2.29.2


