Introduction + s/hash/lookup table/

Philippe Pittoli 2025-04-02 02:36:50 +02:00
parent b910319937
commit df565f0426

@ -75,7 +75,7 @@ Document sync'ed with DODB \*[VERSION]
.br
.po
.B "Status of this document" .
Although fairly advanced, this document lacks a few reviews, a bit of discussion about filesystems (section not entirely finished), to talk about alternatives to DODB and a final conclusion.
Fairly advanced, this document mostly lacks a few reviews.
.SECTION Introduction to DODB
A database manages data, enabling queries to add, to retrieve, to modify and to delete a piece of information.
@ -91,13 +91,13 @@ These two concepts are closely interlinked and require a brief explanation.
are built around the idea to describe data to a database engine so it can optimize operations and storage.
Data is put into
.I tables ,
with each column being an attribute of the stored data and each line being a new entry.
each column being an attribute of the stored data and each line being a new entry.
A database is a list of tables with relations between them.
As an example, let's imagine a database of a movie theater.
The database will have a
.I table
for the list of movies they have
for the list of movies
.PRIMARY_KEY idmovie , (
title, duration, synopsis),
a table for the scheduling
@ -118,7 +118,7 @@ SQL also enables administrative operations on the databases themselves: creating
SQL is used between the application and the database, to perform operations and to provide results when due.
SQL is also used
.UL outside
the application, by admins for managing databases and potentially by
the application by admins to manage databases and potentially by
.I non-developer
users to retrieve data without a dedicated interface\*[*].
.FOOTNOTE1
@ -136,7 +136,7 @@ thus, SQL databases can be scripted to automate operations and to provide a mass
.I "stored procedures" , (
see
.I "PL/SQL" ).
Moreover, the latency between the database and the application makes internet-facing applications require parallelism to handle a high number of clients (or even moderate by today's standards), via multiple threads or concurrent applications.
Moreover, the latency between the database and the application makes internet-facing applications require parallelism to handle a high number of clients (or even moderate by today's standards) via multiple threads or concurrent applications.
Furthermore, writing SQL requests requires a lot of boilerplate since there is no integration in the programming languages, leading to multiple function calls for any operation on the database;
thus, object-relational mapping (ORM) libraries were created to reduce the massive code duplication.
And so on.
@ -170,10 +170,13 @@ Since homogeneity is not necessary anymore, databases have fewer (or different)
Document-oriented databases are a sub-class of key-value stores, where metadata can be extracted from the entries for further optimizations.
And that's exactly what is being done in Document Oriented DataBase (DODB).
.UL "The stated goal of DODB"
is to provide a simple and easy-to-use
.UL library \*[*]
for developers to store documents (undescribed data structures).
.CITATION1
The stated goal of DODB is to provide a
.UL simple
and
.UL easy-to-use
library\*[*] for developers to store documents (undescribed data structures).
.CITATION2
.FOOTNOTE1
Or as people might call it:
.dq "serverless architecture" .
@ -182,34 +185,32 @@ Or as people might call it:
.STARTBULLET
.KS
.BULLET
.B Simple ,
because the approach is indeed trivial: the database entries are written as simple files in a directory.
This simplicity has a snowballing effect: it only requires a few dozen lines of code.
.B "DODB is simple" :
each database entry is written in a plain file, serialized in JSON.
DODB is implemented in only a thousand lines of code in total, despite including optional features and optimized alternative implementations to make the library efficient and cover most cases.
.KE
DODB doesn't strive to be minimalistic, but it avoids intermediary language and low-level optimizations.
Storing data is writing a file.
Indexing data is making symbolic links.
DODB doesn't strive to be minimalistic but avoids intermediary language and low-level optimizations.
Storing data is writing a file, indexing data is making symbolic links.
It is that simple.
.KS
.BULLET
.B Easy-to-use ,
because the API is high-level and doesn't take any superfluous parameter.
.B "DODB is easy-to-use" :
the API is high-level and doesn't take any superfluous parameter.
Creating a database only requires a path, updating an entry only requires the new version of the entry, and so on.
Everything is designed to be enjoyable for the developers.
.KE
.ENDBULLET
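The two statements above (storing data is writing a file, indexing data is making symbolic links) can be illustrated with a minimal Ruby sketch.
This is not DODB's code or API: the class name TinyStore, its methods and the directory layout (data/ and indexes/by_name/) are assumptions made for the illustration, and the indexed attribute is assumed to be a unique "name" field.

```ruby
require "json"
require "fileutils"
require "tmpdir"

# Hypothetical helper (not DODB's API): one JSON document per file,
# indexed on a single attribute through symbolic links.
class TinyStore
  def initialize(path)
    @data_dir  = File.join(path, "data")
    @index_dir = File.join(path, "indexes", "by_name")
    FileUtils.mkdir_p(@data_dir)
    FileUtils.mkdir_p(@index_dir)
    @next_key = 0
  end

  # Storing data is writing a file named after an auto-incremented key.
  def add(document)
    key = @next_key
    @next_key += 1
    file = File.join(@data_dir, format("%010d", key))
    File.write(file, JSON.generate(document))
    # Indexing data is making a symbolic link that points to that file.
    link = File.join(@index_dir, document.fetch("name"))
    File.symlink(File.expand_path(file), link)
    key
  end

  # Direct retrieval requires the key of the entry.
  def get(key)
    JSON.parse(File.read(File.join(@data_dir, format("%010d", key))))
  end

  # Following the symlink retrieves the document without scanning the data.
  def get_by_name(name)
    JSON.parse(File.read(File.join(@index_dir, name)))
  end
end

# Usage on a temporary directory (assumes a platform with symlink support).
Dir.mktmpdir do |dir|
  store = TinyStore.new(dir)
  key = store.add({ "name" => "corvet", "color" => "red" })
  puts store.get(key)["color"]
  puts store.get_by_name("corvet")["color"]
end
```

Since the index is a plain symlink, the same lookup can be done outside the application with ordinary unix tools (for example by reading the link target).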
DODB aims for small and medium-size projects\*[*], up to a few hundred million entries with commodity hardware.
In its current form and on commodity hardware, DODB aims for projects with up to a few hundred million entries\*[*] and a few hundred thousand requests per second.
.FOOTNOTE1
There are no real hard limits besides those of the underlying filesystem: DODB can accept billions of entries.
.br
See the section
.dq "Limits of DODB" .
.FOOTNOTE2
Its simplicity (approach and code) makes trivial any modification for specific needs.
However, its simplicity (approach and code) enables quick adaptations for specific needs.
DODB may be a great starting point to implement more sophisticated features for creative minds.
.UL "Contrary to SQL" ,
@ -219,12 +220,13 @@ DODB doesn't provide an interactive shell, there is no request language to perfo
Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread unix tools.
Simplicity is key.
Traditional SQL relational databases have a snowballing effect on code complexity, even for applications with basic requirements.
Furthermore, data description in tables and relations is not intuitive contrary to storing whole documents which is simply serializing structures used in the code.
Traditional SQL databases have a snowballing effect on code complexity even for applications with basic requirements.
Data description in tables and relations is not intuitive and requires to adapt the application to the database.
DODB stores whole documents instead, which simply means to serialize data structures already used in the application.
.UL "Contrary to other NoSQL databases" ,
DODB isn't an application but a library.
The idea is to help developers to store their data themselves, not depending on
Developers store their data themselves without depending on
. I yet-another-all-in-one
massive tool.
The library writes (and removes) data on a storage device, has a few retrieval and update mechanisms and that's it\*[*].
@ -235,10 +237,9 @@ the feature.
Yet, the tool is expected to be convenient for most applications.
.FOOTNOTE2
Section 2 provides an extensive documentation on how DODB works and how to use it.
This section also presents the concept of "triggers" (automatic actions on database modification).
Section 2 provides an extensive documentation on how DODB works, including how to use DODB and the concept of "triggers" (automatic actions on database modification).
Section 3 introduces caches in both the database and triggers.
Section 4 presents the Common database, an implementation of DODB that should be relevant for most applications.
Section 4 presents the Common database, a storage facility that should be relevant for most applications.
Section 5 presents the RAM-only database, for short-lived (temporary) data.
Section 6 is about memory-constrained environments.
Section 7 presents a few experiments to provide an overview of the performance you can expect from this approach.
@ -250,8 +251,7 @@ Section 12 presents a real-world usage of DODB.
Finally, section 13 provides a conclusion.
.
.SECTION How DODB works and basic usage
DODB is a hash table.
The key of the hash is an auto-incremented number and the value is the stored data.
DODB is a lookup table: the key is an auto-incremented number and the value is the stored data.
This section explains how to use DODB for basic cases including the few added mechanisms to speed-up searches.
Also, the filesystem representation of the data is presented since it enables easy off-application searches.
@ -299,7 +299,7 @@ CBOR
is a work-in-progress.
Nothing binds DODB to a particular format.
.FOOTNOTE2
The key of the hash is a number, auto-incremented, used as the name of the stored file.
The key of the lookup table is an auto-incremented number used as the name of the stored file.
The following example shows the content of the file system after adding the first car.
.TREE1
$ tree db-cars/
@ -310,7 +310,7 @@ db-cars/
.TREE2
In this example, the directory
.I db-cars/data
contains the serialized value, with a formatted number as file name.
contains the serialized value with a formatted number as file name.
The file "0000000000" contains the following:
.QP
.SOURCE JSON ps=9 vs=10
@ -328,7 +328,7 @@ The car is serialized as expected in the file
.QE
.
.
The key of the entry (its number) is required to retrieve, to modify or to delete it.
The key of the entry (its number) is required to directly modify the database entries.
.
.QP
.SOURCE Ruby ps=9 vs=10
@ -355,13 +355,12 @@ end
.SOURCE
.QE
Of course, browsing the entire database to find a value (or its key) is a waste of resources and isn't practical for any non-trivial database.
That is when indexes come into play.
Data needs to be indexed, which is done in DODB via
.dq triggers .
.
.
.SS Triggers
A simple way to quickly retrieve a piece of data is to create
.I indexes
based on its attributes.
A simple way to quickly retrieve a piece of data is to index it based on its attributes.
When a value is inserted, modified or deleted from the database, an action can be performed automatically thanks to a user recorded callback.
Callbacks are named
.I triggers
@ -559,7 +558,7 @@ Also, this can be as easily hidden in a very nice user-friendly command.
.
.
.SSS Side note about triggers
DODB presents a few possible triggers (basic indexes, partitions and tags) which respond to an obvious need for fast searches and retrevial.
DODB presents a few possible triggers (basic indexes, partitions and tags) which respond to an obvious need for fast searches and retrieval.
However, the implementation, which makes heavy use of the filesystem by creating symlinks, comes from a certain vision of how a database could behave to provide a practical way for users to query the database
.UL "outside the application" .
@ -583,13 +582,13 @@ The filesystem representation (of data and indexes) is convenient for the admini
Storing the data on a storage device is required to protect it from crashes and application restarts.
But data can be kept in memory for faster processing of requests.
The DODB library has an API close to a hash table.
Having a data cache is as simple as keeping a hash table in memory besides providing a filesystem storage, the retrieval becomes incredibly fast\*[*].
The DODB library has an API close to a lookup table.
Having a data cache is as simple as keeping a lookup table in memory besides providing a filesystem storage, the retrieval becomes incredibly fast\*[*].
.FOOTNOTE1
Several hundred times faster, see the experiment section.
.FOOTNOTE2
Same thing for cached indexes.
Indexes can easily be cached, thanks to simple hash tables.
Indexes can easily be cached, thanks to lookup tables.
.B "Cached database" .
A cached database has the same API as the other DODB databases and keeps a copy of the entire database in memory for fast retrieval.
@ -647,7 +646,7 @@ thus it is moved at the start of the set.
In case the number of entries exceeds what is allowed, the least recently used value is therefore the last element of the set.
.B "Implementation details" .
The LRU strategy is both simple and can be easily implemented efficiently with a double-linked list and a hash table.
The LRU strategy is both simple and can be easily implemented efficiently with a double-linked list and a lookup table.
The implementation is time-efficient;
the time spent adding a value is almost constant, it doesn't change much with the number of entries.
This efficiency is a memory tradeoff.
@ -656,9 +655,9 @@ All the entries are added to a
(to keep track of the order of the added keys)
.UL and
to a
.B "hash table"
.B "lookup table"
to perform efficient searches of the keys in the list.
Thus, all the nodes are added twice, once in the list, once in the hash.
Thus, all the nodes are added twice, once in the list, once in the lookup table.
This way, adding, removing and searching for an entry in the list is fast,
no matter the size of the list.
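The combination described above (a double-linked list for the usage order plus a lookup table to jump to the right node) can be sketched as follows in Ruby.
This is a minimal illustration, not DODB's implementation: the class name LRUKeys, the method names and the choice to return the evicted key are assumptions.

```ruby
# Hypothetical sketch (not DODB's code): tracks keys from the most to the
# least recently used with a double-linked list and a lookup table.
class LRUKeys
  Node = Struct.new(:key, :prev_node, :next_node)

  def initialize(max_entries)
    @max_entries = max_entries
    @nodes = {}   # lookup table: key => node of the list
    @head = nil   # most recently used
    @tail = nil   # least recently used
  end

  # Marks a key as just used. Returns the evicted key, or nil.
  def touch(key)
    node = @nodes[key]
    if node
      unlink(node)               # already known: detach it from the list
    else
      node = Node.new(key, nil, nil)
      @nodes[key] = node         # each entry lives in the list and the table
    end
    push_front(node)
    return nil if @nodes.size <= @max_entries

    evicted = @tail              # the least recently used is the last element
    unlink(evicted)
    @nodes.delete(evicted.key)
    evicted.key
  end

  # Keys ordered from the most to the least recently used.
  def keys
    result = []
    node = @head
    while node
      result << node.key
      node = node.next_node
    end
    result
  end

  private

  # Detaches a node in constant time thanks to its two neighbours.
  def unlink(node)
    if node.prev_node
      node.prev_node.next_node = node.next_node
    else
      @head = node.next_node
    end
    if node.next_node
      node.next_node.prev_node = node.prev_node
    else
      @tail = node.prev_node
    end
    node.prev_node = nil
    node.next_node = nil
  end

  def push_front(node)
    node.next_node = @head
    @head.prev_node = node if @head
    @head = node
    @tail ||= node
  end
end
```

Each call to touch performs one table lookup and a fixed number of pointer updates, so its cost does not depend on the number of tracked keys; the price is that every key is stored twice.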
@ -703,7 +702,7 @@ This includes the same triggers too, so a filesystem representation of the curre
.I RAM-only
also means incredible performances since DODB only is a
.I very
small layer over a hash table.
small layer over a lookup table.
.
.
.SS RAM-only database
@ -865,9 +864,9 @@ The request is a little longer when the index isn't cached (see \f[CW]Uncached d
The logarithmic scale version of this figure shows that \fIRAM-only\f[] and \fIcached\f[] databases have exactly the same performance.
The \fIcommon\f[] database spends 80 ns for its LRU caching eviction policy\*[*], making this database about 67% slower than the previous ones to retrieve a value.
.FOOTNOTE1
The LRU policy in DODB is implemented with a double-linked list and a hash table.
The LRU policy in DODB is implemented with a double-linked list and a lookup table.
When a value is retrieved or modified, its key is put at the start of a list so the list order represents values from the most to the least recently used.
Also, a hash table is maintained to quickly jump to the right list entry.
Also, a lookup table is maintained to quickly jump to the right list entry.
Both these operations take time.
.FOOTNOTE2
Uncached databases are far away from these results, as shown by the logarithmically scaled figure.
@ -943,7 +942,7 @@ The number of cars retrieved scales from 1000 to 5000.
.
Tag and partition indexes request durations are similar because both are fundamentally the same thing:
.ENUM both tag and partition indexes enable to retrieve a list of entries;
.ENUM the keys of the database entries come from listing the content of a directory (uncached indexes) or are directly available from a hash (cached indexes);
.ENUM the keys of the database entries come from listing the content of a directory (uncached indexes) or are directly available from a lookup table (cached indexes);
.ENUM data is retrieved irrespective of the index, it is either read from the storage device or retrieved from a data cache, which depends on the type of database.
.ENDENUM
@ -1014,7 +1013,7 @@ Caching the value enables a massive performance gain, data can be retrieved seve
The more entries requested, the slower it gets; but more importantly, the poorer performances it gets
.UL "per entry" .
The eviction policy implies poorer performances since it requires a few list and hash table operations, even if the current implementation (based on the LRU algorithm) is fairly simple and efficient.
The eviction policy implies poorer performances since it requires a few list and lookup table operations, even if the current implementation (based on the LRU algorithm) is fairly simple and efficient.
To put things into perspective, requesting several thousand entries in DODB based on an index (partition or tag) is as slow as getting
.B "a single entry"
@ -1702,14 +1701,14 @@ Especially given that the number of actual requests is expected to be around 10
Indexes with a filesystem representation enable quick debugging sessions and a few basic tasks (such as listing all the domains of a user) which, in practice, is great to have at our fingertips with simple unix tools.
.
.SECTION Conclusion
Thanks to its unusual design choices, trading most features for simplicity and letting users implement their own solutions around the few operations provided by DODB (mostly focused on CRUD operations), the complexity of the library is kept at a minimum.
Thus, as far as the author knows, DODB is substantially simpler than any of the other databases, including those in its own category.
Thanks to its unusual design choices, trading most features for simplicity and letting users implement their own solutions around the few (mostly focused on CRUD) operations provided by DODB, the complexity of the library is kept at a minimum.
As far as the author knows, DODB is substantially simpler than any of the other databases, including those in the same category.
Despite its simplicity, DODB provides several storage options, from a database without any data cache for very memory-constrained environments to a RAM-based solution for volatile data.
The
.I common
database, an on-disk storage facility with a configurable cache size, should be an acceptable choice for most applications.
The RAM-only database is a great tool for volatile data since it shares the same API with the other databases (thus triggers are available), this enables a unified way to manipulate data in the codebase.
The RAM-only database is a great tool for volatile data since it shares the same API with the other databases (thus triggers are available) which enables a unified way to manipulate all kinds of data in the codebase, not only persistent data that is usually written in a traditional database.
DODB implements
.dq triggers ,
@ -1718,9 +1717,7 @@ a trivial facility to execute code based on database modification (insert, modif
Indexes, partitions and tags are all based on this simple mechanism.
.FOOTNOTE2
Some dedicated
.dq triggers
(Index, Partition and Tags) are implemented in DODB to enable fast retrieval of data based on their attributes (respectively 1-1, 1-n and n-n relations) which, again, should be sufficient for most applications.
Some dedicated triggers (Index, Partition and Tags) are implemented in DODB to enable fast retrieval of data based on their attributes (respectively 1-1, 1-n and n-n relations) which, again, should be sufficient for most applications.
These triggers provide an on-disk representation of the current state of the database to easily manipulate data with simple utilities such as
.dq jq
(data is by default serialized in JSON) or even the usual Unix-y commands (such as
@ -1729,7 +1726,7 @@ These triggers provide an on-disk representation of the current state of the dat
etc.).
DODB even goes as far as implementing those triggers in several ways to further improve performances by adding a data cache or by completely removing the on-disk representation.
DODB is simple enough so any developer can read its entire code source in a morning then start tinkering with it in the afternoon.
DODB is simple enough so any developer can read its entire source code in a morning then start tinkering with it in the afternoon.
The storage codebase only has 457 lines of code (including all the storage options) and the triggers codebase only has 520 lines of code (including all the triggers and their alternative implementations).
Thus, anyone can shape the code to their liking or add alternative options for storage and triggers: be creative!
The very few implementation choices (JSON, a single document per file, symlinks for indexes, partitions and tags) can be changed in a matter of hours.
@ -1737,9 +1734,9 @@ The very few implementation choices (JSON, a single document per file, symlinks
DODB has been used in a real-life scenario for the netlibre project enabling the developer to focus on core features of the application.
For example, database management only took a few dozen lines of code on a 3 kLOC project (dnsmanagerd), most of them being to setup the different databases (storage and triggers) and the rest to perform CRUD operations (each of them only requiring a single line of code).
DODB won't power the next
In its current form, DODB won't power the next
.I "AI thing" ,
it will never handle databases with petabytes of data nor revolutionize cryptocurrency.
it won't handle databases with petabytes of data nor revolutionize cryptocurrency.
However, DODB may be a better fit than traditional databases for your next blog, your gaming forum, the software forge of your dreams and maybe your future MMORPG (if you're skilled enough).
.APPENDIX LRU vs Efficient LRU