PAPER: conclusion and minor improvements.

Philippe Pittoli 2025-03-31 06:06:03 +02:00
parent 69fc674a2a
commit b910319937


@@ -244,9 +244,10 @@ Section 6 is about memory-constrained environments.
Section 7 presents a few experiments to provide an overview of the performance you can expect from this approach.
Section 8 describes the limitations of DODB and its current implementation.
Section 9 presents the related work, alternative approaches and implementations.
Section 10 gives an overview of the current state of affairs regarding security in DODB.
Section 11 lays out future work on this project.
Section 12 presents a real-world usage of DODB.
Finally, section 13 provides a conclusion.
.
.SECTION How DODB works and basic usage
DODB is a hash table.
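Retrieving, updating and removing a value therefore boils down to hash-table operations on an integer key; a rough illustration follows (the exact calls are shown in the appendix, the names below are assumptions).
.QP
.SOURCE Ruby ps=9 vs=10
# Illustrative only: DODB used as a hash indexed by an integer key.
car = cars[42]      # retrieval by key
cars[42] = new_car  # update by key
cars.delete 42      # removal by key
.SOURCE
.QE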
@@ -1500,19 +1501,31 @@ are complex KV stores with a lot of features, including support for many typed d
Many other KV stores can be mentioned, such as
.B LevelDB
(embedded, log-structured merge-tree \*[*]),
.FOOTNOTE1
The
.dq "log-structured merge-tree"
algorithm is a simple improvement for write-intensive databases (handling logs, for instance).
Newer data is first stored in RAM then
.UL sequentially
written to disk (high throughput).
.FOOTNOTE2
and
.B RocksDB
(fork of LevelDB with added features, such as transactions, snapshots, bloom filters, optimizations for multi-CPUs, etc.),
.B Cassandra
(log-structured merge-tree, written in Java),
.B ScyllaDB
(roughly a C++ rewrite of Cassandra with modern optimization techniques),
.B CockroachDB
(proprietary, distributed, ACID transactions), etc.
Features vary, but all these implementations of KV stores are actually very efficient on CRUD operations compared to SQL databases.
.KS
.BULLET
.B "Document databases" .
Many document-oriented databases exist beside DODB.
For example,
.B CouchDB
(distributed, fault-tolerant, RESTful HTTP and JSON API…),
@@ -1520,26 +1533,10 @@ For example,
.B MongoDB
(proprietary, ACID transactions, replication…),
.B UnQLite
(embedded, ACID transactions, embedded scripting language…).
As far as the author knows, none of them is as simple as DODB.
.KE
.ENDBULLET
.SECTION DODB and security
Right now, security isn't managed in DODB at all.
DODB isn't vulnerable to SQL injections, but an internet-facing application may encounter a few other problems including, but not limited to, buffer overflows and code injection.
However, a few security mechanisms exist to prevent data leaks or data modification by an outsider, and the DODB library may implement some of them in the future.
@@ -1602,22 +1599,48 @@ The design of these functions is simple: applications often have an
.I initialization
phase during which the connections are made or files are opened (including configuration files),
then comes the
.I running
phase, during which the application needs fewer privileges.
Therefore, an application can access whatever it needs for its initialization phase, which is less prone to attacks, then restrict its own rights over syscalls and files before accepting connections from the internet (see the sketch after this list).
For example, a web server can read its configuration file to learn the path to the files to serve (the websites), then prevent itself from accessing any other file (including its own configuration file) before serving the websites.
In-app mechanisms such as these greatly simplify the configuration.
Security parameters related to the filesystem don't need to be kept in sync with the configuration of the application.
Also, any syscall that is irrelevant for the
.I running
phase can be disallowed without fuss, which makes pledge+unveil inherently safer than AppArmor and the like.
.ENDBULLET
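The sketch below illustrates this init-then-restrict pattern from a Crystal application; the bindings, paths and promise strings are illustrative assumptions, not part of the DODB library.
.QP
.SOURCE Ruby ps=9 vs=10
# Sketch: binding OpenBSD's pledge(2) and unveil(2) from Crystal.
lib LibC
  fun pledge(promises : Char*, execpromises : Char*) : Int
  fun unveil(path : Char*, permissions : Char*) : Int
end

# Initialization phase: full rights, read the configuration.
config = File.read "app.conf" # hypothetical configuration file

# Restrict the visible filesystem to the database directory.
LibC.unveil "storage", "rwc" # "storage" is an assumed DODB directory
LibC.unveil nil, nil         # lock unveil: no further changes allowed

# Restrict the available syscalls for the running phase.
LibC.pledge "stdio rpath wpath cpath inet", nil

# Running phase: accept connections, serve requests.
.SOURCE
.QE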
A limitation common to all the above mechanisms (AppArmor and pledge+unveil): by default, without taking a deep dive into the software architecture, none of them prevents a user from accessing the entirety of the database.
A malicious user who successfully takes control of the application can still open files (at least in the DODB directory) and read the application's memory (including cached data).
.
.
.SECTION Future work
This section presents all the features I want to see in a future version of the DODB library.
.
.SS New types of storage facility
The Log-Structured Merge-Tree algorithm is interesting for databases with intensive updates.
Database modifications are buffered, then written sequentially, greatly improving throughput.
Implementing this algorithm in DODB (while still keeping an eye on the code complexity) could open new possibilities, bringing DODB to a new class of usage.
Storing data in separate files as it's currently done is great in many aspects but becomes cumbersome with large databases.
One way to enable large databases in DODB could be to add a new storage class that works differently, but this would inevitably introduce complexity.
Another way could be to implement a new file system dedicated to storing a massive number of small files, which ultimately is more interesting than adding complexity to the library and may be useful beyond DODB.
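As a first intuition, such a storage facility could start as a mere in-memory write buffer flushed sequentially to segment files; the following sketch is a hypothetical illustration (reads and compaction are omitted), not a planned API.
.QP
.SOURCE Ruby ps=9 vs=10
# Hypothetical LSM-style write buffer, for illustration only.
class MemTable
  def initialize(@dir : String, @max_entries = 1024)
    @buffer  = Hash(Int32, String).new
    @segment = 0
  end

  # Writes land in RAM first: cheap, no disk seek involved.
  def []=(key : Int32, value : String)
    @buffer[key] = value
    flush if @buffer.size >= @max_entries
  end

  # The whole buffer is sorted then written sequentially
  # in a single pass, which maximizes disk throughput.
  private def flush
    File.open "#{@dir}/segment-#{@segment}", "w" do |file|
      @buffer.to_a.sort_by(&.first).each do |key, value|
        file.puts "#{key} #{value}"
      end
    end
    @segment += 1
    @buffer.clear
  end
end
.SOURCE
.QE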
.
.SS New types of triggers
Some operations are (rightfully!) not handled in DODB, such as text searches.
Triggers could be implemented to provide data to external tools in order to enable such operations.
Also,
.I "analytical triggers"
could be implemented to provide statistics about database usage: triggers activated on database access rather than modification.
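As an illustration, such a trigger could amount to a counter incremented on each read; the access hook below is a hypothetical API, nothing calls it in DODB today.
.QP
.SOURCE Ruby ps=9 vs=10
# Sketch of an analytical trigger (hypothetical API).
class AccessStats(V)
  getter reads = Hash(Int32, Int32).new(0)

  # Would be called by the database on each retrieval,
  # unlike current triggers which react to modifications.
  def on_access(key : Int32, value : V)
    @reads[key] += 1
  end

  # Usage statistics: the most frequently read entries.
  def most_read(count = 10)
    @reads.to_a.sort_by { |_, n| -n }.first(count)
  end
end
.SOURCE
.QE
.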
.SS Pagination via the indexes: offset and limit
Right now, browsing the entire database by requesting a limited list at a time is possible, thanks to some functions accepting an
.I offset
and a
.I size .
However, this is not possible with the indexes: when querying a partition, for example, the API provides the entire list of matching values.
This is not acceptable for databases with large partitions and tags: memory will be over-used and requests will be slow.
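The change could be as small as accepting the same two parameters in the index-based retrieval functions; the call below is hypothetical, written by analogy with the current API.
.QP
.SOURCE Ruby ps=9 vs=10
# Hypothetical: paginated retrieval from a partition.
# Only 20 matching values are materialized, starting at entry 100.
red_cars = cars_by_color.get "red", offset: 100, limit: 20
.SOURCE
.QE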
.
.SECTION Real-world usage: netlibre
DODB instances have been deployed in a real-world setting by the netlibre service.
This section presents this service and its use of DODB, showing how this method of handling data can be used in conventional online services.
@@ -1677,29 +1700,60 @@ It's almost as if the application intentionally avoids any possible optimization.
Especially given that the number of actual requests is expected to be around 10 requests per second on busy days.
.FOOTNOTE2
Indexes with a filesystem representation enable quick debugging sessions and a few basic tasks (such as listing all the domains of a user) which, in practice, is great to have at our fingertips with simple Unix tools.
.
.SECTION Conclusion
Thanks to its unusual design choices, trading most features for simplicity and letting users implement their own solutions around the few operations provided by DODB (mostly focused on CRUD operations), the complexity of the library is kept to a minimum.
Thus, as far as the author knows, DODB is substantially simpler than any of the other databases, including those in its own category.
Despite its simplicity, DODB provides several storage options, from a database without any data cache for very memory-constrained environments to a RAM-based solution for volatile data.
The
.I common
database, an on-disk storage facility with a configurable cache size, should be an acceptable choice for most applications.
The RAM-only database is a great tool for volatile data since it shares the same API with the other databases (thus triggers are available), which enables a unified way to manipulate data in the codebase.
DODB implements
.dq triggers ,
a trivial facility to execute code on database modification (insert, update and removal operations), enabling all kinds of things in a standardized way\*[*] that anyone can use to adapt DODB to their own needs.
.FOOTNOTE1
Indexes, partitions and tags are all based on this simple mechanism.
.FOOTNOTE2
Some dedicated
.dq triggers
(Index, Partition and Tags) are implemented in DODB to enable fast retrieval of data based on their attributes (respectively 1-1, 1-n and n-n relations), which, again, should be sufficient for most applications.
These triggers provide an on-disk representation of the current state of the database to easily manipulate data with simple utilities such as
.dq jq
(data is by default serialized in JSON) or even the usual Unix-y commands (such as
.I ls ,
.I cd ,
etc.).
DODB even goes as far as implementing those triggers in several ways to further improve performance, by adding a data cache or by completely removing the on-disk representation.
DODB is simple enough that any developer can read its entire source code in a morning then start tinkering with it in the afternoon.
The storage codebase only has 457 lines of code (including all the storage options) and the triggers codebase only has 520 lines of code (including all the triggers and their alternative implementations).
Thus, anyone can shape the code to their liking or add alternative options for storage and triggers: be creative!
The very few implementation choices (JSON, a single document per file, symlinks for indexes, partitions and tags) can be changed in a matter of hours.
DODB has been used in a real-life scenario for the netlibre project, enabling the developer to focus on the core features of the application.
For example, database management only took a few dozen lines of code in a 3 kLOC project (dnsmanagerd), most of them setting up the different databases (storage and triggers) and the rest performing CRUD operations (each of them only requiring a single line of code).
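The following sketch gives an idea of such a setup; the constructor names follow the pattern of the new_RAM_tags call shown in the appendix and should be read as illustrative rather than authoritative.
.QP
.SOURCE Ruby ps=9 vs=10
# Illustrative setup: one storage facility, three triggers.
cars = DODB::Storage::Common(Car).new "db-cars", 50_000 # assumed constructor
cars_by_name     = cars.new_index     "name",     &.name
cars_by_color    = cars.new_partition "color",    &.color
cars_by_keywords = cars.new_tags      "keywords", &.keywords

# Each CRUD operation then takes a single line.
cars << Car.new "Corvet", "red", ["elegant", "fast"] # hypothetical initializer
.SOURCE
.QE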
DODB won't power the next
.I "AI thing" ,
it will never handle databases with petabytes of data nor revolutionize cryptocurrency.
However, DODB may be a better fit than traditional databases for your next blog, your gaming forum, the software forge of your dreams and maybe your future MMORPG (if you're skilled enough).
.APPENDIX LRU vs Efficient LRU
DODB uses a Least Recently Used algorithm for the eviction policy of the data cache in the
.I Common
database.
In the pursuit of simplicity, I first implemented this algorithm in 3 lines of code (19 lines for the whole class), using only a dynamic array.
This overly simplistic implementation has an
.I O(n)
complexity, which isn't acceptable for a real-life scenario.
Thus, I then implemented a more efficient LRU algorithm in 16 LOC (31 LOC for the whole class) which has essentially constant
.I O(1)
complexity and uses a doubly-linked list (231 LOC) and a lookup table (the Hash class provided by the Crystal standard library).
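For reference, the simplistic version boils down to the following sketch, reconstructed for illustration rather than copied verbatim from the DODB sources.
.QP
.SOURCE Ruby ps=9 vs=10
# Naive LRU: a dynamic array ordered from least to most recently used.
class NaiveLRU(K)
  def initialize(@max_entries : Int32)
    @order = Array(K).new
  end

  # The whole algorithm: deletion is an O(n) linear scan,
  # then push to the back, evict the front when over capacity.
  def used(key : K)
    @order.delete key
    @order.push key
    @order.shift if @order.size > @max_entries
  end
end
.SOURCE
.QE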
The following figure presents a performance comparison of both implementations.
.ps -2
.so graphs/addition_lru.grap
.ps \n[PS]
@@ -1735,7 +1789,7 @@ delim off
This figure shows the request durations to retrieve data based on a partition containing up to 10k entries.
.QE
As we see in the figure, the duration for data retrieval grows almost linearly for databases with a sufficient cache size (starting with 10k entries).
When the cache size is not sufficient, the requests are a hundred times slower, which explains why the database with a cache size of one thousand entries isn't even represented in the graph, and why the 5k database has great results up to 5k partitions.
.ps -2
.so graphs/lru_query_tag.grap
.ps \n[PS]
@@ -1793,7 +1847,7 @@ end
.SOURCE
.QE
.
.SS Data retrieval, update and deletion with the key (integer associated to the value)
.KS
.QP
.SOURCE Ruby ps=9 vs=10
@@ -1836,7 +1890,7 @@ cars_by_keywords = cars.new_RAM_tags "keywords", &.keywords
.QE
.
.
.SS Data retrieval, update and deletion with an attribute (index, partition, tags)
.
.QP
.SOURCE Ruby ps=9 vs=10