Limits of DODB++.

Philippe Pittoli 2025-02-05 06:42:23 +01:00
parent f4ab8154f9
commit 5dbc282027


@ -998,12 +998,13 @@ With Postgres, the request duration of a single value varies from 0.1 to 2 ms on
.
.
.SECTION Limits of DODB
DODB provides basic database operations such as storing, retrieving, modifying and removing data.
However, DODB doesn't fully handle ACID properties\*[*]: atomicity, consistency, isolation and durability.
This section presents the limits of DODB, whether the current implementation or the approach, and presents some suggestions to fill the gaps.
DODB provides basic database operations such as storing, retrieving, modifying and removing data, but doesn't fully handle the ACID properties or a few other aspects generally associated with databases\*[*].
.FOOTNOTE1
Traditional SQL databases handle ACID properties and may have created some "expectations" towards databases from a general public standpoint.
Traditional SQL databases may have created some "expectations" of databases among the general public, such as the ACID properties (atomicity, consistency, isolation and durability), transactions and replication.
.FOOTNOTE2
This section presents the limits of DODB, whether they stem from the current implementation or from the approach itself.
The state of filesystems is also discussed, since DODB heavily relies on the underlying filesystem.
Finally, this section presents some suggestions to close the gap with traditional databases on a few points.
.SS "Current state of DODB regarding ACID properties"
.STARTBULLET
@ -1046,8 +1047,9 @@ Data is written on disk each time it changes.
Again, this is basic but
.SHINE "good enough"
for most applications.
A future improvement could be to write a checksum for every file to detect corrupt data, but this overlaps with some filesystems which already provide this feature.
.ENDBULLET
A future improvement could be to write a checksum for every piece of written data, making it easy to detect and remove corrupt data from a database.
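The checksum idea can be illustrated with a minimal sketch.
It is written in Python for brevity and is not DODB code: the function names, the
.I .sha256
sidecar files and the choice of SHA-256 are assumptions made for this example only, and the database is assumed to be a flat directory of regular files.

```python
import hashlib
import tempfile
from pathlib import Path


def sidecar_of(path: Path) -> Path:
    """Hypothetical convention: the digest lives next to the data file."""
    return Path(str(path) + ".sha256")


def write_with_checksum(path: Path, data: bytes) -> None:
    """Write the data, then store its SHA-256 digest in a sidecar file."""
    path.write_bytes(data)
    sidecar_of(path).write_text(hashlib.sha256(data).hexdigest())


def is_intact(path: Path) -> bool:
    """Return True when the file content still matches its stored digest."""
    sidecar = sidecar_of(path)
    if not path.is_file() or not sidecar.is_file():
        return False
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return digest == sidecar.read_text().strip()


def remove_corrupt(directory: Path) -> list[str]:
    """Delete data files whose digest no longer matches; return their names."""
    removed = []
    # sorted() materializes the listing before anything is deleted.
    for entry in sorted(directory.iterdir()):
        if not entry.is_file() or entry.suffix == ".sha256":
            continue
        if not is_intact(entry):
            entry.unlink()
            sidecar_of(entry).unlink(missing_ok=True)
            removed.append(entry.name)
    return removed


if __name__ == "__main__":
    db = Path(tempfile.mkdtemp())
    write_with_checksum(db / "a.json", b'{"x": 1}')
    write_with_checksum(db / "b.json", b'{"y": 2}')
    (db / "b.json").write_bytes(b"garbage")  # simulate corruption
    print(remove_corrupt(db))
```

As the section notes, some filesystems already checksum their blocks, so a sidecar digest like this one would mostly be useful on filesystems that do not.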
.SS "Discussion on ACID properties"
First and foremost, both atomicity and isolation properties are inherently related to parallelism, whether through concurrent threads or applications.
@ -1058,13 +1060,13 @@ Therefore, DODB could theoretically serve millions of requests per second from a
.FOOTNOTE1
FYI, the service
.I netlib.re
uses DODB and since the database is fast enough, parallelism isn't required despite enabling several thousand requests per second.
uses DODB and, since the database is fast enough, parallelism isn't required despite handling several thousand requests per second in a virtual machine on low-end hardware released almost two decades ago.
.FOOTNOTE2
Considering this swiftness, parallelism may seem optional.
The consistency property is a safety net for potentially defective software.
Always nice to have, but not entirely necessary, especially for document-oriented databases.
Contrary to a traditional SQL database which often requires several modifications to different tables in one go to be kept consistent, a document-oriented database stores an entire document which already is internally consistent.
Contrary to a traditional SQL database, which often requires several modifications of different tables in one go to be kept consistent, a document-oriented database stores an entire document, which is already internally consistent.
When several documents are involved (which happens from time to time), consistency needs to be checked, but this may not require much code\*[*].
Not checking systematically for consistency upon every database modification is a tradeoff between simplicity and speed of the code on one hand, and security on the other.
.FOOTNOTE1
@ -1074,8 +1076,9 @@ Database verifications are just the last bastion against inserting junk data.
Moreover, the consistency property in traditional SQL databases is often used for simple tasks but quickly becomes difficult to deal with.
Some companies and organizations (Doctors Without Borders, for instance) cannot afford to implement all the preventive measures in their DBMSs due to the sheer complexity of it.
Instead, these organizations adopt curative measures that they may call "data-fix".
Thus, having some verifications in the database is not a silver bullet, it is complementary to other measures.
Instead, these organizations adopt curative measures that they may call
.dq data-fix .
Having verifications in the database is not a silver bullet but a complementary measure at most.
DODB may provide some form of atomicity and consistency at some point, but nothing fancy nor too advanced.
The whole point of the DODB project is to keep the code simple, hackable, enjoyable even.
@ -1086,7 +1089,36 @@ Which also results from a lack of time.
.SS "Beyond ACID properties \[en] modern databases' features"
Most current databases (traditional relational databases, some key-value databases and so on) provide additional features.
These features may include for example high availability toolsets (replication, clustering, etc.), some forms of modularity (several storage backends, specific interfaces with other tools, etc.), interactive command lines or shells, user and authorization management, administration of databases, and so on.
.STARTBULLET
.KS
.BULLET
.B "High availability toolsets"
(replication, clustering, etc.).
This simply doesn't match DODB's goal of providing a database for small projects.
These tools imply an unreasonable amount of code compared to the current DODB library.
.KE
However, some of these features could be provided by the filesystem itself.
.KS
.BULLET
.B Modularity
(several storage backends, specific interfaces with other tools, etc.).
.KE
.KS
.BULLET
.B "Interactive management"
(through command lines or a dedicated shell).
.KE
.KS
.BULLET
.B "Database administration"
(CRUD on databases themselves, user and authorization management, etc.).
.KE
.ENDBULLET
Because DODB is a library and doesn't support an intermediary language for generic requests,
.TBD
@ -1198,7 +1230,7 @@ Some filesystems added more than a decade ago then under-explored features such
.ds NOK \[tmu]
.nr total 16.0c
.nr col1 3.0c
.nr col2 (\n[total]-\n[col1])/3
.nr col2 (\n[total]-\n[col1])/6
.nr col3 (\n[total]-\n[col1]-\n[col2])
.\"total: \n[total]
.\"col1: \n[col1]
@ -1208,24 +1240,51 @@ Some filesystems added more than a decade ago then under-explored features such
allbox tab(:);
c | c | c
cw(\n[col1]u) | lw(\n[col2]u) | lw(\n[col3]u).
Feature : Traditional databases : Filesystems
CRUD operations : SQL : Files & directories
Feature : DBMS : Filesystems
CRUD operations : SQL :files & directories
Atomicity : \*[OK] :T{
transactions are implemented in a few filesystems (ex: BTRFS)
and there is a locking mechanism based on files
locking mechanism based on files
T}
Consistency : \*[OK] : \*[NOK]
Isolation : \*[OK] :T{
.dq "new file then mv"
technique
T}
Durability : \*[OK] : yes (checksums)
Access Time : 0.1 to 2ms : a few µs (cache) to a few ms (first access)
Transactions : :
Durability : \*[OK] :limited (checksums)
Access Time : 0.1 to 2ms :a few µs (cache) to a few ms (first access with a hard disk)
Transactions : \*[OK] :T{
implemented in a few filesystems (BTRFS, ZFS)
T}
Performance : \*[OK] :T{
B-trees and variants (used in all modern filesystems: BTRFS, ext4, Reiser4, NTFS, HAMMER…) are used to search for data on the storage device and also to look up an entry in a huge directory.
T}
Space waste :T{
almost none
.ps
T}:T{
depends on many factors, but generally significant
T}
.TE
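The
.dq "new file then mv"
technique listed in the table can be sketched as follows.
This is an illustrative Python example, not DODB code: the function name and the temporary file naming are assumptions.
It relies on the fact that a rename within a single filesystem is atomic on POSIX systems, so a reader sees either the old content or the new one, never a partially written file.

```python
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Replace the content of `path` without exposing a half-written file."""
    # The temporary file must live in the same directory (hence the same
    # filesystem) as the target, otherwise the rename is not atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(path.parent), prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to the storage device
        os.replace(tmp_name, path)  # the "mv" step
    except BaseException:
        # Leave no temporary file behind when something goes wrong.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


if __name__ == "__main__":
    db = Path(tempfile.mkdtemp())
    atomic_write(db / "doc.json", b'{"version": 1}')
    atomic_write(db / "doc.json", b'{"version": 2}')
    print((db / "doc.json").read_bytes())
```

This only isolates readers from a single in-progress write.
Two concurrent writers would still overwrite each other, which is where a file-based lock or filesystem transactions would come in.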
In conclusion, no current filesystem has been designed to be used the way DODB uses it.
However, having a few million entries is fine on most filesystems.
.B "Conclusion" .
The difference between the feature sets of traditional databases and filesystems has narrowed slightly over time.
The discrepancy will always be there since they do not share the same goal, yet some features overlap.
Even though no current filesystem has been designed to be used the way DODB uses it, this kind of database system can profit from some
.dq recent
developments in the filesystem world (such as transactions).
The codebase size (and complexity) necessary to create a database system that provides acceptable performance for a small project\*[*] has shrunk drastically thanks to hardware and filesystem developments.
.FOOTNOTE1
Besides CRUD operations, a small project could involve basic relations between data, some simple transactions, a few databases (or
.I tables
in DBMS jargon) and a few thousand operations per second.
Both relations and transactions could be handled by the application, not necessarily by the database system itself.
.FOOTNOTE2
Performance is simply not a problem for most uses.
Having a directory with a few million entries is fine on modern filesystems.
The access time is slow (a few ms) only on the first access: the kernel
.B automatically
caches accessed files, after which an access takes a few dozen µs, which is virtually nothing.
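A rough way to observe the cached access time on a given machine is sketched below in Python.
The file naming scheme, the function names and the entry counts are assumptions made for this example, and the result depends entirely on the hardware and the filesystem.
Since the files have just been written, they are most likely still in the kernel cache, so this measures the cached case only, not the first access.

```python
import random
import tempfile
import time
from pathlib import Path


def populate(directory: Path, count: int) -> None:
    """Create `count` small JSON files in a single directory."""
    for i in range(count):
        (directory / f"{i:08d}.json").write_text(f'{{"id": {i}}}')


def mean_read_seconds(directory: Path, count: int, samples: int = 1000) -> float:
    """Read `samples` randomly chosen entries and return the mean duration."""
    rng = random.Random(0)  # fixed seed, so runs pick the same entries
    names = [f"{rng.randrange(count):08d}.json" for _ in range(samples)]
    start = time.perf_counter()
    for name in names:
        (directory / name).read_bytes()
    return (time.perf_counter() - start) / samples


if __name__ == "__main__":
    db = Path(tempfile.mkdtemp())
    populate(db, 10_000)
    mean = mean_read_seconds(db, 10_000)
    print(f"mean read time: {mean * 1e6:.1f} µs")
```

Raising the entry count gives a feeling for how the directory lookup scales on the filesystem in use.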
.
.
.SECTION Alternatives