enables arbitrary operations on databases: add, search, modify and delete entries.
Furthermore, SQL also enables to manage administrative operations of the databases themselves: creating databases and tables, managing users with fine-grained authorizations, etc.
For example, designing databases becomes difficult when the list of tables grows;
Unified Modeling Language (UML) is then used to provide a graphical overview of the relations between tables.
SQL databases may be fast to retrieve data despite complicated operations, but when multiple sequential operations are required they become slow because of all the back-and-forths with the application;
thus, SQL databases can be scripted to automate operations and provide a massive speed up
Writing SQL requests requires a lot of boilerplate since there is no integration in the programming languages, leading to multiple function calls for any operation on the database;
The encountered difficulties mentioned above and the original objectives of SQL not being universal\*[*], other database designs were created\*[*].
.FOOTNOTE1
To say the least!
Not everyone needs to let users access the database without going through the application.
For instance, writing a \f[I]blog\f[] for a small event or to share small stories about your life doesn't require manual operations on the database, fortunately.
.FOOTNOTE2
.FOOTNOTE1
A lot of designs won't be mentioned here.
The actual history of databases is often quite unclear since the categories of databases are sometimes vague, underspecified.
As mentioned, SQL is not a silver bullet and a lot of developers shifted towards other solutions, that's the important part.
In this way, DODB transforms any application in a database manager.
DODB doesn't provide an interactive shell, there is no request language to perform arbitrary operations on the database, no statistical optimizations of the requests based on query frequencies, etc.
Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread unix tools.
The presented code is in Crystal such as the DODB library for now, but keep in mind that this document is all about the method more that the actual implementation, anyone could implement the exact same library in almost every other language.
DODB provides basic database operations such as storing, searching, modifying and removing data.
Though, SQL databases have a few
.I properties
enabling a more standardized behavior and may create some expectations towards databases from a general public standpoint.
These properties are called "ACID": atomicity, consistency, isolation and durability.
DODB doesn't fully handle ACID properties.
DODB doesn't provide
.I atomicity .
Instructions cannot be chained and rollback if one of them fails.
DODB doesn't handle
.I consistency .
There is currently no mechanism to prevent adding invalid values.
.I Isolation
is partially taken into account with a locking mechanism preventing race conditions.
Though, parallelism is mostly required to respond to a large number of clients at the same time.
Also, SQL databases require a communication with an inherent latency between the application and the database, slowing down the requests despite the fast algorithms to search for a value within the database.
Parallelism is required for SQL databases because of this latency (at least partially), which doesn't exist with DODB\*[*].
.FOOTNOTE1
FYI, the service
.I netlib.re
uses DODB and since the database is fast enough, parallelism isn't required despite enabling more than a thousand requests per second.
.FOOTNOTE2
With a cache, data is retrieved five hundred times quicker than with a SQL database.
.I Durability
is taken into account.
Data is written on disk each time it changes.
Again, this is basic but
.SHINE "good enough"
for most applications.
.B "Discussion on ACID properties" .
The author of this document sees these database properties as a sort of "fail-safe".
Always nice to have, but not entirely necessary; at least not for every single application.
DODB will provide some form of atomicity and consistency at some point, but nothing fancy nor too advanced.
The whole point of the DODB project is to keep the code simple (almost
.B "stupidly"
simple).
Thus, managing or not these properties isn't a limitation of the DODB approach but a choice for this specific project.
Not handling all the ACID properties within the DODB library doesn't mean they cannot be achieved.
Applications can have these properties with a few lines of code.
The following experiment shows the performance of DODB based on quering durations.
Data can be searched via
.I indexes ,
as for SQL databases.
Three possible indexes exist in DODB:
(a) basic indexes, representing 1 to 1 relations, the document's attribute is related to a value and each value of this attribute is unique,
(b) partitions, representing 1 to n relations, the attribute has a value and this value can be shared by other documents,
(c) tags, representing n to n relations, enabling the attribute to have multiple values whose are shared by other documents.
The scenario is simple: adding values to a database with indexes (basic, partitions and tags) then query 100 times a value based on the different indexes.
Loop and repeat.
Four instances of DODB are tested:
.BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory) ;
.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes ;
.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*] ;
.BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
.ENDBULLET
.FOOTNOTE1
Having a cached database will probably be the most widespread use of DODB.
When memory isn't scarce, there is no point not using it to achieve better performance.
.FOOTNOTE2
The computer on which this test is performed\*[*] is a AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.When mentioned, the
.I disk
is actually a
.I "temporary file-system (tmpfs)"
to enable maximum efficiency.
.FOOTNOTE1
A very simple $50 PC, buyed online.
Nothing fancy.
.FOOTNOTE2
The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
Nonetheless, despite a few technicalities, the objective of this document is to provide an insight on the approach used in DODB more than this particular implementation.
The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
Since there is only one value to retrieve, the request is quick and time is almost constant.
When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.
.TS
allbox tab(:);
c | lw(4.0i) | cew(1.4i).
DODB instance:Comment and database usage:T{
compared to RAM only
T}
RAM only:T{
Worst memory footprint (all data must be in memory), best performance.
T}:-
Cached db and index:T{
Performance for retrieving a value is the same as RAM only while
enabling the admin to manually search for data on-disk.
T}:about the same perfs
Uncached db, cached index::300 to 400x slower
Uncached db and index:T{
Best memory footprint, worst performance.
T}:400 to 500x slower
.TE
.B Conclusion :
as expected, retrieving a single value is fast and the size of the database doesn't matter much.
Each deserialization and, more importantly, each disk access is a pain point.
Caching the value enables a massive performance gain, data can be retrieved several hundred times quicker.