enables arbitrary operations on databases: add, search, modify and delete entries.
Furthermore, SQL also covers administrative operations on the databases themselves: creating databases and tables, managing users with fine-grained authorizations, etc.
For example, designing databases becomes difficult when the list of tables grows;
Unified Modeling Language (UML) is then used to provide a graphical overview of the relations between tables.
SQL databases may be fast to retrieve data despite complicated operations, but when multiple sequential operations are required they become slow because of all the round trips with the application;
thus, SQL databases can be scripted to automate operations and provide a massive speedup;
Writing SQL requests requires a lot of boilerplate since there is no integration in the programming languages, leading to multiple function calls for any operation on the database;
Given the difficulties mentioned above, and since the original objectives of SQL are not universal\*[*], other database designs were created\*[*].
.FOOTNOTE1
To say the least!
Not everyone needs to let users access the database without going through the application.
For instance, writing a \f[I]blog\f[] for a small event or to share small stories about your life doesn't require manual operations on the database, fortunately.
.FOOTNOTE2
.FOOTNOTE1
A lot of designs won't be mentioned here.
The actual history of databases is often quite unclear since the categories of databases are sometimes vague or underspecified.
As mentioned, SQL is not a silver bullet and a lot of developers shifted towards other solutions; that is the important part.
In this way, DODB transforms any application into a database manager.
DODB provides neither an interactive shell, nor a request language to perform arbitrary operations on the database, nor statistical optimizations of the requests based on query frequencies, etc.
Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread Unix tools.
The presented code is in Crystal, as is the DODB library for now, but keep in mind that this document is about the method more than the actual implementation; anyone could implement the exact same library in almost any other language.
DODB presents a few possible indexes (basic indexes, partitions and tags) which respond to an obvious need for fast searches.
Though, their implementation through symlinks stems from a certain vision of how a database should behave in order to give users a practical way to sort the entries.
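The idea of an index made of symlinks can be sketched in a few lines of Crystal.
This is only an illustration under stated assumptions: the directory layout, the file names and the sample entry are hypothetical and may differ from what the DODB library actually writes.

```crystal
require "file_utils"

# Hypothetical layout, for illustration only: the DODB library may
# organize its directories differently.
root = File.join(Dir.tempdir, "dodb-symlink-sketch-#{Process.pid}")
data_dir = File.join(root, "data")
index_dir = File.join(root, "indexes", "by_name")
Dir.mkdir_p(data_dir)
Dir.mkdir_p(index_dir)

# Store one entry as a plain file named after its numeric key.
entry_path = File.join(data_dir, "0000000001")
File.write(entry_path, %({"name": "Corvet", "color": "red"}))

# Index the entry by its "name" attribute: the index is a symlink whose
# name is the attribute value and whose target is the data file.
link_path = File.join(index_dir, "Corvet")
File.symlink(entry_path, link_path)

# A lookup by name is a single file read through the symlink.
found = File.read(link_path)
puts found

# Remove the temporary directory created for this sketch.
FileUtils.rm_rf(root)
```

With such a layout, listing the index directory with standard Unix tools is enough to see every indexed value.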
The file-system representation (of data and indexes) is convenient for the administrator, but input-output operations on a file-system are slow.
Storing the data on a storage device is required to protect it from crashes and application restarts.
But data can be kept in memory for faster processing of requests.
The DODB library has an API close to a hash table.
Having a data cache is as simple as keeping a hash table in memory alongside the file-system storage; retrieval then becomes incredibly fast\*[*].
.FOOTNOTE1
Several hundred times faster, see the experiment section.
.FOOTNOTE2
The same goes for indexes, which can easily be cached with simple hash tables.
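A minimal sketch of such a data cache follows, assuming string values and integer keys.
The class name, its methods and the file naming scheme are hypothetical; this is not the DODB implementation, only the hash-table-beside-files idea.

```crystal
require "file_utils"

# Hypothetical class, not the DODB implementation: a file-backed store
# with a hash table kept in memory next to it.
class CachedStore
  def initialize(@directory : String)
    @cache = Hash(Int32, String).new
    Dir.mkdir_p(@directory)
  end

  # Writes go to both the file-system (durability) and the hash table.
  def []=(key : Int32, value : String)
    File.write(path_of(key), value)
    @cache[key] = value
  end

  # Reads are served from memory when possible, from the file otherwise.
  def []?(key : Int32) : String?
    if cached = @cache[key]?
      return cached
    end
    file = path_of(key)
    return nil unless File.exists?(file)
    @cache[key] = File.read(file)
  end

  private def path_of(key : Int32) : String
    File.join(@directory, key.to_s.rjust(10, '0'))
  end
end

dir = File.join(Dir.tempdir, "dodb-cache-sketch-#{Process.pid}")
store = CachedStore.new(dir)
store[1] = "first entry"
puts store[1]? # served from the hash table

# A new instance on the same directory starts with an empty cache and
# falls back to the file, as after an application restart.
reopened = CachedStore.new(dir)
puts reopened[1]?

FileUtils.rm_rf(dir)
```

Every write still reaches the storage device, so the cache only changes the cost of reads.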
.
.
.SS Cached database
A cached database has the same API as the other DODB databases.
Databases are built around the objective to actually
.I store
data.
But sometimes the data has only the same lifetime as the application.
Stop the application and the data itself becomes irrelevant, which happens on several occasions, for instance when the application keeps track of the connected users.
Traditional databases do not cover this case, which is out of their scope: short-lived data is handled only within the application.
Yet, since DODB is a library and not a separate application (read: DODB is considerably faster), this usage of the database can be relevant.
Having the same API to handle both long and short-lived data can be useful.
Moreover, the previously mentioned indexes (basic indexes, partitions and tags) would also work the same way for this short-lived data.
Of course, in this case, the file-system representation may be completely irrelevant.
And for all these reasons, the
.I RAM-only
DODB database and
.I RAM-only
indexes were created.
Let's recap the advantages of the RAM-only DODB database.
The DODB API is the same for short-lived (read: temporary) and long-lived data.
This includes the same indexes too, so a file-system representation of the current state of the application is possible.
RAM-only also means excellent performance since DODB is only a
.I very
small layer over a hash table.
.SS RAM-only database
Instantiating a RAM-only database is as simple as the other options.
Moreover, this database has exactly the same API as the others, thus changing from one to another is painless.
A path is still required despite the database being only in memory because indexes can still be instantiated for the database, and those indexes will require this directory.
As with the database API itself, changing from one version of an index to another is painless.
This way, one can opt for a cached index and, after some time not using the file-system representation, decide to switch to its RAM-only version; a 4-character modification and nothing else.
.
.
.
.SECTION DODB and memory constraint
In contrast with the previous section, some environments have a memory constraint.
For example, in case the database is larger than the available memory, it won't be possible to use a data cache\*[*].
.FOOTNOTE1
Keep in mind that for the moment "cached database" means "all data in memory".
It is perfectly reasonable to have a cached database with a policy of keeping just a certain amount of values in memory, in order to limit the memory required by selecting the relevant values to keep in cache (the most recently used, for example).
function returns an empty list in case the search failed.
.br
The implementation was designed to be simple (7 lines of code), not efficient.
However, with data and index caches, the search is expected to meet almost everyone's requirements, speed-wise, provided that the tags are small enough (a few thousand entries).
DODB provides basic database operations such as storing, searching, modifying and removing data.
Though, SQL databases have a few
.I properties
enabling a more standardized behavior, which may create expectations about databases in general.
These properties are called "ACID": atomicity, consistency, isolation and durability.
DODB doesn't fully handle ACID properties.
DODB doesn't provide
.I atomicity .
Instructions cannot be chained and rolled back if one of them fails.
DODB doesn't handle
.I consistency .
There is currently no mechanism to prevent adding invalid values.
.I Isolation
is partially taken into account with a locking mechanism preventing race conditions.
Though, parallelism is mostly required to respond to a large number of clients at the same time.
Also, SQL databases require a communication with an inherent latency between the application and the database, slowing down the requests despite the fast algorithms to search for a value within the database.
Parallelism is required for SQL databases because of this latency (at least partially), which doesn't exist with DODB\*[*].
.FOOTNOTE1
FYI, the service
.I netlib.re
uses DODB and, since the database is fast enough, parallelism isn't required while still allowing more than a thousand requests per second.
.FOOTNOTE2
With a cache, data is retrieved five hundred times quicker than with a SQL database.
(a) basic indexes, representing 1 to 1 relations, the document's attribute is related to a value and each value of this attribute is unique,
(b) partitions, representing 1 to n relations, the attribute has a value and this value can be shared by other documents,
(c) tags, representing n to n relations, enabling the attribute to have multiple values, each of which can be shared by other documents.
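These three kinds of relations can be pictured as three shapes of hash tables.
The sketch below is not DODB code: the data type, its attributes and the sample values are hypothetical, chosen only to show the difference between 1 to 1, 1 to n and n to n relations.

```crystal
# Hypothetical data type, for illustration only.
record Car, name : String, color : String, keywords : Array(String)

cars = [
  Car.new("Corvet", "red", ["shiny", "fast"]),
  Car.new("Bullet-GT", "red", ["fast", "expensive"]),
  Car.new("Deudeuche", "beige", ["elegant"]),
]

# (a) basic index, 1 to 1: each name designates exactly one car.
by_name = Hash(String, Car).new

# (b) partition, 1 to n: a color can be shared by several cars.
by_color = Hash(String, Array(Car)).new { |hash, key| hash[key] = [] of Car }

# (c) tags, n to n: a car has several keywords, each shared by several cars.
by_keyword = Hash(String, Array(Car)).new { |hash, key| hash[key] = [] of Car }

cars.each do |car|
  # A basic index requires the attribute value to be unique.
  raise "duplicate name: #{car.name}" if by_name.has_key?(car.name)
  by_name[car.name] = car

  by_color[car.color] << car

  car.keywords.each do |keyword|
    by_keyword[keyword] << car
  end
end

puts by_name["Corvet"].color
puts by_color["red"].map(&.name).join(", ")
puts by_keyword["fast"].map(&.name).join(", ")
```

A basic index returns a single document, while partitions and tags return a list of documents.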
The scenario is simple: adding values to a database with indexes (basic, partitions and tags), then querying a value 100 times based on the different indexes.
.BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
.ENDBULLET
.FOOTNOTE1
Having a cached database will probably be the most widespread use of DODB.
When memory isn't scarce, there is no point not using it to achieve better performance.
.FOOTNOTE2
The computer on which this test is performed\*[*] is an AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
When mentioned, the
.I disk
is actually a
.I "temporary file-system (tmpfs)"
to enable maximum efficiency.
.FOOTNOTE1
A very simple $50 PC, bought online.
Nothing fancy.
.FOOTNOTE2
The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
Nonetheless, despite a few technicalities, the objective of this document is to provide an insight on the approach used in DODB more than this particular implementation.
The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
Since there is only one value to retrieve, the request is quick and time is almost constant.
When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.
Right now, security isn't managed in DODB, at all.
Sure, DODB isn't vulnerable to SQL injections, but an internet-facing application may encounter a few other problems including, but not limited to, code injection, buffer overflows, etc.
Of course, DODB isn't a mechanism to protect applications from any possible attack, so most of the vulnerabilities cannot be countered by the library.
However, a few security mechanisms exist to prevent data leaks or data modification by an outsider, and the DODB library should implement some of them in the future.
.B "Preventing data leak" .
Since DODB is a library, any attack on the application using it can lead to a data leak.
For the moment, any part of the application can access data stored in memory.
Operating systems provide system calls to protect parts of the allocated memory:
.FUNCTION_CALL mlock
prevents a region of memory from being put in the swap, and
.FUNCTION_CALL mprotect
sets the access permissions (read, write, execute) of a region of memory.
.B "Discussion on security, not related to DODB" .
No authorization mechanism prevents the application from accessing unauthorized data, including, but not limited to, any file on the file-system.