dodb.cr/graphs/graphs.ms

.so macros.roff
.TITLE Document Oriented DataBase (DODB)
.AUTHOR Philippe P.
.ABSTRACT1
DODB is a database-as-library, enabling a very simple way to store applications' data: storing serialized
.I documents
(basically any data type) in plain files.
To speed-up searches, attributes of these documents can be used as indexes which leads to create a few symbolic links
.I symlinks ) (
on the disk.
.br

This document briefly presents DODB and its main differences with other database engines.
An experiment is described and analysed to understand the performance that can be expected from this approach.
.ABSTRACT2
.SECTION Introduction to DODB
A database consists in managing data, enabling queries (preferably fast) to retrieve, to modify, to add and to delete a piece of information.
Anything else is
.UL accessory .

Universities all around the world teach about Structured Query Language (SQL) and relational databases.

The main idea of relational databases is to put data into
.I tables ,
with typed columns so the database can optimize operations and storage.
A database is a list of tables with relations between them.
For example, let's imagine a database of a
.I table
can contain a list of users (their age, height, job, etc.).
When another

The SQL language enables arbitrary operations on databases: add, modify and delete entries.
Furthermore, SQL enables even to manage administrative operations of the databases themselves: managing users with fine-grained authorizations, creating databases and tables, etc.

Many tools were used or even developed over the years specifically to aleviate the inherent complexity and limitations of SQL.
For example, Unified Modeling Language (UML) is used to design databases by providing a graphical overview of the relations between tables.
SQL databases can be scripted to automate operations and provide a massive speed up to the operations (
.I "stored procedures" ,
see
.I "PL/SQL" ),
etc.

Document-oriented databases are key-value stores.
Furthermore, metadata is extracted for further optimization.

Contrary to SQL, DODB has a very narrow scope: to provide
Thus, DODB doesn't provide an interactive shell, no request language to perform arbitrary operations on the database, etc.

.SECTION Basic usage
.SECTION A few more options
.SECTION Limits of DODB
.SECTION Experimental scenario
.LP
The following experiment shows the performance of DODB based on quering durations.
Data can be searched via
.I indexes ,
as for SQL databases.
Three possible indexes exist in DODB:
(a) basic indexes, representing 1 to 1 relations, the document's attribute is related to a value and each value of this attribute is unique,
(b) partitions, representing 1 to n relations, the attribute has a value and this value can be shared by other documents,
(c) tags, representing n to n relations, enabling the attribute to have multiple values whose are shared by other documents.

The scenario is simple: adding values to a database with indexes (basic, partitions and tags) then query 100 times a value based on the different indexes.
Loop and repeat.

Four instances of DODB are tested:
.BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory) ;
.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes ;
.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*] ;
.BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
.ENDBULLET
.FOOTNOTE1
Having a cached database will probably be the most widespread use of DODB.
When memory isn't scarce, there is no point not using it to achieve better performance.
.FOOTNOTE2

The computer on which this test is performed\*[*] is a AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.When mentioned, the
.I disk
is actually a
.I "temporary file-system (tmpfs)"
to enable maximum efficiency.
.FOOTNOTE1
A very simple $50 PC, buyed online.
Nothing fancy.
.FOOTNOTE2

The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
Nonetheless, despite a few technicalities, the objective of this document is to provide an insight on the approach used in DODB more than this particular implementation.

The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
.SOURCE Ruby ps=9 vs=9p
class Car
	property name     : String        # 1-1 relation
	property color    : String        # 1-n relation
	property keywords : Array(String) # n-n relation
end
.SOURCE
.
.SS Basic indexes (1 to 1 relations)
.LP
An index enables to match a single value based on a small string.
In our example, each \f[CW]car\f[] has an unique \fIname\f[] which is used as an index.

The following graph represents the result of 100 queries of a car based on its name.
The experiment starts with a database containing 1,000 cars and goes up to 250,000 cars.

.so graph_query_index.grap

Since there is only one value to retrieve, the request is quick and time is almost constant.
When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.

.TS
allbox tab(:);
c | lw(4.0i) | cew(1.4i).
DODB instance:Comment and database usage:T{
compared to RAM only
T}
RAM only:T{
Worst memory footprint (all data must be in memory), best performance.
T}:-
Cached db and index:T{
Performance for retrieving a value is the same as RAM only while
enabling the admin to manually search for data on-disk.
T}:about the same perfs
Uncached db, cached index::300 to 400x slower
Uncached db and index:T{
Best memory footprint, worst performance.
T}:400 to 500x slower
.TE

.B Conclusion :
as expected, retrieving a single value is fast and the size of the database doesn't matter much.
Each deserialization and, more importantly, each disk access is a pain point.
Caching the value enables a massive performance gain, data can be retrieved several hundred times quicker.
.bp
.SS Partitions (1 to n relations)
.LP

.so graph_query_partition.grap

.bp
.SS Tags (n to n relations)
.LP
.so graph_query_tag.grap