.so macros.roff
.de TREE1
.QP
.ps -2
.KS
.ft CW
.b1
.nf
..
.de TREE2
.ft
.fi
.b2
.ps
.KE
.QE
..
.
.de COMMAND
.I \\$*
..
.de DIRECTORY
.I \\$*
..
.
. \" The document starts here.
.
.TITLE Document Oriented DataBase (DODB)
.AUTHOR Philippe P.
.ABSTRACT1
DODB is a database-as-library, enabling a very simple way to store applications' data: storing serialized
.I documents
(basically any data type) in plain files.
To speed up searches, attributes of these documents can be used as indexes, which leads to the creation of a few symbolic links
.I symlinks ) (
on the disk.

This document briefly presents DODB and its main differences with other database engines.
An experiment is described and analysed to understand the performance that can be expected from this approach.
.ABSTRACT2
.SINGLE_COLUMN
.SECTION Introduction to DODB
A database manages data: it enables queries (preferably fast ones) to retrieve, modify, add and delete pieces of information.
Anything else is
.UL accessory .

Universities all around the world teach about Structured Query Language (SQL) and relational databases.
.
.de PRIMARY_KEY
.I \\$1 \\$2 \\$3
..
.de FOREIGN_KEY
.I \\$1 \\$2 \\$3
..

.UL "Relational databases"
are built around the idea of putting data into
.I tables ,
with typed columns so the database can optimize operations and storage.
A database is a list of tables with relations between them.
For example, let's imagine a database of a movie theater.
The database will have a
.I table
for the list of movies they have
.PRIMARY_KEY idmovie , (
title, duration, synopsis),
a table for the scheduling
.PRIMARY_KEY idschedule , (
.FOREIGN_KEY idmovie ,
.FOREIGN_KEY idroom ,
time slot),
a table for the rooms
.PRIMARY_KEY idroom , (
name), etc.
Tables have relations, for example the table "scheduling" has a column
.I idmovie
which points to entries in the "movie" table.

.UL "The SQL language"
enables arbitrary operations on databases: add, search, modify and delete entries.
Furthermore, SQL also covers the administration of the databases themselves: creating databases and tables, managing users with fine-grained authorizations, etc.
SQL is used between the application and the database, to perform operations and to provide results when due.
SQL is also used
.UL outside
the application, by admins for managing databases and potentially by some
.I non-developer
users to retrieve some data without a dedicated interface\*[*].
.FOOTNOTE1
One of the first objectives of SQL was to enable a class of
.I non-developer
users to talk directly to the database so they can access the data without bothering the developers.
This has value for many companies and organizations.
.FOOTNOTE2

Many tools were used or even developed over the years specifically to alleviate the inherent complexity and limitations of SQL.
For example, designing databases becomes difficult when the list of tables grows;
Unified Modeling Language (UML) is then used to provide a graphical overview of the relations between tables.
SQL databases may be fast to retrieve data despite complicated operations, but when multiple sequential operations are required they become slow because of all the back-and-forths with the application;
thus, SQL databases can be scripted to automate operations and provide a massive speed up
.I "stored procedures" , (
see
.I "PL/SQL" ).
Writing SQL requests requires a lot of boilerplate since there is no integration in the programming languages, leading to multiple function calls for any operation on the database;
thus, object-relational mapping (ORM) libraries were created to reduce the massive code duplication.
And so on.

For many reasons, SQL is not a silver bullet to
.I solve
the database problem.
Given the difficulties mentioned above, and since the original objectives of SQL are not universal\*[*], other database designs were created\*[*].
.FOOTNOTE1
To say the least!
Not everyone needs to let users access the database without going through the application.
For instance, writing a \f[I]blog\f[] for a small event or to share small stories about your life doesn't require manual operations on the database, fortunately.
.FOOTNOTE2
.FOOTNOTE1
A lot of designs won't be mentioned here.
The actual history of databases is often quite unclear since the categories of databases are sometimes vague, underspecified.
As mentioned, SQL is not a silver bullet and many developers have shifted towards other solutions; that is the important part.
.FOOTNOTE2
The NoSQL movement started because the stated goals of many actors of the early Web boom differed from those of SQL.
The need for very fast operations far exceeded what was practical at the moment with SQL.
This led to the use of more basic methods to manage data such as
.I "key-value stores" ,
which simply associate a value with an
.I index
for fast retrieval.
In this case, there is no need for the database to have
.I tables ,
data may be untyped, the entries may even have different attributes.
Since homogeneity is not necessary anymore, databases have fewer (or different) constraints.
Document-oriented databases are a sub-class of key-value stores, where metadata can be extracted from the entries for further optimizations.
And that's exactly what is being done in Document Oriented DataBase (DODB).

.UL "Contrary to SQL" ,
DODB has a very narrow scope: to provide a library to store, retrieve, modify and delete data.
In this way, DODB transforms any application into a database manager.
DODB doesn't provide an interactive shell, there is no request language to perform arbitrary operations on the database, no statistical optimizations of the requests based on query frequencies, etc.
Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread unix tools.
Simplicity is key.

.UL "Contrary to other NoSQL databases" ,
DODB doesn't provide an application but a library, nothing else.
The idea is to help developers to store their data themselves, not depending on
.I yet-another-all-in-one
massive tool.
The library writes (and removes) data on a storage device, has a few retrieval and update mechanisms and that's it\*[*].
.FOOTNOTE1
The lack of features
.I is
the feature.
Even with that motto, the tool still is expected to be convenient for most applications.
.FOOTNOTE2

This document provides extensive documentation on how DODB works and how to use it.
The presented code is in Crystal, as is the DODB library for now, but keep in mind that this document is about the method more than the actual implementation: anyone could implement the exact same library in almost any other language.
Limitations are also clearly stated in a dedicated section.
A few experiments are described to provide an overview of the performance you can expect from this approach.
Finally, a conclusion is drawn based on a real-world usage of this library.
.
.SECTION How DODB works and basic usage
DODB is a hash table.
The key of the hash is an auto-incremented number and the value is the stored data.
The following section explains how to use DODB for basic cases, including the few added mechanisms to speed up searches.
Also, the file-system representation of the data will be presented since it enables easy off-application searches.
.
.
.SS Before starting: the example database
First things first, the following code is the structure used in the rest of the document to present the different aspects of DODB.
This is a simple object
.I Car ,
with a name, a color and a list of associated keywords (fast, elegant, etc.).
.SOURCE Ruby ps=10
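require "json"

class Car
	include JSON::Serializable # serialization is currently JSON
	property name     : String
	property color    : String
	property keywords : Array(String)
	def initialize(@name, @color, @keywords)
	end
end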
.SOURCE
.
.
.SS DODB basic usage
Let's create a DODB database for our cars.
.SOURCE Ruby ps=10
# Database creation
database = DODB::DataBase(Car).new "path/to/db-cars"

# Adding an element to the db
database << Car.new "Corvet", "red", ["elegant", "fast"]

# Reaching all objects in the database
database.each do |car|
	pp! car
end
.SOURCE
When a value is added, it is serialized\*[*] and written in a dedicated file.
.FOOTNOTE1
Serialization is currently in JSON.
CBOR is a work-in-progress.
Nothing binds DODB to a particular format.
.FOOTNOTE2
The key of the hash is a number, auto-incremented, used as the name of the stored file.
The following example shows the content of the file system after adding the first car.
.TREE1
$ tree db-cars/
db-cars/
|-- data
|   `-- 0000000000   <- the first car in the database
`-- last-index
.TREE2
In this example, the directory
.I db-cars/data
contains the serialized value, with a formatted number as file name.
The file "0000000000" contains the following:
.QP
.SOURCE JSON ps=10
{
  "name": "Corvet",
  "color": "red",
  "keywords": [
    "elegant",
    "fast"
  ]
}
.SOURCE
The car is serialized as expected in the file
.I 0000000000 .
.QE
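The mapping between a key and its file name can be sketched as follows.
This is an illustration only: the ten-digit, zero-padded format is inferred from the listing above, and
\f[CW]data_path\f[]
is a hypothetical helper, not a function of the DODB library.
.QP
.SOURCE Ruby ps=10
# Hypothetical helper (not part of DODB): path of the data file for a key,
# assuming a 10-digit, zero-padded file name as in the listing above.
def data_path(db_path : String, key : Int32) : String
	File.join(db_path, "data", "%010d" % key)
end

puts data_path("db-cars", 0)  # db-cars/data/0000000000
puts data_path("db-cars", 42) # db-cars/data/0000000042
.SOURCE
.QE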
.de FUNCTION_CALL
.I \\$*
..
.
.
Next step, to retrieve, to modify or to delete a value, its key will be required.
.
.QP
.SOURCE Ruby ps=10
# Get a value based on its key.
database[key]

# Update a value based on its key.
database[key] = new_value

# Delete a value based on its key.
database.delete key
.SOURCE
.QE
.
The function
.FUNCTION_CALL each_with_index
lists the entries with their keys.
.
.QP
.SOURCE Ruby ps=10
database.each_with_index do |value, key|
	puts "#{key}: #{value}"
end
.SOURCE
.QE
Of course, browsing the entire database to find a value (or its key) is a waste of resources and isn't practical for any non-trivial database.
That is when indexes come into play.
.
.
.SS Indexes
Entries can be
.I indexed
based on their attributes.
There are currently three main ways to search for a value by its attributes: basic indexes, partitions and tags.
.
.SSS Basic indexes (1 to 1 relations)
Basic indexes represent one-to-one relations, much like a unique index in SQL.
In the Car database, each car has a dedicated (unique) name.
This
.I name
attribute can be used to speed up searches.
.QP
.SOURCE Ruby ps=10
# Create an index based on the "name" attribute of the cars.
cars_by_name = database.new_index "name" do |car|
	car.name
end
.SOURCE
Once the index has been created, every added or modified entry in the database will be indexed.
Adding an index (basic index, partition or tag) provides an object used to manipulate the database based on this index.
Let's call it an
.I "index object" .
.QE
.
The
.I "index object"
has several useful functions.
.QP
.SOURCE Ruby ps=10
# Retrieve the car named "Corvet".
corvet = cars_by_name.get? "Corvet"

# Modify the car named "Corvet".
new_car = Car.new "Corvet", "green", ["eco-friendly"]
cars_by_name.update "Corvet", new_car

# In case the index hasn't changed (the name attribute in this example),
# the update can be even simpler.
new_car = Car.new "Corvet", "green", ["eco-friendly"]
cars_by_name.update new_car

# Delete the car named "Corvet".
cars_by_name.delete "Corvet"
.SOURCE
A car can now be searched, modified or deleted based on its name.
.QE
.
.
On the file-system, indexes are represented as symbolic links.
.TREE1
storage
+-- data
|    `-- 0000000000   <- the car named "Corvet"
`-- indexes
      `-- by_name
          `-- Corvet -> ../../data/0000000000
.TREE2
.QP
As shown, the file "Corvet" is a symbolic link to a data file.
The name of the symlink has been extracted from the value itself, so all the cars and their names can be listed with a simple
.COMMAND ls
in the
.DIRECTORY storage/indexes/by_name/
directory.
.QE
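What happens on disk for a basic index can be reproduced by hand.
The following sketch uses only the Crystal standard library and a temporary directory.
The layout mirrors the listing above, but this is not the actual code of the library, and the file content is a shortened placeholder.
.QP
.SOURCE Ruby ps=10
require "file_utils"

# Hypothetical layout mirroring the listing above; not DODB's actual code.
root = File.join(Dir.tempdir, "dodb-index-sketch-#{Process.pid}")
Dir.mkdir_p File.join(root, "data")
Dir.mkdir_p File.join(root, "indexes", "by_name")

# Store a serialized value under its key.
File.write(File.join(root, "data", "0000000000"), %({"name":"Corvet","color":"red"}))

# Index it: the symlink is named after the attribute and points to the data file.
File.symlink "../../data/0000000000", File.join(root, "indexes", "by_name", "Corvet")

# Searching by name is now a single path lookup, without scanning the data directory.
content = File.read(File.join(root, "indexes", "by_name", "Corvet"))
puts content

FileUtils.rm_rf root
.SOURCE
.QE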
.
The basic indexes as shown in this section already give a taste of what is possible to do with DODB.
The following indexes will cover some other usual cases.
.
.
.SSS Partitions (1 to n relations)
An attribute can have a value that is shared by other entries in the database, such as the
.I color
attribute of our cars.

.SOURCE Ruby ps=10
# Create a partition based on the "color" attribute of the cars.
cars_by_color = database.new_partition "color" do |car|
	car.color
end
.SOURCE
As with basic indexes, once the partition has been requested from the database, every new or modified entry will be indexed.

.KS
Let's imagine having 3 cars, one is blue and the other two are red.
.TREE1
$ tree db-cars/
db-cars
+-- data
|    +-- 0000000000   <- this car is blue
|    +-- 0000000001   <- this car is red
|    `-- 0000000002   <- this car is red, too
|   ...
`-- partitions
      `-- by_color
           +-- blue
           |    `-- 0000000000 -> 0000000000
           `-- red
                +-- 0000000001 -> 0000000001
                `-- 0000000002 -> 0000000002
.TREE2
.QP
Listing all the blue cars is as simple as a
.COMMAND ls
in the
.DIRECTORY db-cars/partitions/by_color/blue
directory!
.QE
.KE
.
.
.
.SSS Tags (n to n relations)
Tags are basically partitions but the attribute can have multiple values.

.SOURCE Ruby ps=10
# Create a tag based on the "keywords" attribute of the cars.
cars_by_keywords = database.new_tags "keywords" do |car|
	car.keywords
end
.SOURCE
As with other indexes, once the tag has been requested from the database, every new or modified entry will be indexed.
.
.
.KS
Let's imagine having two cars with different associated keywords.
.TREE1
$ tree db-cars/
db-cars
+-- data
|    +-- 0000000000   <- this car is fast and cheap
|    `-- 0000000001   <- this car is fast and elegant
`-- tags
      `-- by_keywords
           +-- cheap
           |    `-- 0000000000 -> 0000000000
           +-- elegant
           |    `-- 0000000001 -> 0000000001
           `-- fast
                +-- 0000000000 -> 0000000000
                `-- 0000000001 -> 0000000001
.TREE2
.QP
Listing all the fast cars is as simple as a
.COMMAND ls
in the
.DIRECTORY db-cars/tags/by_keywords/fast
directory!
.QE
.KE
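The n to n relation can be sketched with the Crystal standard library alone: a value is linked from one directory per keyword.
The directory names follow the
.DIRECTORY tags/by_keywords
path mentioned above, while the relative link targets and the placeholder file content are assumptions made for this illustration.
.QP
.SOURCE Ruby ps=10
require "file_utils"

# Hypothetical layout; the relative link targets are assumptions.
root = File.join(Dir.tempdir, "dodb-tags-sketch-#{Process.pid}")
keywords = {"0000000000" => ["fast", "cheap"], "0000000001" => ["fast", "elegant"]}

Dir.mkdir_p File.join(root, "data")
keywords.each do |file, tags|
	File.write(File.join(root, "data", file), "{}") # placeholder content
	tags.each do |tag|
		dir = File.join(root, "tags", "by_keywords", tag)
		Dir.mkdir_p dir
		# One symlink per (tag, value) pair: a value appears under several tags.
		File.symlink "../../../data/#{file}", File.join(dir, file)
	end
end

# Listing the "fast" directory gives the keys of all the fast cars.
fast = Dir.children(File.join(root, "tags", "by_keywords", "fast")).sort
p fast # ["0000000000", "0000000001"]

FileUtils.rm_rf root
.SOURCE
.QE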
.
.
.
.SSS Side note about indexes
DODB presents a few possible indexes (basic indexes, partitions and tags) which respond to an obvious need for fast searches.
However, their implementation via symlinks results from a certain vision of how a database should behave: it gives users a practical way to sort and browse the entries.
The implementation can be completely changed.

Also, other kinds of indexes could
.B easily
be implemented in addition to those presented.
The new indexes may have completely different objectives than providing a file-system representation of the data.
The following sections will precisely cover this aspect.
.
.
.SECTION DODB, slow? Nope. Let's talk about caches
The file-system representation (of data and indexes) is convenient for the administrator, but input-output operations on a file-system are slow.
Storing the data on a storage device is required to protect it from crashes and application restarts.
But data can be kept in memory for faster processing of requests.

The DODB library has an API close to a hash table.
Having a data cache is as simple as keeping a hash table in memory alongside the file-system storage; retrieval then becomes incredibly fast\*[*].
.FOOTNOTE1
Several hundred times faster, see the experiment section.
.FOOTNOTE2
Same thing for cached indexes.
Indexes can easily be cached, thanks to simple hash tables.
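As an illustration, cached indexes can be pictured as plain hashes from an attribute value to one or several keys.
The sketch below is hypothetical: the variable names and the sample cars are made up, and the internals of the library may differ.
.QP
.SOURCE Ruby ps=10
# Hypothetical in-memory indexes (illustrative names, not DODB's internals).
# Each index only stores keys, never the values themselves.
by_name  = Hash(String, Int32).new        # basic index: one key per name
by_color = Hash(String, Array(Int32)).new # partition: several keys per color

cars = {0 => {"Corvet", "red"}, 1 => {"Bullet", "blue"}, 2 => {"Spark", "red"}}
cars.each do |key, (name, color)|
	by_name[name] = key
	(by_color[color] ||= [] of Int32) << key
end

p by_name["Corvet"] # 0
p by_color["red"]   # [0, 2]
.SOURCE
.QE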
.
.
.SS Cached database
A cached database has the same API as the other DODB databases.
.QP
.SOURCE Ruby ps=10
# Create a cached database
database = DODB::CachedDataBase(Car).new "path/to/db-cars"
.SOURCE
All operations of the
.I DODB::DataBase
class are available for
.I DODB::CachedDataBase .
.QE
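The principle behind a cached database can be sketched as a write-through cache: each write goes both to a file and to a hash, and reads are served from the hash.
The class
\f[CW]CachedStore\f[]
below is hypothetical and only uses the Crystal standard library; it is not the implementation of
.I DODB::CachedDataBase .
In this sketch the cache starts empty, so nothing is reloaded from the disk at startup.
.QP
.SOURCE Ruby ps=10
require "file_utils"

# Hypothetical write-through cache (not DODB's implementation):
# writes go to the disk and to a hash, reads are served from the hash.
class CachedStore
	def initialize(@dir : String)
		Dir.mkdir_p @dir
		@cache = Hash(Int32, String).new
	end

	def []=(key : Int32, value : String)
		File.write(File.join(@dir, "%010d" % key), value) # kept across restarts
		@cache[key] = value                               # serves later reads
	end

	def [](key : Int32) : String
		@cache[key]
	end
end

dir = File.join(Dir.tempdir, "dodb-cache-sketch-#{Process.pid}")
store = CachedStore.new dir
store[0] = "Corvet"
from_memory = store[0]
from_disk = File.read(File.join(dir, "0000000000"))
puts from_memory # Corvet
puts from_disk   # Corvet
FileUtils.rm_rf dir
.SOURCE
.QE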
.
.SS Cached indexes
Since indexes do not require nearly as much memory as caching the entire database, they are cached by default.
.
.
.SECTION RAM-only database for short-lived data
Databases are built around the objective to actually
.I store
data.
But sometimes the data has only the same lifetime as the application.
Stop the application and the data itself becomes irrelevant, which happens on several occasions, for instance when the application keeps track of the connected users.
This case is not covered by traditional databases; it is out of their scope, and short-lived data is handled within the application only.
Yet, since DODB is a library and not a separate application (read: DODB is much faster), this usage of the database can be relevant.
Having the same API to handle both long and short-lived data can be useful.
Moreover, the previously mentioned indexes (basic indexes, partitions and tags) would also work the same way for these short-lived data.
Of course, in this case, the file-system representation may be completely irrelevant.
And for all these reasons, the
.I RAM-only
DODB database and
.I RAM-only
indexes were created.

Let's recap the advantages of the RAM-only DODB database.
The DODB API is the same for short-lived (read: temporary) and long-lived data.
This includes the same indexes too, so a file-system representation of the current state of the application is possible.
RAM-only also means incredible performance since DODB is only a
.I very
small layer over a hash table.
.SS RAM-only database
Instantiating a RAM-only database is as simple as the other options.
Moreover, this database has exactly the same API as the others, thus changing from one to another is painless.
.QP
.SOURCE Ruby ps=10
# RAM-only database creation
database = DODB::RAMOnlyDataBase(Car).new "path/to/db-cars"
.SOURCE
Yes, the path is still required, which may be seen as a quirk, but the rationale\*[*] is sound.
.QE
.FOOTNOTE1
A path is still required despite the database being only in memory, for two reasons.
First, indexes can still be instantiated for the database, and those indexes can provide a file-system representation of the data.
Second, I worked enough already, leave me alone.
.FOOTNOTE2
.SS RAM-only indexes
All indexes have their RAM-only counterpart.
.QP
.SOURCE Ruby ps=10
# RAM-only basic indexes.
cars_by_name = database.new_RAM_index "name", &.name

# RAM-only partitions.
cars_by_color = database.new_RAM_partition "color", &.color

# RAM-only tags.
cars_by_keywords = database.new_RAM_tags "keywords", &.keywords
.SOURCE
The API of the
.I "RAM-only index objects"
is exactly the same as the others.
.QE
As for the database API itself, changing from a version of an index to another is painless.
This way, one can opt for a cached index and, after some time not using the file-system representation, decide to change for its RAM-only version; a 4-character modification and nothing else.
.
.
.
.SECTION DODB and memory constraint
In contrast with the previous section, some environments have a memory constraint.
For example, in case the database is larger than the available memory, it won't be possible to use a data cache\*[*].
.FOOTNOTE1
Keep in mind that for the moment "cached database" means "all data in memory".
It is perfectly reasonable to have a cached database with a policy of keeping just a certain amount of values in memory, in order to limit the memory required by selecting the relevant values to keep in cache (the most recently used, for example).
But for now, the cached version keeps everything.
See the "Future work" section.
.FOOTNOTE2
.
.SS Uncached database
By default, the database (provided by
.I "DODB::DataBase" )
isn't cached.
.
.SS Uncached indexes
Cached indexes do not require a large amount of memory since the only stored data is an integer (the
.I key
of the data).
For that reason, indexes are cached by default.
But for highly memory-constrained environments, the cache can be removed.
.QP
.SOURCE Ruby ps=10
# Uncached basic indexes.
cars_by_name = database.new_uncached_index "name", &.name

# Uncached partitions.
cars_by_color = database.new_uncached_partition "color", &.color

# Uncached tags.
cars_by_keywords = database.new_uncached_tags "keywords", &.keywords
.SOURCE
The API of the
.I "uncached index objects"
is exactly the same as the others.
.QE
.
.
.
.SECTION Recap of the DODB API
.TBD
.SS Database creation
.SS Database update and deletion with the key
.SS Indexes creation
.SS Database update and deletion with an index
.SSS Tags: specific functions
.
.
.
.SECTION Limits of DODB
DODB provides basic database operations such as storing, searching, modifying and removing data.
However, SQL databases have a few
.I properties
that enable a more standardized behavior and shape what the general public expects from a database.
These properties are called "ACID": atomicity, consistency, isolation and durability.
DODB doesn't fully handle ACID properties.

DODB doesn't provide
.I atomicity .
Instructions cannot be chained together and rolled back if one of them fails.

DODB doesn't handle
.I consistency .
There is currently no mechanism to prevent adding invalid values.

.I Isolation
is partially taken into account with a locking mechanism preventing race conditions.
Though, parallelism is mostly required to respond to a large number of clients at the same time.
Also, SQL databases require a communication with an inherent latency between the application and the database, slowing down the requests despite the fast algorithms to search for a value within the database.
Parallelism is required for SQL databases because of this latency (at least partially), which doesn't exist with DODB\*[*].
.FOOTNOTE1
FYI, the service
.I netlib.re
uses DODB and since the database is fast enough, parallelism isn't required despite enabling more than a thousand requests per second.
.FOOTNOTE2
With a cache, data is retrieved five hundred times quicker than with a SQL database.
Thus, parallelism is probably not needed but a locking mechanism is provided anyway, just in case; this may be overly simplistic but
.SHINE "good enough"
for most applications.

.I Durability
is taken into account.
Data is written on disk each time it changes.
Again, this is basic but
.SHINE "good enough"
for most applications.

.B "Discussion on ACID properties" .
The author of this document sees these database properties as a sort of "fail-safe".
Always nice to have, but not entirely necessary; at least not for every single application.
DODB will provide some form of atomicity and consistency at some point, but nothing fancy nor too advanced.
The whole point of the DODB project is to keep the code simple (almost
.B "stupidly"
simple).
Not handling these properties isn't a limitation of the DODB approach but a choice for this project\*[*].
.FOOTNOTE1
Which results from a lack of time, mostly.
.FOOTNOTE2

Not handling all the ACID properties within the DODB library doesn't mean they cannot be achieved.
Applications can have these properties, often with just a few lines of code.
They just don't come
.I "by default"
with the library\*[*].
.FOOTNOTE1
As a side note, the
.I consistency
property is often taken care of within the application despite being handled by the database, for various reasons.
.FOOTNOTE2
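
As an example of such application-side code, a consistency check can be a small function called before any insertion.
The sketch below is hypothetical: the
\f[CW]Car\f[]
record is a reduced version of the example structure,
\f[CW]add_checked\f[]
is not a DODB function, and an array stands in for the database since both accept the
\f[CW]<<\f[]
operator.
.QP
.SOURCE Ruby ps=10
# Hypothetical application-side consistency check; `store` stands for any
# container with a `<<` method (a DODB database, or an array as used here).
record Car, name : String, color : String

def add_checked(store, car : Car) : Bool
	return false if car.name.empty? || car.color.empty? # reject invalid values
	store << car
	true
end

cars = [] of Car
p add_checked(cars, Car.new("Corvet", "red")) # true
p add_checked(cars, Car.new("", "red"))       # false
p cars.size                                   # 1
.SOURCE
.QE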
.
.
.
.SECTION Experimental scenario
.LP
The following experiment shows the performance of DODB based on querying durations.
Data can be searched via
.I indexes ,
as for SQL databases.
Three possible indexes exist in DODB:
(a) basic indexes, representing 1 to 1 relations, the document's attribute is related to a value and each value of this attribute is unique,
(b) partitions, representing 1 to n relations, the attribute has a value and this value can be shared by other documents,
(c) tags, representing n to n relations, enabling the attribute to have multiple values, each of which can be shared by other documents.

The scenario is simple: values are added to a database with indexes (basic, partitions and tags), then a value is queried 100 times through each of the different indexes.
Loop and repeat.

Four instances of DODB are tested:
.BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory);
.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes;
.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*];
.BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
.ENDBULLET
.FOOTNOTE1
Having a cached database will probably be the most widespread use of DODB.
When memory isn't scarce, there is no reason not to use it to achieve better performance.
.FOOTNOTE2

The computer on which this test is performed\*[*] is an AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
When mentioned, the
.I disk
is actually a
.I "temporary file-system (tmpfs)"
to enable maximum efficiency.
.FOOTNOTE1
A very simple $50 PC, bought online.
Nothing fancy.
.FOOTNOTE2

The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
Nonetheless, despite a few technicalities, the objective of this document is to provide an insight on the approach used in DODB more than this particular implementation.

The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
.SOURCE Ruby ps=9 vs=9p
class Car
	property name     : String        # 1-1 relation
	property color    : String        # 1-n relation
	property keywords : Array(String) # n-n relation
end
.SOURCE
.
.
.SS Basic indexes (1 to 1 relations)
.LP
An index matches a single value based on a small string.
In our example, each \f[CW]car\f[] has a unique \fIname\f[] which is used as an index.

The following graph represents the result of 100 queries of a car based on its name.
The experiment starts with a database containing 1,000 cars and goes up to 250,000 cars.

.so graph_query_index.grap

Since there is only one value to retrieve, the request is quick and time is almost constant.
When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.

.ps -2
.TS
allbox tab(:);
c | lw(3.6i) | cew(1.4i).
DODB instance:Comment and database usage:T{
compared to RAM-only
T}
RAM only:T{
Worst memory footprint, best performance.
T}:-
Cached db and index:T{
Performance for retrieving a value is the same as RAM only while
enabling the admin to manually search for data on-disk.
T}:about the same perfs
Uncached db, cached index::300 to 400x slower
Uncached db and index:T{
Best memory footprint, worst performance.
T}:400 to 500x slower
.TE
.ps \n[PS]

.B Conclusion :
as expected, retrieving a single value is fast and the size of the database doesn't matter much.
Each deserialization and, more importantly, each disk access is a pain point.
Caching the value enables a massive performance gain, data can be retrieved several hundred times quicker.
.bp
.SS Partitions (1 to n relations)
.LP

.so graph_query_partition.grap

.bp
.SS Tags (n to n relations)
.LP
.so graph_query_tag.grap
.
.SECTION Future work
This section presents all the features I want to see in a future version of the DODB library.
.SS Cached database and indexes with selective memory
Right now, both the cached database and the cached indexes store any cached value indefinitely.
Giving the cache the ability to select the values to keep in memory would enable a massive speed-up even in memory-constrained environments.
The policy could be as simple as keeping in memory only the most recently requested values.

These new versions of cached database and indexes will become the standard, default DODB behavior.
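
One possible shape for such a policy is a least-recently-used cache.
The sketch below is hypothetical and does not describe any existing DODB code:
\f[CW]LRUCache\f[]
is an illustrative name, and the eviction relies on Crystal hashes keeping their insertion order.
.QP
.SOURCE Ruby ps=10
# Hypothetical least-recently-used cache (not part of DODB).
# Crystal hashes keep insertion order, so the first entry is the oldest one.
class LRUCache(K, V)
	def initialize(@capacity : Int32)
		@entries = Hash(K, V).new
	end

	def []=(key : K, value : V)
		@entries.delete key
		@entries[key] = value
		@entries.shift if @entries.size > @capacity # evict the oldest entry
	end

	def []?(key : K) : V?
		value = @entries.delete key
		return nil if value.nil?
		@entries[key] = value # move the entry to the "most recent" end
	end

	def keys
		@entries.keys
	end
end

cache = LRUCache(Int32, String).new 2
cache[0] = "Corvet"
cache[1] = "Bullet"
cache[0]?          # key 0 becomes the most recently used
cache[2] = "Spark" # the cache is full: key 1 is evicted
p cache.keys       # [0, 2]
.SOURCE
.QE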
.SS Pagination via the indexes: offset and limit
Right now, browsing the entire database by requesting a limited list at a time is possible, thanks to some functions accepting an
.I offset
and a
.I size .
However, this is not possible with the indexes, thus when querying for example a partition the API provides the entire list of matching values.
This is not acceptable for databases with large partitions and tags: memory will be over-used and requests will be slow.
.SECTION Conclusion
.TBD
-												Longer explanation of the experimental scenario.

											
										
										
											2024-05-13 02:24:59 +02:00
+								.so macros.roff
-												Stuff.

											
										
										
											2024-05-15 14:39:33 +02:00
+								.de TREE1
 								.QP
-												Side note about indexes.

											
										
										
											2024-05-16 15:23:19 +02:00
+								.ps -2
-												Stuff.

											
										
										
											2024-05-15 14:39:33 +02:00
+								.KS
 								.ft CW
 								.b1
 								.nf
 								..
 								.de TREE2
 								.ft
 								.fi
 								.b2
-												Side note about indexes.

											
										
										
											2024-05-16 15:23:19 +02:00
+								.ps
-												Stuff.

											
										
										
											2024-05-15 14:39:33 +02:00
+								.KE
 								.QE
 								..
 								.
-												DODB: basic usage and basic indexes.

											
										
										
											2024-05-16 02:18:19 +02:00
+								.de COMMAND
 								.I \\$*
 								..
 								.de DIRECTORY
 								.I \\$*
 								..
 								.
-												Stuff.

											
										
										
											2024-05-15 14:39:33 +02:00
+								. \" The document starts here.
 								.
-												DODB PDF.

											
										
										
											2024-05-13 21:46:02 +02:00
+								.TITLE Document Oriented DataBase (DODB)
-												Longer explanation of the experimental scenario.

											
										
										
											2024-05-13 02:24:59 +02:00
+								.AUTHOR Philippe P.
 								.ABSTRACT1
 								DODB is a database-as-library, enabling a very simple way to store applications' data: storing serialized
 								.I documents
 								(basically any data type) in plain files.
 								To speed-up searches, attributes of these documents can be used as indexes which leads to create a few symbolic links
 								.I symlinks ) (
 								on the disk.
-												DODB PDF.

											
										
										
											2024-05-13 21:46:02 +02:00
+								This document briefly presents DODB and its main differences with other database engines.
 								An experiment is described and analysed to understand the performance that can be expected from this approach.
-												Longer explanation of the experimental scenario.

											
										
										
											2024-05-13 02:24:59 +02:00
+								.ABSTRACT2
-												Let's shit on SQL a bit more.

											
										
										
											2024-05-14 13:51:13 +02:00
+								.SINGLE_COLUMN
-												DODB PDF.

											
										
										
											2024-05-13 21:46:02 +02:00
+								.SECTION Introduction to DODB
 								A database consists in managing data, enabling queries (preferably fast) to retrieve, to modify, to add and to delete a piece of information.
 								Anything else is
 								.UL accessory .
 								Universities all around the world teach about Structured Query Language (SQL) and relational databases.
-												Let's shit on SQL a bit more.

											
										
										
											2024-05-14 13:51:13 +02:00
+								.
 								.de PRIMARY_KEY
 								.I \\$1 \\$2 \\$3
 								..
 								.de FOREIGN_KEY
 								.I \\$1 \\$2 \\$3
 								..
-												DODB PDF.

											
										
										
											2024-05-13 21:46:02 +02:00
-												Let's shit on SQL a bit more.

											
										
										
											2024-05-14 13:51:13 +02:00
+								.UL "Relational databases"
 								are built around the idea to put data into
-												DODB PDF.

											
										
										
											2024-05-13 21:46:02 +02:00
+								.I tables ,
 								with typed columns so the database can optimize operations and storage.
 								A database is a list of tables with relations between them.
 								For example, let's imagine a database of a movie theater.
 								The database will have a
 								.I table
 								for the list of movies they have
 								.PRIMARY_KEY idmovie , (
 								title, duration, synopsis),
 								a table for the scheduling
 								.PRIMARY_KEY idschedule , (
 								.FOREIGN_KEY idmovie ,
 								.FOREIGN_KEY idroom ,
 								time slot),
 								a table for the rooms
 								.PRIMARY_KEY idroom , (
 								name), etc.
 								Tables have relations, for example the table "scheduling" has a column
 								.I idmovie
 								which points to entries in the "movie" table.

 								.UL "The SQL language"
 								enables arbitrary operations on databases: add, search, modify and delete entries.
 								Furthermore, SQL also covers administrative operations on the databases themselves: creating databases and tables, managing users with fine-grained authorizations, etc.
 								SQL is used between the application and the database, to perform operations and to provide results when due.
 								SQL is also used
 								.UL outside
 								the application, by admins for managing databases and potentially by some
 								.I non-developer
 								users to retrieve some data without a dedicated interface\*[*].
 								.FOOTNOTE1
 								One of the first objectives of SQL was to enable a class of
 								.I non-developer
 								users to talk directly to the database so they can access the data without bothering the developers.
 								This has value for many companies and organizations.
 								.FOOTNOTE2
 								Many tools were used or even developed over the years specifically to alleviate the inherent complexity and limitations of SQL.
 								For example, designing databases becomes difficult when the list of tables grows;
 								Unified Modeling Language (UML) is then used to provide a graphical overview of the relations between tables.
 								SQL databases may be fast to retrieve data despite complicated operations, but when multiple sequential operations are required they become slow because of all the back-and-forths with the application;
 								thus, SQL databases can be scripted to automate operations and provide a massive speed up
 								.I "stored procedures" , (
 								see
 								.I "PL/SQL" ).
 								Writing SQL requests requires a lot of boilerplate since there is no integration in the programming languages, leading to multiple function calls for any operation on the database;
 								thus, object-relational mapping (ORM) libraries were created to reduce the massive code duplication.
 								And so on.

 								For many reasons, SQL is not a silver bullet to
 								.I solve
 								the database problem.
 								Given the difficulties mentioned above, and because the original objectives of SQL are not universal\*[*], other database designs were created\*[*].
 								.FOOTNOTE1
 								To say the least!
 								Not everyone needs to let users access the database without going through the application.
 								For instance, writing a \f[I]blog\f[] for a small event or to share small stories about your life doesn't require manual operations on the database, fortunately.
 								.FOOTNOTE2
 								.FOOTNOTE1
 								A lot of designs won't be mentioned here.
 								The actual history of databases is often quite unclear since the categories of databases are sometimes vague, underspecified.
 								As mentioned, SQL is not a silver bullet and a lot of developers shifted towards other solutions; that's the important part.
 								.FOOTNOTE2
 								The NoSQL movement started because the stated goals of many actors from the early Web boom were different from those of SQL.
 								The need for very fast operations far exceeded what was practical at the moment with SQL.
 								This led to the use of more basic methods to manage data such as
 								.I "key-value stores" ,
 								which simply associate a value with an
 								.I index
 								for fast retrieval.
 								In this case, there is no need for the database to have
 								.I tables ,
 								data may be untyped, the entries may even have different attributes.
 								Since homogeneity is not necessary anymore, databases have fewer (or different) constraints.
 								Document-oriented databases are a sub-class of key-value stores, where metadata can be extracted from the entries for further optimizations.
 								And that's exactly what is being done in Document Oriented DataBase (DODB).

 								.UL "Contrary to SQL" ,
 								DODB has a very narrow scope: to provide a library to store, retrieve, modify and delete data.
 								In this way, DODB transforms any application into a database manager.
 								DODB doesn't provide an interactive shell, there is no request language to perform arbitrary operations on the database, no statistical optimizations of the requests based on query frequencies, etc.
 								Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread unix tools.
 								Simplicity is key.
 								.UL "Contrary to other NoSQL databases" ,
 								DODB doesn't provide an application but a library, nothing else.
 								The idea is to help developers to store their data themselves, not depending on
 								.I yet-another-all-in-one
 								massive tool.
 								The library writes (and removes) data on a storage device, has a few retrieval and update mechanisms and that's it\*[*].
 								.FOOTNOTE1
 								The lack of features
 								.I is
 								the feature.
 								Even with that motto, the tool still is expected to be convenient for most applications.
 								.FOOTNOTE2
 								This document provides extensive documentation on how DODB works and how to use it.
 								The presented code is in Crystal, like the DODB library for now, but keep in mind that this document is about the method more than the actual implementation: anyone could implement the exact same library in almost any other language.
 								Limitations are also clearly stated in a dedicated section.
 								A few experiments are described to provide an overview of the performance you can expect from this approach.
 								Finally, a conclusion is drawn based on a real-world usage of this library.
 								.
 								.SECTION How DODB works and basic usage
 								DODB is a hash table.
 								The key of the hash is an auto-incremented number and the value is the stored data.
 								The following section will explain how to use DODB for basic cases including the few added mechanisms to speed-up searches.
 								Also, the file-system representation of the data will be presented since it enables easy off-application searches.
 								.
 								.
 								.SS Before starting: the example database
 								First things first, the following code is the structure used in the rest of the document to present the different aspects of DODB.
 								This is a simple object
 								.I Car ,
 								with a name, a color and a list of associated keywords (fast, elegant, etc.).
 								.SOURCE Ruby ps=10
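 								class Car
 									# Presumably needed since values are stored as JSON (requires "json").
 									include JSON::Serializable
 									property name     : String
 									property color    : String
 									property keywords : Array(String)
 									# Constructor used by the examples: Car.new name, color, keywords
 									def initialize(@name, @color, @keywords)
 									end
 								end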
 								.SOURCE
 								.
 								.
 								.SS DODB basic usage
 								Let's create a DODB database for our cars.
 								.SOURCE Ruby ps=10
 								# Database creation
 								database = DODB::DataBase(Car).new "path/to/db-cars"
 								# Adding an element to the db
 								database << Car.new "Corvet", "red", ["elegant", "fast"]

 								# Reaching all objects in the database
 								database.each do |car|
 									pp! car
 								end
 								.SOURCE
 								When a value is added, it is serialized\*[*] and written in a dedicated file.
 								.FOOTNOTE1
 								Serialization is currently in JSON.
 								CBOR is a work-in-progress.
 								Nothing binds DODB to a particular format.
 								.FOOTNOTE2
 								The key of the hash is a number, auto-incremented, used as the name of the stored file.
 								The following example shows the content of the file system after adding the first car.
 								.TREE1
 								$ tree db-cars/
 								db-cars/
 								|-- data
 								|   `-- 0000000000   <- the first car in the database
 								`-- last-index
 								.TREE2
 								In this example, the directory
 								.I db-cars/data
 								contains the serialized value, with a formatted number as file name.
 								The file "0000000000" contains the following:
 								.QP
 								.SOURCE JSON ps=10
 								{
 								  "name": "Corvet",
 								  "color": "red",
 								  "keywords": [
 								    "elegant",
 								    "fast"
 								  ]
 								}
 								.SOURCE
 								The car is serialized as expected in the file
 								.I 0000000000 .
 								.QE
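 								The storage method itself fits in a few lines.
 								The following sketch uses only the Crystal standard library and is not the DODB implementation: the
 								.I TinyStore
 								class, its methods and the way the
 								.I last-index
 								file is used are assumptions made for the illustration.
 								.QP
 								.SOURCE Ruby ps=10
 								require "json"

 								# Hypothetical class, for illustration only (not the DODB API).
 								class TinyStore(V)
 									def initialize(@path : String)
 										Dir.mkdir_p "#{@path}/data"
 									end

 									# Add a value: pick the next key, then write the serialized
 									# value in a file named after this key.
 									def <<(value : V) : Int32
 										key = next_key
 										File.write file_for(key), value.to_json
 										key
 									end

 									# Retrieve a value: read the file named after the key.
 									def [](key : Int32) : V
 										V.from_json File.read(file_for(key))
 									end

 									# The file name is the key, zero-padded to 10 digits.
 									private def file_for(key : Int32) : String
 										"#{@path}/data/#{key.to_s.rjust(10, '0')}"
 									end

 									# Assumption: "last-index" holds the last attributed key.
 									private def next_key : Int32
 										file = "#{@path}/last-index"
 										key = File.exists?(file) ? File.read(file).to_i + 1 : 0
 										File.write file, key.to_s
 										key
 									end
 								end

 								# Usage, with a scratch directory chosen for the example.
 								store = TinyStore(Array(String)).new "/tmp/tiny-store"
 								key = store << ["elegant", "fast"]
 								pp! store[key]
 								.SOURCE
 								Each value ends up in its own file, so the content of this small database can be inspected with any text tool.
 								.QE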
 								.de FUNCTION_CALL
 								.I \\$*
 								..
 								.
 								.
 								Next, retrieving, modifying or deleting a value requires its key.
 								.
 								.QP
 								.SOURCE Ruby ps=10
 								# Get a value based on its key.
 								database[key]
 								# Update a value based on its key.
 								database[key] = new_value
 								# Delete a value based on its key.
 								database.delete key
 								.SOURCE
 								.QE
 								.
 								The function
 								.FUNCTION_CALL each_with_index
 								lists the entries with their keys.
 								.
 								.QP
 								.SOURCE Ruby ps=10
 								database.each_with_index do |value, key|
 									puts "#{key}: #{value}"
 								end
 								.SOURCE
 								.QE
 								Of course, browsing the entire database to find a value (or its key) is a waste of resources and isn't practical for any non-trivial database.
 								That is when indexes come into play.
 								.
 								.
 								.SS Indexes
 								Entries can be
 								.I indexed
 								based on their attributes.
 								There are currently three main ways to search for a value by its attributes: basic indexes, partitions and tags.
 								.
 								.SSS Basic indexes (1 to 1 relations)
 								Basic indexes represent one-to-one relations, such as an index in SQL.
 								In the Car database, each car has a dedicated (unique) name.
 								This
 								.I name
 								attribute can be used to speed-up searches.
 								.QP
 								.SOURCE Ruby ps=10
 								# Create an index based on the "name" attribute of the cars.
 								cars_by_name = database.new_index "name" do |car|
 									car.name
 								end
 								.SOURCE
 								Once the index has been created, every added or modified entry in the database will be indexed.
 								Adding an index (basic index, partition or tag) provides an object used to manipulate the database based on this index.
 								Let's call it an
 								.I "index object" .
 								.QE
 								.
 								The
 								.I "index object"
 								has several useful functions.
 								.QP
 								.SOURCE Ruby ps=10
 								# Retrieve the car named "Corvet".
 								corvet = cars_by_name.get? "Corvet"
 								# Modify the car named "Corvet".
 								new_car = Car.new "Corvet", "green", ["eco-friendly"]
 								cars_by_name.update "Corvet", new_car
 								# In case the index hasn't changed (the name attribute in this example),
 								# the update can be even simpler.
 								new_car = Car.new "Corvet", "green", ["eco-friendly"]
 								cars_by_name.update new_car
 								# Delete the car named "Corvet".
 								cars_by_name.delete "Corvet"
 								.SOURCE
 								A car can now be searched, modified or deleted based on its name.
 								.QE
 								.
 								.
 								On the file-system, indexes are represented as symbolic links.
 								.TREE1
 								storage
 								+-- data
 								|    `-- 0000000000   <- the car named "Corvet"
 								`-- indexes
 								      `-- by_name
 								          `-- Corvet -> ../../data/0000000000
 								.TREE2
 								.QP
 								As shown, the file "Corvet" is a symbolic link to a data file.
 								The name of the symlink file has been extracted from the value itself, which makes it possible to list all the cars by name with a simple
 								.COMMAND ls
 								in the
 								.DIRECTORY storage/indexes/by_name/
 								directory.
 								.QE
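 								The file-system side of a basic index can be reproduced by hand.
 								The following sketch uses only the Crystal standard library and a scratch directory (an arbitrary path chosen for the illustration); it is not the code of the DODB library.
 								.QP
 								.SOURCE Ruby ps=10
 								require "json"

 								# Illustrative layout, mirroring the tree above.
 								root = "/tmp/storage-sketch"
 								Dir.mkdir_p "#{root}/data"
 								Dir.mkdir_p "#{root}/indexes/by_name"

 								# The document, stored under its key.
 								File.write "#{root}/data/0000000000", %({"name":"Corvet","color":"red"})

 								# The index entry: a symlink named after the indexed attribute,
 								# with a target relative to the directory of the symlink.
 								link = "#{root}/indexes/by_name/Corvet"
 								File.delete link if File.symlink?(link)
 								File.symlink "../../data/0000000000", link

 								# Searching a car by name is a single read through the symlink.
 								car = JSON.parse File.read(link)
 								color = car["color"].as_s
 								puts color

 								# Equivalent of "ls storage/indexes/by_name/".
 								names = Dir.children "#{root}/indexes/by_name"
 								pp! names
 								.SOURCE
 								No directory scan and no parsing of unrelated entries is needed to find the car: the file system resolves the name.
 								.QE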
 								.
 								The basic indexes as shown in this section already give a taste of what is possible to do with DODB.
 								The following indexes will cover some other usual cases.
 								.
 								.
 								.SSS Partitions (1 to n relations)
 								An attribute can have a value that is shared by other entries in the database, such as the
 								.I color
 								attribute of our cars.
 								.SOURCE Ruby ps=10
 								# Create a partition based on the "color" attribute of the cars.
 								cars_by_color = database.new_partition "color" do |car|
 									car.color
 								end
 								.SOURCE
 								As with basic indexes, once the partition is requested from the database, every new or modified entry will be indexed.
 								.KS
 								Let's imagine having 3 cars, one is blue and the other two are red.
 								.TREE1
 								$ tree db-cars/
 								db-cars
 								+-- data
 								|    +-- 0000000000   <- this car is blue
 								|    +-- 0000000001   <- this car is red
 								|    `-- 0000000002   <- this car is red, too
 								|   ...
 								`-- partitions
 								      `-- by_color
 								        +-- blue
 								             `-- 0000000000 -> 0000000000
 								        `-- red
 								             +-- 0000000001 -> 0000000001
 								             `-- 0000000002 -> 0000000002
 								.TREE2
 								.QP
 								Listing all the blue cars is as simple as a
 								.COMMAND ls
 								in the
 								.DIRECTORY db-cars/partitions/by_color/blue
 								directory!
 								.QE
 								.KE
 								.
 								.
 								.
 								.SSS Tags (n to n relations)
 								Tags are basically partitions but the attribute can have multiple values.
 								.SOURCE Ruby ps=10
 								# Create a tag based on the "keywords" attribute of the cars.
 								cars_by_keywords = database.new_tags "keywords" do |car|
 									car.keywords
 								end
 								.SOURCE
 								As with other indexes, once the tag is requested from the database, every new or modified entry will be indexed.
 								.
 								.
 								.KS
 								Let's imagine having two cars with different associated keywords.
 								.TREE1
 								$ tree db-cars/
 								db-cars
 								+-- data
 								|    +-- 0000000000   <- this car is fast and cheap
 								|    `-- 0000000001   <- this car is fast and elegant
 								`-- tags
 								      `-- by_keywords
 								        +-- cheap
 								            `-- 0000000000 -> 0000000000
 								        +-- elegant
 								            `-- 0000000001 -> 0000000001
 								        `-- fast
 								            +-- 0000000000 -> 0000000000
 								            `-- 0000000001 -> 0000000001
 								.TREE2
 								.QP
 								Listing all the fast cars is as simple as a
 								.COMMAND ls
 								in the
 								.DIRECTORY db-cars/tags/by_keywords/fast
 								directory!
 								.QE
 								.KE
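 								The layout of partitions and tags can be sketched the same way: one directory per attribute value, containing one symlink per matching entry.
 								The code below is an illustration written with the Crystal standard library only; the scratch path and the variable names are assumptions, and the DODB library itself may proceed differently.
 								A partition is the same construction with a single value per entry.
 								.QP
 								.SOURCE Ruby ps=10
 								require "json"

 								root = "/tmp/db-cars-sketch"

 								# key => keywords, as in the example above.
 								cars = {
 									"0000000000" => ["fast", "cheap"],
 									"0000000001" => ["fast", "elegant"],
 								}

 								Dir.mkdir_p "#{root}/data"
 								cars.each do |key, keywords|
 									File.write "#{root}/data/#{key}", keywords.to_json
 									keywords.each do |keyword|
 										dir = "#{root}/tags/by_keywords/#{keyword}"
 										Dir.mkdir_p dir
 										link = "#{dir}/#{key}"
 										# One symlink per (keyword, car) pair, pointing to the data file.
 										File.symlink "../../../data/#{key}", link unless File.symlink?(link)
 									end
 								end

 								# Equivalent of "ls db-cars/tags/by_keywords/fast".
 								fast = Dir.children("#{root}/tags/by_keywords/fast").sort
 								pp! fast
 								.SOURCE
 								A car with several keywords simply appears in several directories, which is what makes the relation n to n.
 								.QE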
 								.
 								.
 								.
 								.SSS Side note about indexes
 								DODB presents a few possible indexes (basic indexes, partitions and tags) which respond to an obvious need for fast searches.
 								However, their implementation through symlinks is the result of a certain vision of how a database should behave in order to provide a practical way for users to sort the entries.
 								The implementation can be completely changed.
 								Also, other kinds of indexes could
 								.B easily
 								be implemented in addition to those presented.
 								The new indexes may have completely different objectives than providing a file-system representation of the data.
 								The following sections will precisely cover this aspect.
 								.
 								.
 								.SECTION DODB, slow? Nope. Let's talk about caches
 								The file-system representation (of data and indexes) is convenient for the administrator, but input-output operations on a file-system are slow.
 								Storing the data on a storage device is required to protect it from crashes and application restarts.
 								But data can be kept in memory for faster processing of requests.
 								The DODB library has an API close to a hash table.
 								Having a data cache is as simple as keeping a hash table in memory alongside the file-system storage; retrieval then becomes incredibly fast\*[*].
 								.FOOTNOTE1
 								Several hundred times faster, see the experiment section.
 								.FOOTNOTE2
 								Same thing for cached indexes.
 								Indexes can easily be cached, thanks to simple hash tables.
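 								As an illustration of the principle, a write-through data cache can be sketched as follows.
 								The
 								.I CachedFiles
 								class is hypothetical and is not the code of the DODB library: it only shows a hash table kept in memory in front of plain files.
 								.QP
 								.SOURCE Ruby ps=10
 								require "json"

 								# Hypothetical class, for illustration only (not the DODB API).
 								class CachedFiles(V)
 									@cache : Hash(Int32, V)

 									def initialize(@dir : String)
 										Dir.mkdir_p @dir
 										@cache = Hash(Int32, V).new
 									end

 									# Writes go to the storage device, then to the in-memory hash.
 									def []=(key : Int32, value : V)
 										File.write path(key), value.to_json
 										@cache[key] = value
 									end

 									# Reads are served from memory; the file is only read on a
 									# cache miss (for example after an application restart).
 									def [](key : Int32) : V
 										@cache[key] ||= V.from_json File.read(path(key))
 									end

 									private def path(key : Int32) : String
 										"#{@dir}/#{key.to_s.rjust(10, '0')}"
 									end
 								end

 								db = CachedFiles(Array(String)).new "/tmp/cached-sketch"
 								db[0] = ["elegant", "fast"]
 								pp! db[0]
 								.SOURCE
 								The data remains safe on the storage device while reads no longer touch the file system.
 								.QE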
 								.
 								.
 								.SS Cached database
 								A cached database has the same API as the other DODB databases.
 								.QP
 								.SOURCE Ruby ps=10
 								# Create a cached database
 								database = DODB::CachedDataBase(Car).new "path/to/db-cars"
 								.SOURCE
 								All operations of the
 								.I DODB::DataBase
 								class are available for
 								.I DODB::CachedDataBase .
 								.QE
 								.
 								.SS Cached indexes
 								Since indexes do not require nearly as much memory as caching the entire database, they are cached by default.
 								.
 								.
 								.SECTION RAM-only database for short-lived data
 								Databases are built around the objective to actually
 								.I store
 								data.
 								But sometimes the data has only the same lifetime as the application.
 								Stop the application and the data itself becomes irrelevant, which happens on several occasions, for instance when the application keeps track of the connected users.
 								This case is not covered by traditional databases; it is out-of-scope, and short-lived data is handled only within the application.
 								Yet, since DODB is a library and not a separate application (read: DODB is far faster), this usage of the database can be relevant.
 								Having the same API to handle both long and short-lived data can be useful.
 								Moreover, the previously mentioned indexes (basic indexes, partitions and tags) would also work the same way for these short-lived data.
 								Of course, in this case, the file-system representation may be completely irrelevant.
 								And for all these reasons, the
 								.I RAM-only
 								DODB database and
 								.I RAM-only
 								indexes were created.
 								Let's recap the advantages of the RAM-only DODB database.
 								The DODB API is the same for short-lived (read: temporary) and long-lived data.
 								This includes the same indexes too, so a file-system representation of the current state of the application is possible.
 								RAM-only also means incredible performance since DODB is only a
 								.I very
 								small layer over a hash table.
 								.SS RAM-only database
 								Instantiating a RAM-only database is as simple as the other options.
 								Moreover, this database has exactly the same API as the others, thus changing from one to another is painless.
 								.QP
 								.SOURCE Ruby ps=10
 								# RAM-only database creation
 								database = DODB::RAMOnlyDataBase(Car).new "path/to/db-cars"
 								.SOURCE
 								Yes, the path still is required which may be seen as a quirk but the rationale\*[*] is sound.
 								.QE
 								.FOOTNOTE1
 								A path is still required despite the database being only in memory for two reasons.
 								First, indexes can still be instantiated for the database, and those indexes can provide a file-system representation of the data.
 								Second, I worked enough already, leave me alone.
 								.FOOTNOTE2
 								.SS RAM-only indexes
 								All indexes have their RAM-only counterpart.
 								.QP
 								.SOURCE Ruby ps=10
 								# RAM-only basic indexes.
 								cars_by_name = database.new_RAM_index "name", &.name
 								# RAM-only partitions.
 								cars_by_colors = database.new_RAM_partition "color", &.color
 								# RAM-only tags.
 								cars_by_keywords = database.new_RAM_tags "keywords", &.keywords
 								.SOURCE
 								The API of the
 								.I "RAM-only index objects"
 								is exactly the same as the others.
 								.QE
 								As for the database API itself, changing from a version of an index to another is painless.
 								This way, one can opt for a cached index and, after some time not using the file-system representation, decide to change for its RAM-only version; a 4-character modification and nothing else.
 								.
 								.
 								.
 								.SECTION DODB and memory constraint
 								In contrast with the previous section, some environments have a memory constraint.
 								For example, in case the database is larger than the available memory, it won't be possible to use a data cache\*[*].
 								.FOOTNOTE1
 								Keep in mind that for the moment "cached database" means "all data in memory".
 								It is perfectly reasonable to have a cached database with a policy of keeping just a certain amount of values in memory, in order to limit the memory required by selecting the relevant values to keep in cache (the most recently used, for example).
 								But for now, the cached version keeps everything.
 								See the "Future work" section.
 								.FOOTNOTE2
 								.
 								.SS Uncached database
 								By default, the database (provided by
 								.I "DODB::DataBase" )
 								isn't cached.
 								.
 								.SS Uncached indexes
 								Cached indexes do not require a large amount of memory since the only stored data is an integer (the
 								.I key
 								of the data).
 								For that reason, indexes are cached by default.
 								But for highly memory-constrained environments, the cache can be removed.
 								.QP
 								.SOURCE Ruby ps=10
 								# Uncached basic indexes.
 								cars_by_name = cars.new_uncached_index "name", &.name
 								# Uncached partitions.
 								cars_by_colors = cars.new_uncached_partition "color", &.color
 								# Uncached tags.
 								cars_by_keywords = cars.new_uncached_tags "keywords", &.keywords
 								.SOURCE
 								The API of the
 								.I "uncached index objects"
 								is exactly the same as the others.
 								.QE
 								.
 								.
 								.
 								.SECTION Recap of the DODB API
 								.TBD
 								.SS Database creation
 								.SS Database update and deletion with the key
 								.SS Indexes creation
 								.SS Database update and deletion with an index
 								.SSS Tags: specific functions
 								.
 								.
 								.
 								.SECTION Limits of DODB
 								DODB provides basic database operations such as storing, searching, modifying and removing data.
 								However, SQL databases have a few
 								.I properties
 								that enable a more standardized behavior and that the general public has come to expect from a database.
 								These properties are called "ACID": atomicity, consistency, isolation and durability.
 								DODB doesn't fully handle ACID properties.
 								DODB doesn't provide
 								.I atomicity .
 								Instructions cannot be chained and rolled back if one of them fails.
 								DODB doesn't handle
 								.I consistency .
 								There is currently no mechanism to prevent adding invalid values.
 								.I Isolation
 								is partially taken into account with a locking mechanism preventing race conditions.
 								However, parallelism is mostly required to respond to a large number of clients at the same time.
 								Also, SQL databases require communication between the application and the database, with an inherent latency that slows down requests despite the fast algorithms used to search for a value within the database.
 								Parallelism is required for SQL databases because of this latency (at least partially), which doesn't exist with DODB\*[*].
 								.FOOTNOTE1
 								FYI, the service
 								.I netlib.re
 								uses DODB and since the database is fast enough, parallelism isn't required despite enabling more than a thousand requests per second.
 								.FOOTNOTE2
 								With a cache, data is retrieved five hundred times quicker than with a SQL database.
 								Thus, parallelism is probably not needed but a locking mechanism is provided anyway, just in case; this may be overly simplistic but
 								.SHINE "good enough"
 								for most applications.
 								.I Durability
 								is taken into account.
 								Data is written on disk each time it changes.
 								Again, this is basic but
 								.SHINE "good enough"
 								for most applications.
 								.B "Discussion on ACID properties" .
 								The author of this document sees these database properties as a sort of "fail-safe".
 								Always nice to have, but not entirely necessary; at least not for every single application.
 								DODB will provide some form of atomicity and consistency at some point, but nothing fancy nor too advanced.
 								The whole point of the DODB project is to keep the code simple (almost
 								.B "stupidly"
 								simple).
 								Not handling these properties isn't a limitation of the DODB approach but a choice for this project\*[*].
 								.FOOTNOTE1
 								Which results from a lack of time, mostly.
 								.FOOTNOTE2
 								Not handling all the ACID properties within the DODB library doesn't mean they cannot be achieved.
 								Applications can have these properties, often with just a few lines of code.
 								They just don't come
 								.I "by default"
 								with the library\*[*].
 								.FOOTNOTE1
 								As a side note, the
 								.I consistency
 								property is often taken care of within the application despite being handled by the database, for various reasons.
 								.FOOTNOTE2
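 								As a sketch of what "a few lines of code" can look like, the following example validates a value before storing it (consistency) and undoes the changes already made when one step of a sequence fails (atomicity).
 								It works on a plain hash standing in for the database; the
 								.I Car
 								record and the \f[CW]valid?\f[] and \f[CW]store_all\f[] helpers are hypothetical and not part of the DODB API.
 								.SOURCE Ruby ps=9 vs=9p
 								# Hypothetical document type, for illustration only.
 								record Car, name : String, color : String
 								# Consistency: the application decides what a valid value is.
 								def valid?(car : Car) : Bool
 									!car.name.empty? && !car.color.empty?
 								end
 								# Atomicity: store every car or none of them.
 								# `db` is a plain hash standing in for the database.
 								def store_all(db : Hash(String, Car), cars : Array(Car)) : Bool
 									added = [] of String
 									cars.each do |car|
 										if !valid?(car) || db.has_key?(car.name)
 											# Roll back what this call already stored.
 											added.each { |name| db.delete name }
 											return false
 										end
 										db[car.name] = car
 										added << car.name
 									end
 									true
 								end
 								db = Hash(String, Car).new
 								store_all db, [Car.new("Corvet", "red"), Car.new("Bullet-GT", "blue")]
 								# The second car is invalid: the whole batch is rolled back.
 								store_all db, [Car.new("Spacecar", "green"), Car.new("", "red")]
 								puts db.keys
 								.SOURCE
 								After both calls, only the two cars of the first batch remain in the hash.
 								A real application would apply the same pattern around its own storage calls.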
 								.
 								.
 								.
 								.SECTION Experimental scenario
 								.LP
 								The following experiment shows the performance of DODB based on querying durations.
 								Data can be searched via
 								.I indexes ,
 								as for SQL databases.
 								Three possible indexes exist in DODB:
 								(a) basic indexes, representing 1 to 1 relations, the document's attribute is related to a value and each value of this attribute is unique,
 								(b) partitions, representing 1 to n relations, the attribute has a value and this value can be shared by other documents,
 								(c) tags, representing n to n relations, enabling the attribute to have multiple values, each of which can be shared by other documents.
 								The scenario is simple: add values to a database with indexes (basic, partitions and tags), then query a value 100 times through each of the different indexes.
 								Loop and repeat.
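 								As a rough illustration of the three kinds of indexes, the following sketch builds them as plain in-memory hashes from a list of cars.
 								It does not use the DODB API: the
 								.I Car
 								record, the sample data and the hash names are hypothetical, chosen only to show the 1 to 1, 1 to n and n to n relations.
 								.SOURCE Ruby ps=9 vs=9p
 								# Hypothetical document type mirroring the three relations.
 								record Car, name : String, color : String, keywords : Array(String)
 								cars = [
 									Car.new("Corvet", "red", ["shiny", "impressive"]),
 									Car.new("Bullet-GT", "red", ["fast"]),
 									Car.new("Spacecar", "blue", ["shiny", "elegant"]),
 								]
 								# Each document is identified by an integer key (here, its position).
 								# 1 to 1: a name leads to a single key.
 								by_name = Hash(String, Int32).new
 								# 1 to n: a color leads to the keys of all the cars sharing it.
 								by_color = Hash(String, Array(Int32)).new
 								# n to n: a car has several keywords, a keyword has several cars.
 								by_keyword = Hash(String, Array(Int32)).new
 								cars.each_with_index do |car, key|
 									by_name[car.name] = key
 									(by_color[car.color] ||= [] of Int32) << key
 									car.keywords.each do |word|
 										(by_keyword[word] ||= [] of Int32) << key
 									end
 								end
 								puts by_name["Spacecar"]
 								puts by_color["red"]
 								puts by_keyword["shiny"]
 								.SOURCE
 								The indexes only hold keys, never the documents themselves, which is why keeping them in memory is cheap.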
 								Four instances of DODB are tested:
 								.BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory);
 								.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes;
 								.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*];
 								.BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
 								The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
 								.ENDBULLET
 								.FOOTNOTE1
 								Having a cached database will probably be the most widespread use of DODB.
 								When memory isn't scarce, there is no point not using it to achieve better performance.
 								.FOOTNOTE2
 								The computer on which this test is performed\*[*] is an AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
 								When mentioned, the
 								.I disk
 								is actually a
 								.I "temporary file-system (tmpfs)"
 								to enable maximum efficiency.
 								.FOOTNOTE1
 								A very simple $50 PC, bought online.
 								Nothing fancy.
 								.FOOTNOTE2
 								The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
 								Nonetheless, despite a few technicalities, the objective of this document is to provide insight into the approach used in DODB rather than into this particular implementation.
 								The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
 								.SOURCE Ruby ps=9 vs=9p
 								class Car
 									property name     : String        # 1-1 relation
 									property color    : String        # 1-n relation
 									property keywords : Array(String) # n-n relation
 								end
 								.SOURCE
 								.
 								.
 								.SS Basic indexes (1 to 1 relations)
 								.LP
 								An index enables matching a single value based on a small string.
 								In our example, each \f[CW]car\f[] has a unique \fIname\f[] which is used as an index.
 								The following graph represents the result of 100 queries of a car based on its name.
 								The experiment starts with a database containing 1,000 cars and goes up to 250,000 cars.
 								.so graph_query_index.grap
 								Since there is only one value to retrieve, the request is quick and time is almost constant.
 								When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
 								In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
 								The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.
 								.ps -2
 								.TS
 								allbox tab(:);
 								c | lw(3.6i) | cew(1.4i).
 								DODB instance:Comment and database usage:T{
 								compared to RAM-only
 								T}
 								RAM only:T{
 								Worst memory footprint, best performance.
 								T}:-
 								Cached db and index:T{
 								Performance for retrieving a value is the same as RAM only while
 								enabling the admin to manually search for data on-disk.
 								T}:about the same perfs
 								Uncached db, cached index::300 to 400x slower
 								Uncached db and index:T{
 								Best memory footprint, worst performance.
 								T}:400 to 500x slower
 								.TE
 								.ps \n[PS]
 								.B Conclusion :
 								as expected, retrieving a single value is fast and the size of the database doesn't matter much.
 								Each deserialization and, more importantly, each disk access is a pain point.
 								Caching the value enables a massive performance gain: data can be retrieved several hundred times quicker.
 								.bp
 								.SS Partitions (1 to n relations)
 								.LP
 								.so graph_query_partition.grap
 								.bp
 								.SS Tags (n to n relations)
 								.LP
 								.so graph_query_tag.grap
 								.
 								.SECTION Future work
 								This section presents all the features I want to see in a future version of the DODB library.
 								.SS Cached database and indexes with selective memory
 								Right now, both the cached database and the cached indexes store every cached value indefinitely.
 								Giving the cache the ability to select the values to keep in memory would enable a massive speed-up even in memory-constrained environments.
 								The policy could be as simple as keeping in memory only the most recently requested values.
 								These new versions of cached database and indexes will become the standard, default DODB behavior.
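 								A "most recently used" policy could look like the following sketch, which relies on the insertion order of Crystal hashes to track usage.
 								The \f[CW]LRUCache\f[] class is hypothetical: it is not part of DODB and only illustrates the eviction policy, not how it would be wired into the cached database or indexes.
 								.SOURCE Ruby ps=9 vs=9p
 								# Hypothetical bounded cache keeping only the most recently used values.
 								class LRUCache(K, V)
 									def initialize(@capacity : Int32)
 										# Insertion order doubles as usage order: oldest entry first.
 										@values = Hash(K, V).new
 									end
 									# Returns the cached value, or nil on a cache miss.
 									def get(key : K) : V?
 										return nil unless @values.has_key?(key)
 										value = @values[key]
 										# Delete then re-insert: the entry becomes the most recent one.
 										@values.delete key
 										@values[key] = value
 										value
 									end
 									def put(key : K, value : V) : Nil
 										@values.delete key
 										@values[key] = value
 										# Over capacity: evict the least recently used entry.
 										@values.shift if @values.size > @capacity
 									end
 									def size : Int32
 										@values.size
 									end
 								end
 								cache = LRUCache(Int32, String).new(2)
 								cache.put 0, "Corvet"
 								cache.put 1, "Bullet-GT"
 								# Reading key 0 makes it the most recently used entry.
 								cache.get 0
 								# The cache is full: key 1 is evicted.
 								cache.put 2, "Spacecar"
 								.SOURCE
 								On a cache miss, the caller would read the value from disk and
 								.I put
 								it in the cache, so memory use stays bounded by the chosen capacity.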
 								.SS Pagination via the indexes: offset and limit
 								Right now, browsing the entire database by requesting a limited list at a time is possible, thanks to some functions accepting an
 								.I offset
 								and a
 								.I size .
 								However, this is not possible with the indexes: when querying a partition, for example, the API returns the entire list of matching values.
 								This is not acceptable for databases with large partitions and tags: memory will be over-used and requests will be slow.
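 								The intended behavior can be sketched on a partition represented as a plain hash of keys.
 								The \f[CW]page\f[] helper and its parameters are hypothetical; this is only one possible shape for such a function, not the future DODB API.
 								.SOURCE Ruby ps=9 vs=9p
 								# Hypothetical helper: returns one page of the keys matching a partition.
 								# `partition` maps an attribute value to the keys of the matching documents.
 								def page(partition : Hash(String, Array(Int32)), value : String,
 									offset : Int32, limit : Int32) : Array(Int32)
 									keys = partition[value]? || [] of Int32
 									# Past the end of the list: nothing left to return.
 									return [] of Int32 if offset >= keys.size
 									# At most `limit` keys, starting at `offset`.
 									keys[offset, limit]
 								end
 								by_color = {"red" => [0, 1, 4, 7, 9], "blue" => [2, 3]}
 								# First page of the red cars.
 								puts page(by_color, "red", offset: 0, limit: 2)
 								# Last page, shorter than the limit.
 								puts page(by_color, "red", offset: 4, limit: 2)
 								.SOURCE
 								Only the documents whose keys are on the requested page then need to be read and deserialized.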
 								.SECTION Conclusion
 								.TBD