.so macros.roff
.TITLE Document Oriented DataBase (DODB)
.AUTHOR Philippe P.
.ABSTRACT1
DODB is a database-as-library, enabling a very simple way to store applications' data: storing serialized
.I documents
(basically any data type) in plain files.
To speed up searches, attributes of these documents can be used as indexes, which leads to the creation of a few symbolic links
.I symlinks ) (
on the disk.
.br

This document briefly presents DODB and its main differences from other database engines.
An experiment is described and analysed to understand the performance that can be expected from this approach.
.ABSTRACT2
.SECTION Introduction to DODB
A database manages data and enables queries (preferably fast ones) to retrieve, modify, add and delete pieces of information.
Anything else is
.UL accessory .

Universities all around the world teach Structured Query Language (SQL) and relational databases.

The main idea of relational databases is to put data into
.I tables ,
with typed columns so the database can optimize operations and storage.
A database is a list of tables with relations between them.
For example, a
.I table
can contain a list of users (their age, height, job, etc.).
When another

SQL enables arbitrary operations on databases: adding, modifying and deleting entries.
Furthermore, SQL even covers administrative operations on the databases themselves: managing users with fine-grained authorizations, creating databases and tables, etc.

Many tools have been used, or even developed, over the years specifically to alleviate the inherent complexity and limitations of SQL.
For example, Unified Modeling Language (UML) is used to design databases by providing a graphical overview of the relations between tables.
SQL databases can be scripted to automate operations and massively speed them up
.I "stored procedures" , (
see
.I "PL/SQL" ),
etc.

Document-oriented databases are key-value stores.
Furthermore, metadata is extracted for further optimization.

Contrary to SQL, DODB has a very narrow scope: to provide
Thus, DODB provides neither an interactive shell nor a query language to perform arbitrary operations on the database, etc.

.SECTION Basic usage
.SECTION A few more options
.SECTION Limits of DODB
.SECTION Experimental scenario
.LP
The following experiment shows the performance of DODB based on query durations.
Data can be searched via
.I indexes ,
as for SQL databases.
Three kinds of indexes exist in DODB:
(a) basic indexes, representing 1 to 1 relations: the document's attribute has a value and each value of this attribute is unique,
(b) partitions, representing 1 to n relations: the attribute has a value and this value can be shared by other documents,
(c) tags, representing n to n relations: the attribute can have multiple values, which can be shared by other documents.
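.LP
As a rough illustration of these three kinds of indexes, the sketch below stores one \f[CW]Car\f[] in a plain file and creates one symlink per indexed value.
It is not DODB's code: the directory names (\f[CW]data\f[], \f[CW]by_name\f[], \f[CW]by_color\f[], \f[CW]by_keyword\f[]), the JSON serialization, the numeric key and the sample car are all assumptions made for this example.
.SOURCE Ruby ps=9 vs=9p
require "json"
require "file_utils"

# Hypothetical document type, mirroring the Car class of the benchmark.
class Car
  include JSON::Serializable
  property name : String
  property color : String
  property keywords : Array(String)

  def initialize(@name, @color, @keywords)
  end
end

# Assumed on-disk layout, for illustration only.
root = File.join(Dir.tempdir, "dodb-index-sketch")
FileUtils.rm_rf(root)
Dir.mkdir_p(File.join(root, "data"))

car = Car.new("Corvet", "red", ["fast", "elegant"])
key = "0000000001"
data = File.join(root, "data", key)
File.write(data, car.to_json)

# Basic index (1 to 1): a single symlink named after the unique value.
Dir.mkdir_p(File.join(root, "by_name"))
File.symlink(data, File.join(root, "by_name", car.name))

# Partition (1 to n): one directory per value, one symlink per document.
Dir.mkdir_p(File.join(root, "by_color", car.color))
File.symlink(data, File.join(root, "by_color", car.color, key))

# Tags (n to n): the document is linked from each of its values.
car.keywords.each do |keyword|
  Dir.mkdir_p(File.join(root, "by_keyword", keyword))
  File.symlink(data, File.join(root, "by_keyword", keyword, key))
end

# Querying by name follows the symlink and deserializes the document.
found = Car.from_json(File.read(File.join(root, "by_name", "Corvet")))
puts found.color # red
.SOURCE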
.LP
The scenario is simple: add values to a database with indexes (basic, partitions and tags), then query a value 100 times based on the different indexes.
Loop and repeat.
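.LP
The shape of such a measurement loop can be sketched as follows.
This is only an illustration under stated assumptions, not the code of the actual benchmark: an in-memory hash stands in for the database, and the sizes and key names are arbitrary.
\f[CW]Benchmark.realtime\f[] comes from Crystal's standard library.
.SOURCE Ruby ps=9 vs=9p
require "benchmark"

queries = 100
timings = Hash(Int32, Float64).new

# Grow the (stand-in) database, then time the same query 100 times.
[1_000, 10_000, 100_000].each do |size|
  db = Hash(String, String).new
  size.times { |i| db["car-#{i}"] = "red" }

  elapsed = Benchmark.realtime do
    queries.times { db["car-0"] }
  end

  # Average duration of a single query, in nanoseconds.
  timings[size] = elapsed.total_nanoseconds / queries
  puts "#{size} cars: #{timings[size]} ns per query"
end
.SOURCE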
.LP
Four instances of DODB are tested:
.BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in memory);
.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes;
.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*];
.BULLET \fIRAM only\f[]: the database doesn't have a representation on disk (no data is written on it).
The \fIRAM only\f[] instance shows a possible way to use DODB: keeping a consistent API to store data, including in-memory data whose lifetime is tied to the application's.
.ENDBULLET
.FOOTNOTE1
Having a cached database will probably be the most widespread use of DODB.
When memory isn't scarce, there is no point in not using it to achieve better performance.
.FOOTNOTE2
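.LP
The difference between these instances boils down to which lookups reach the disk.
The sketch below illustrates the idea of a cache in front of plain files; it is not DODB's API.
\f[CW]CachedStore\f[] is a hypothetical class, and the file name and content are made up for the example: a value is read from disk only the first time it is requested.
.SOURCE Ruby ps=9 vs=9p
require "file_utils"

# Hypothetical read-through cache in front of plain files.
class CachedStore
  getter disk_reads = 0

  def initialize(@directory : String)
    @cache = Hash(String, String).new
  end

  # Returns the cached value, reading the file only on a cache miss.
  def get(key : String) : String
    @cache[key] ||= begin
      @disk_reads += 1
      File.read(File.join(@directory, key))
    end
  end
end

directory = File.join(Dir.tempdir, "dodb-cache-sketch")
FileUtils.rm_rf(directory)
Dir.mkdir_p(directory)
File.write(File.join(directory, "corvet"), "red")

store = CachedStore.new(directory)
100.times { store.get("corvet") }
puts store.disk_reads # 1: only the first query reads the file
.SOURCE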
.LP
The computer on which this test is performed\*[*] is an AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
When mentioned, the
.I disk
is actually a
.I "temporary file-system (tmpfs)"
to enable maximum efficiency.
.FOOTNOTE1
A very simple $50 PC, bought online.
Nothing fancy.
.FOOTNOTE2

The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
Nonetheless, despite a few technicalities, the objective of this document is to provide insight into the approach used in DODB rather than into this particular implementation.

The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
.SOURCE Ruby ps=9 vs=9p
class Car
  property name : String            # 1-1 relation
  property color : String           # 1-n relation
  property keywords : Array(String) # n-n relation
end
.SOURCE
.
.SS Basic indexes (1 to 1 relations)
.LP
An index matches a single value based on a small string.
In our example, each \f[CW]car\f[] has a unique \fIname\f[] which is used as an index.

The following graph represents the result of 100 queries of a car based on its name.
The experiment starts with a database containing 1,000 cars and goes up to 250,000 cars.

.so graph_query_index.grap

Since there is only one value to retrieve, the request is quick and time is almost constant.
When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
When the value is on disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.

.TS
allbox tab(:);
c | lw(4.0i) | cew(1.4i).
DODB instance:Comment and database usage:T{
compared to RAM only
T}
RAM only:T{
Worst memory footprint (all data must be in memory), best performance.
T}:-
Cached db and index:T{
Performance for retrieving a value is the same as RAM only while
enabling the admin to manually search for data on-disk.
T}:about the same performance
Uncached db, cached index::300 to 400x slower
Uncached db and index:T{
Best memory footprint, worst performance.
T}:400 to 500x slower
.TE

.B Conclusion :
as expected, retrieving a single value is fast and the size of the database doesn't matter much.
Each deserialization and, more importantly, each disk access is a pain point.
Caching the value enables a massive performance gain: data can be retrieved several hundred times faster.
.bp
.SS Partitions (1 to n relations)
.LP

.so graph_query_partition.grap

.bp
.SS Tags (n to n relations)
.LP
.so graph_query_tag.grap