.so macros.roff
.TITLE Document Oriented DataBase (DODB)
.AUTHOR Philippe P.
.ABSTRACT1
DODB is a database-as-library, enabling a very simple way to store applications' data: storing serialized
.I documents
(basically any data type) in plain files.
To speed up searches, attributes of these documents can be used as indexes, which leads to the creation of a few symbolic links
.I symlinks ) (
on the disk.

This document briefly presents DODB and its main differences from other database engines.
An experiment is described and analysed to understand the performance that can be expected from this approach.
.ABSTRACT2
.SINGLE_COLUMN
.SECTION Introduction to DODB
A database manages data, enabling queries (preferably fast) to retrieve, modify, add and delete pieces of information.
Anything else is
.UL accessory .

Universities all around the world teach Structured Query Language (SQL) and relational databases.
.
.de PRIMARY_KEY
.I \\$1 \\$2 \\$3
..
.de FOREIGN_KEY
.I \\$1 \\$2 \\$3
..

.UL "Relational databases"
are built around the idea of putting data into
.I tables ,
with typed columns so the database can optimize operations and storage.
A database is a list of tables with relations between them.
For example, let's imagine a database of a movie theater.
The database will have a
.I table
for the list of movies they have
.PRIMARY_KEY idmovie , (
title, duration, synopsis),
a table for the scheduling
.PRIMARY_KEY idschedule , (
.FOREIGN_KEY idmovie ,
.FOREIGN_KEY idroom ,
time slot),
a table for the rooms
.PRIMARY_KEY idroom , (
name), etc.
Tables have relations: for example, the table "scheduling" has a column
.I idmovie
which points to entries in the "movie" table.

.UL "The SQL language"
enables arbitrary operations on databases: add, search, modify and delete entries.
Furthermore, SQL also covers administrative operations on the databases themselves: creating databases and tables, managing users with fine-grained authorizations, etc.
This language is used in applications to perform operations on the database, binding the code with the database.
SQL is also used
.UL outside
the application, by admins for managing databases and potentially by some technical users to retrieve some data without a dedicated interface\*[*].
.FOOTNOTE1
One of the first objectives of SQL was to enable a class of
.I non-developer
users to talk directly to the database so they can access the data without bothering the developers.
.FOOTNOTE2

Many tools were used or even developed over the years specifically to alleviate the inherent complexity and limitations of SQL.
For example, designing databases becomes difficult when the list of tables grows;
Unified Modeling Language (UML) is then used to provide a graphical overview of the relations between tables.
SQL databases may be fast to retrieve data despite complicated operations, but when multiple sequential operations are required they become slow because of all the back-and-forth with the application;
thus, SQL databases can be scripted to automate operations and provide a massive speed-up
.I "stored procedures" , (
see
.I "PL/SQL" ).
Writing SQL requests requires a lot of boilerplate since there is no integration in the programming languages, leading to multiple function calls for any operation on the database;
thus, object-relational mapping (ORM) libraries were created to reduce the massive code duplication.
And so on.

For many reasons, SQL is not a silver bullet to
.I solve
the database problem.
Given the difficulties mentioned above, and since the original objectives of SQL are not universal\*[*], other database designs were created\*[*].
.FOOTNOTE1
To say the least!
Not everyone needs to let users access the database without going through the application.
For instance, writing a \f[I]blog\f[] for a small event or to share small stories about your life doesn't require manual operations on the database, fortunately.
.FOOTNOTE2
.FOOTNOTE1
A lot of designs won't be mentioned here.
The actual history of databases is often quite unclear since the categories of databases are sometimes vague or underspecified.
As mentioned, SQL is not a silver bullet and a lot of developers shifted towards other solutions; that's the important part.
.FOOTNOTE2
The NoSQL movement started because the stated goals of many actors from the early Web boom were different from those of SQL.
The need for very fast operations far exceeded what was practical at the time with SQL.
This led to the use of more basic methods to manage data such as
.I "key-value stores" ,
which simply associate a value with an
.I index
for fast retrieval.
In this case, there is no need for the database to have
.I tables :
data may be untyped and the entries may even have different attributes.
Since homogeneity is not necessary anymore, databases have fewer (or different) constraints.
Document-oriented databases are a sub-class of key-value stores, where metadata can be extracted from the entries for further optimizations.
And that's exactly what is being done in Document Oriented DataBase (DODB).

Contrary to SQL, DODB has a very narrow scope: to provide a library to store, retrieve, modify and delete data.
In this way, DODB transforms any application into a database manager.
DODB doesn't provide an interactive shell, there is no request language to perform arbitrary operations on the database, no statistical optimizations of the requests based on query frequencies, etc.
Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread Unix tools.
Simplicity is key.
.
.SECTION Basic usage
.
.SECTION A few more options
.
.SECTION Limits of DODB
.
.SECTION Experimental scenario
.LP
The following experiment shows the performance of DODB based on query durations.
Data can be searched via
.I indexes ,
as for SQL databases.
Three kinds of indexes exist in DODB:
(a) basic indexes, representing 1 to 1 relations: the document's attribute holds a value and each value of this attribute is unique,
(b) partitions, representing 1 to n relations: the attribute has a value and this value can be shared by other documents,
(c) tags, representing n to n relations: the attribute can have multiple values, each of which can be shared by other documents.

The scenario is simple: add values to a database with indexes (basic, partitions and tags), then query a value 100 times through each of the different indexes.
Loop and repeat.
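.LP
The scenario above can be pictured with the following Crystal sketch.
It is not the actual benchmark (\f[CW]spec/benchmark-cars.cr\f[]): a plain \f[CW]Hash\f[] stands in for a DODB index, and \f[CW]fill_index\f[] and \f[CW]time_queries\f[] are hypothetical names introduced only for this illustration.
.SOURCE Ruby ps=9 vs=9p
# Hypothetical sketch: a Hash stands in for a DODB index.
def fill_index(count : Int32) : Hash(String, String)
  index = Hash(String, String).new
  count.times { |i| index["car-#{i}"] = "data of car #{i}" }
  index
end

# Run `queries` lookups of the same key, return the mean duration.
def time_queries(index : Hash(String, String), key : String, queries = 100) : Time::Span
  elapsed = Time.measure { queries.times { index[key] } }
  elapsed / queries
end

# Loop and repeat with a growing database.
[1_000, 10_000, 50_000].each do |size|
  index = fill_index(size)
  mean = time_queries(index, "car-#{size - 1}")
  puts "#{size} cars: #{mean.total_nanoseconds} ns per query"
end
.SOURCE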
.LP
Four instances of DODB are tested:
.BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory);
.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes;
.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*];
.BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
.ENDBULLET
.FOOTNOTE1
Having a cached database will probably be the most widespread use of DODB.
When memory isn't scarce, there is no point not using it to achieve better performance.
.FOOTNOTE2
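.LP
The difference between a cached and an uncached instance can be sketched as follows.
This is a minimal illustration under stated assumptions, not DODB's implementation: \f[CW]TinyStore\f[] is a hypothetical class, values are plain strings and each document is a single file named after its key.
.SOURCE Ruby ps=9 vs=9p
require "file_utils"

# Hypothetical store: one file per document, optional in-memory cache.
class TinyStore
  def initialize(@dir : String, @cached : Bool)
    @cache = Hash(String, String).new
    Dir.mkdir_p @dir
  end

  def []=(key : String, value : String)
    File.write File.join(@dir, key), value
    @cache[key] = value if @cached
  end

  def [](key : String) : String
    if @cached
      @cache[key] # cached database: no disk access
    else
      File.read File.join(@dir, key) # uncached database: read the file back
    end
  end
end

dir = File.join(Dir.tempdir, "tinystore-#{Random.rand(1_000_000)}")
cached = TinyStore.new dir, cached: true
uncached = TinyStore.new dir, cached: false
cached["corvet"] = "red"
puts cached["corvet"]   # served from memory
puts uncached["corvet"] # served from the file written above
FileUtils.rm_rf dir
.SOURCE
.LP
A \fIRAM only\f[] variant of this sketch would simply skip the \f[CW]File.write\f[] call.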
.LP
The computer on which this test is performed\*[*] is an AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
When mentioned, the
.I disk
is actually a
.I "temporary file-system (tmpfs)"
to enable maximum efficiency.
.FOOTNOTE1
A very simple $50 PC, bought online.
Nothing fancy.
.FOOTNOTE2

The library is written in Crystal and so is the benchmark (\f[CW]spec/benchmark-cars.cr\f[]).
Nonetheless, despite a few technicalities, the objective of this document is to provide an insight into the approach used in DODB more than into this particular implementation.

The manipulated data type can be found in \f[CW]spec/db-cars.cr\f[].
.SOURCE Ruby ps=9 vs=9p
class Car
  property name : String            # 1-1 relation
  property color : String           # 1-n relation
  property keywords : Array(String) # n-n relation
end
.SOURCE
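.LP
The three kinds of indexes described earlier can be pictured as three in-memory lookup tables over this data type.
The following is only a sketch under stated assumptions: the constructor, the sample cars and the \f[CW]by_name\f[], \f[CW]by_color\f[] and \f[CW]by_keyword\f[] tables are hypothetical and do not reflect the library's API.
.SOURCE Ruby ps=9 vs=9p
# Same attributes as above, plus a constructor added for this sketch.
class Car
  property name : String            # 1-1 relation
  property color : String           # 1-n relation
  property keywords : Array(String) # n-n relation

  def initialize(@name, @color, @keywords)
  end
end

# Hypothetical sample data.
cars = [
  Car.new("Corvet", "red", ["shiny", "impressive"]),
  Car.new("Bullet-GT", "red", ["shiny", "fast"]),
  Car.new("GTI", "blue", ["fast"]),
]

by_name = Hash(String, Car).new           # basic index: one car per name
by_color = Hash(String, Array(Car)).new   # partition: many cars per color
by_keyword = Hash(String, Array(Car)).new # tags: many cars per keyword

cars.each do |car|
  by_name[car.name] = car
  (by_color[car.color] ||= [] of Car) << car
  car.keywords.each { |k| (by_keyword[k] ||= [] of Car) << car }
end

puts by_name["GTI"].color                      # blue
puts by_color["red"].map(&.name).join(", ")    # Corvet, Bullet-GT
puts by_keyword["fast"].map(&.name).join(", ") # Bullet-GT, GTI
.SOURCE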
.
.SS Basic indexes (1 to 1 relations)
.LP
A basic index matches a single value based on a small string.
In our example, each \f[CW]car\f[] has a unique \fIname\f[] which is used as an index.

The following graph represents the result of 100 queries of a car based on its name.
The experiment starts with a database containing 1,000 cars and goes up to 250,000 cars.

.so graph_query_index.grap

Since there is only one value to retrieve, the request is quick and time is almost constant.
When the value and the index are kept in memory (see \f[CW]RAM only\f[] and \f[CW]Cached db\f[]), the retrieval is almost instantaneous (about 50 to 120 ns).
In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db, cached index\f[]).
The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]); in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.

.TS
allbox tab(:);
c | lw(4.0i) | cew(1.4i).
DODB instance:Comment and database usage:T{
Performance compared to RAM only
T}
RAM only:T{
Worst memory footprint (all data must be in memory), best performance.
T}:-
Cached db and index:T{
Performance for retrieving a value is the same as RAM only while
enabling the admin to manually search for data on-disk.
T}:about the same performance
Uncached db, cached index::300 to 400x slower
Uncached db and index:T{
Best memory footprint, worst performance.
T}:400 to 500x slower
.TE

.B Conclusion :
as expected, retrieving a single value is fast and the size of the database doesn't matter much.
Each deserialization and, more importantly, each disk access is a pain point.
Caching the value enables a massive performance gain: data can be retrieved several hundred times quicker.
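.LP
The steps paid for by an uncached lookup (follow a symlink, read the file, deserialize) can be sketched as follows.
The directory layout, the file names and the use of JSON are assumptions made for this illustration and are not necessarily those of DODB.
.SOURCE Ruby ps=9 vs=9p
require "json"
require "file_utils"

# Assumed layout (not necessarily DODB's):
#   root/data/<id>.json               serialized documents
#   root/indexes/by_name/<name>.json  symlink to the document
root = File.join(Dir.tempdir, "dodb-sketch-#{Random.rand(1_000_000)}")
data = File.join(root, "data")
by_name = File.join(root, "indexes", "by_name")
Dir.mkdir_p data
Dir.mkdir_p by_name

# Store one document and index it by name.
doc = File.join(data, "0000000001.json")
File.write doc, {name: "Corvet", color: "red"}.to_json
File.symlink doc, File.join(by_name, "Corvet.json")

# Uncached lookup: follow the symlink, read the file, deserialize.
car = JSON.parse(File.read(File.join(by_name, "Corvet.json")))
puts car["color"].as_s # red

FileUtils.rm_rf root
.SOURCE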
.bp
.SS Partitions (1 to n relations)
.LP

.so graph_query_partition.grap

.bp
.SS Tags (n to n relations)
.LP
.so graph_query_tag.grap