From 02e7e82fa1605983ce5ca9f1cc076d413ffd5e47 Mon Sep 17 00:00:00 2001
From: Philippe PITTOLI
Date: Wed, 29 May 2024 03:56:11 +0200
Subject: [PATCH] Paper

---
 paper/bibliography |  6 ++++++
 paper/paper.ms     | 43 +++++++++++++++++++++++++++++++++----------
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/paper/bibliography b/paper/bibliography
index 075b055..d1ccce4 100644
--- a/paper/bibliography
+++ b/paper/bibliography
@@ -4,3 +4,9 @@
 %T RFC 8949, Concise Binary Object Representation (CBOR)
 %D 2020
 %I Internet Engineering Task Force (IETF)
+
+%K JSON
+%A Tim Bray
+%T RFC 8259, The JavaScript Object Notation (JSON) Data Interchange Format
+%D 2017
+%I Internet Engineering Task Force (IETF)
diff --git a/paper/paper.ms b/paper/paper.ms
index c542759..280f47c 100644
--- a/paper/paper.ms
+++ b/paper/paper.ms
@@ -211,7 +211,14 @@ end
 When a value is added, it is serialized\*[*] and written in a dedicated file.
 .FOOTNOTE1
 Serialization is currently in JSON.
-CBOR is a work-in-progress.
+.[
+JSON
+.]
+CBOR
+.[
+CBOR
+.]
+is a work-in-progress.
 Nothing binds DODB to a particular format.
 .FOOTNOTE2
 The key of the hash is a number, auto-incremented, used as the name of the stored file.
@@ -896,16 +903,18 @@ Three possible indexes exist in DODB:
 The scenario is simple: adding values to a database with indexes (basic, partitions and tags)
 then query 100 times a value based on the different indexes.
 Loop and repeat.
-Four instances of DODB are tested:
+Five instances of DODB are tested:
 .BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory);
-.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes;
-.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*];
+.BULLET \fIuncached database but cached index\f[] shows the improvement you can expect by having a cache on indexes;
+.BULLET \fIcommon database\f[] shows the most basic use of DODB, with a limited cache (100k entries)\*[*];
+.BULLET \fIcached database\f[] represents a database with all the entries in cache (no eviction mechanism);
 .BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
 The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
 .ENDBULLET
 .FOOTNOTE1
 Having a cached database will probably be the most widespread use of DODB.
 When memory isn't scarce, there is no point not using it to achieve better performance.
+Moreover, the "common database" makes it possible to configure the cache size, so this database remains relevant even when the data-set is bigger than the available memory.
 .FOOTNOTE2
 The computer on which this test is performed\*[*] is a AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
 When mentioned, the
@@ -955,26 +964,40 @@ This is slightly more (about 200 ns) for Common database since there is a few mo
 In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db\f[]).
 The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]);
 in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.
-The logarithmic scale version of this figure shows that RAM-only and Cached databases have exactly the same performance.
-The Common database is somewhat slower than these two due to the caching policy: when a value is asked, the Common database puts its key at the start of a list to represent a
+The logarithmic scale version of this figure shows that \fIRAM-only\f[] and \fIcached\f[] databases have exactly the same performance.
+The \fIcommon\f[] database is somewhat slower than these two due to the caching policy: when a value is requested, the \fIcommon\f[] database puts its key at the start of a list to represent a
 .I recent
 use of this data (respectively, the last values in this list are the least recently used entries).
-Thus, Common database takes 80 ns for its caching policy, which makes this database about 67% slower than the previous ones to retrieve a value.
+Thus, the \fIcommon\f[] database spends 80 ns on its caching policy, which makes it about 67% slower than the previous ones to retrieve a value.
 Uncached databases are far away from these results, as shown by the logarithmically scaled figure.
-The data cache improves the duration of the requests, this makes them at least a hundred times faster.
+The data cache reduces the duration of the requests, making them at least 170 times faster.
 The results depend on the data size; the bigger the data, the slower the serialization (and deserialization).
+In this example, the database entries are almost empty; they have very few attributes and not much content (a few dozen characters at most).
+Thus, the performance of non-cached databases will be even more severely impacted with real-world data.
 That is why alternative encodings, such as CBOR,
 .[
 CBOR
 .]
 should be considered for large databases.
-
 .SS Partitions (1 to n relations)
-.LP
+The previous example showed the retrieval of a single value from the database.
+The following shows what happens when thousands of entries are retrieved.
+
+A partition index makes it possible to match a list of entries based on an attribute.
+In the experiment, a database of cars is created along with a partition on their color.
+Performance is analyzed based on the partition size (the number of red cars) and the time required to retrieve all the entries.
+
 .ps -2
 .so graphs/query_partition.grap
 .ps \n[PS]
+.QP
+This figure shows the retrieval of cars based on a partition (their color), with both a linear and a logarithmic scale.
+.QE
+In this example, both the linear and the logarithmic scales are shown to make the differences between the databases easier to grasp.
+The linear scale shows the linearity of the request time for uncached databases.
+In turn, the logarithmically scaled figure does the same for cached databases,
+which are flattened in the linear scale since they are all hundreds of times quicker than the uncached ones.
 .SS Tags (n to n relations)
 .LP
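
Note on the caching policy described in the "common database" paragraph above: the following is a minimal Crystal sketch of an LRU-style cache in which a hit moves the key to the front of a recently-used list and the least recently used key is evicted once the configured cache size is exceeded. The class and method names (LRUCache, get?, put) are hypothetical illustrations, not the actual DODB API.

# Minimal sketch of an LRU-style caching policy (hypothetical names,
# not the DODB API): recently used keys stay at the front of a list,
# the least recently used key is evicted when the cache is full.
class LRUCache(V)
  def initialize(@max_entries : Int32)
    @data = {} of Int32 => V   # key => cached (deserialized) value
    @recent = [] of Int32      # most recently used keys first
  end

  # On a hit, move the key to the front of the recently-used list:
  # this is the small bookkeeping cost mentioned in the text.
  def get?(key : Int32) : V?
    if value = @data[key]?
      @recent.delete key
      @recent.unshift key
      value
    end
  end

  # Insert a value; evict the least recently used entry when the
  # configured cache size is exceeded.
  def put(key : Int32, value : V)
    @data[key] = value
    @recent.delete key
    @recent.unshift key
    if @recent.size > @max_entries
      evicted = @recent.pop
      @data.delete evicted
    end
  end
end

# Usage: a cache limited to two entries, to show the eviction order.
cache = LRUCache(String).new(2)
cache.put(0, "first car")
cache.put(1, "second car")
cache.get?(0)               # key 0 becomes the most recently used
cache.put(2, "third car")   # key 1 (least recently used) is evicted
puts cache.get?(1).nil?     # => true

Under such a policy, a cache hit costs a hash lookup plus the list bookkeeping, which corresponds to the small constant overhead (on the order of 80 ns in the measurements above) that separates the common database from the fully cached one.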