From 02e7e82fa1605983ce5ca9f1cc076d413ffd5e47 Mon Sep 17 00:00:00 2001
From: Philippe PITTOLI
Date: Wed, 29 May 2024 03:56:11 +0200
Subject: [PATCH] Paper

---
 paper/bibliography |  6 ++++++
 paper/paper.ms     | 43 +++++++++++++++++++++++++++++++++----------
 2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/paper/bibliography b/paper/bibliography
index 075b055..d1ccce4 100644
--- a/paper/bibliography
+++ b/paper/bibliography
@@ -4,3 +4,9 @@
 %T RFC 8949, Concise Binary Object Representation (CBOR)
 %D 2020
 %I Internet Engineering Task Force (IETF)
+
+%K JSON
+%A Tim Bray
+%T RFC 8259, The JavaScript Object Notation (JSON) Data Interchange Format
+%D 2017
+%I Internet Engineering Task Force (IETF)
diff --git a/paper/paper.ms b/paper/paper.ms
index c542759..280f47c 100644
--- a/paper/paper.ms
+++ b/paper/paper.ms
@@ -211,7 +211,14 @@ end
 When a value is added, it is serialized\*[*] and written in a dedicated file.
 .FOOTNOTE1
 Serialization is currently in JSON.
-CBOR is a work-in-progress.
+.[
+JSON
+.]
+CBOR
+.[
+CBOR
+.]
+is a work-in-progress.
 Nothing binds DODB to a particular format.
 .FOOTNOTE2
 The key of the hash is a number, auto-incremented, used as the name of the stored file.
@@ -896,16 +903,18 @@ Three possible indexes exist in DODB:
 The scenario is simple: adding values to a database with indexes (basic, partitions and tags)
 then query 100 times a value based on the different indexes.
 Loop and repeat.
-Four instances of DODB are tested:
+Five instances of DODB are tested:
 .BULLET \fIuncached database\f[] shows the achievable performance with a strong memory constraint (nothing can be kept in-memory);
-.BULLET \fIuncached data but cached index\f[] shows the improvement you can expect by having a cache on indexes;
-.BULLET \fIcached database\f[] shows the most basic use of DODB\*[*];
+.BULLET \fIuncached database but cached index\f[] shows the improvement you can expect by having a cache on indexes;
+.BULLET \fIcommon database\f[] shows the most basic use of DODB, with a limited cache (100k entries)\*[*];
+.BULLET \fIcached database\f[] represents a database with all the entries in cache (no eviction mechanism);
 .BULLET \fIRAM only\f[], the database doesn't have a representation on disk (no data is written on it).
 The \fIRAM only\f[] instance shows a possible way to use DODB: to keep a consistent API to store data, including in-memory data with a lifetime related to the application's.
 .ENDBULLET
 .FOOTNOTE1
 Having a cached database will probably be the most widespread use of DODB.
 When memory isn't scarce, there is no point not using it to achieve better performance.
+Moreover, the "common database" makes it possible to configure the cache size, so this database remains relevant even when the data-set is bigger than the available memory.
 .FOOTNOTE2
 The computer on which this test is performed\*[*] is a AMD PRO A10-8770E R7 (4 cores), 2.8 GHz.
 When mentioned, the
@@ -955,26 +964,40 @@ This is slightly more (about 200 ns) for Common database since there is a few mo
 In case the value is on the disk, deserialization takes about 15 µs (see \f[CW]Uncached db\f[]).
 The request is a little longer when the index isn't cached (see \f[CW]Uncached db and index\f[]);
 in this case DODB walks the file-system to find the right symlink to follow, thus slowing the process even more, by up to 20%.
-The logarithmic scale version of this figure shows that RAM-only and Cached databases have exactly the same performance.
-The Common database is somewhat slower than these two due to the caching policy: when a value is asked, the Common database puts its key at the start of a list to represent a
+The logarithmic scale version of this figure shows that \fIRAM-only\f[] and \fIcached\f[] databases have exactly the same performance.
+The \fIcommon\f[] database is somewhat slower than these two due to the caching policy: when a value is requested, the \fIcommon\f[] database puts its key at the start of a list to represent a
 .I recent
 use of this data (respectively, the last values in this list are the least recently used entries).
-Thus, Common database takes 80 ns for its caching policy, which makes this database about 67% slower than the previous ones to retrieve a value.
+Thus, the \fIcommon\f[] database spends 80 ns on its caching policy, which makes it about 67% slower than the previous ones to retrieve a value.
 Uncached databases are far away from these results, as shown by the logarithmically scaled figure.
-The data cache improves the duration of the requests, this makes them at least a hundred times faster.
+The data cache reduces the duration of the requests, making them at least 170 times faster.
 The results depend on the data size; the bigger the data, the slower the serialization (and deserialization).
+In this example, the database entries are almost empty; they have very few attributes and not much content (a few dozen characters at most).
+Thus, the performance of non-cached databases will be even more severely impacted with real-world data.
 That is why alternative encodings, such as CBOR,
 .[
 CBOR
 .]
 should be considered for large databases.
-
 .SS Partitions (1 to n relations)
-.LP
+The previous example showed the retrieval of a single value from the database.
+The following shows what happens when thousands of entries are retrieved.
+
+A partition index makes it possible to match a list of entries based on an attribute.
+In the experiment, a database of cars is created along with a partition on their color.
+Performance is analyzed based on the partition size (the number of red cars) and the time required to retrieve all the entries.
+
 .ps -2
 .so graphs/query_partition.grap
 .ps \n[PS]
+.QP
+This figure shows the retrieval of cars based on a partition (their color), with both a linear and a logarithmic scale.
+.QE
+In this example, both the linear and the logarithmic scales are shown to make the differences between the databases easier to grasp.
+The linear scale shows the linearity of the request time for uncached databases.
+In turn, the logarithmically scaled figure does the same for cached databases,
+which are flattened in the linear scale since they are all hundreds of times quicker than the uncached ones.
 .SS Tags (n to n relations)
 .LP
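
Note on the caching policy described in the "common database" paragraph above: the following is a minimal Crystal sketch of an LRU-style cache in which a hit moves the key to the front of a recently-used list and the least recently used key is evicted once the configured cache size is exceeded. The class and method names (LRUCache, get?, put) are hypothetical illustrations, not the actual DODB API.

# Minimal sketch of an LRU-style caching policy (hypothetical names,
# not the DODB API): recently used keys stay at the front of a list,
# the least recently used key is evicted when the cache is full.
class LRUCache(V)
  def initialize(@max_entries : Int32)
    @data = {} of Int32 => V   # key => cached (deserialized) value
    @recent = [] of Int32      # most recently used keys first
  end

  # On a hit, move the key to the front of the recently-used list:
  # this is the small bookkeeping cost mentioned in the text.
  def get?(key : Int32) : V?
    if value = @data[key]?
      @recent.delete key
      @recent.unshift key
      value
    end
  end

  # Insert a value; evict the least recently used entry when the
  # configured cache size is exceeded.
  def put(key : Int32, value : V)
    @data[key] = value
    @recent.delete key
    @recent.unshift key
    if @recent.size > @max_entries
      evicted = @recent.pop
      @data.delete evicted
    end
  end
end

# Usage: a cache limited to two entries, to show the eviction order.
cache = LRUCache(String).new(2)
cache.put(0, "first car")
cache.put(1, "second car")
cache.get?(0)               # key 0 becomes the most recently used
cache.put(2, "third car")   # key 1 (least recently used) is evicted
puts cache.get?(1).nil?     # => true

Under such a policy, a cache hit costs a hash lookup plus the list bookkeeping, which corresponds to the small constant overhead (on the order of 80 ns in the measurements above) that separates the common database from the fully cached one.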