diff --git a/paper/paper.ms b/paper/paper.ms index b017e8d..09265d8 100644 --- a/paper/paper.ms +++ b/paper/paper.ms @@ -167,18 +167,28 @@ the feature. Even with that motto, the tool still is expected to be convenient for most applications. .FOOTNOTE2 -This document will provide an extensive documentation on how DODB works and how to use it. -The presented code is in Crystal such as the DODB library for now, but keep in mind that this document is all about the method more that the actual implementation, anyone could implement the exact same library in almost every other language. -Limitations are also clearly stated in a dedicated section. -A few experiments are described to provide an overview of the performance you can expect from this approach. -Finally, a conclusion is drawn based on a real-world usage of this library. +Section 2 provides an extensive documentation on how DODB works and how to use it. +This section also presents the concept of "triggers" (automatic actions on database modification). +Section 3 introduces caches in both the database and triggers. +Section 4 presents the Common database, an implementation of DODB that should be relevant for most applications. +Section 5 presents the RAM-only database, for short-lived (temporary) data. +Section 6 is about memory-constrained environments. +Section 7 presents a few experiments to provide an overview of the performance you can expect from this approach. +Section 8 describes the limitations of DODB and its current implementation. +Section 9 presents the related work, alternative approaches and implementations. +Section 10 lays out future work on this project. +Section 11 presents a real-world usage of DODB. +Finally, section 12 provides a conclusion. . .SECTION How DODB works and basic usage DODB is a hash table. The key of the hash is an auto-incremented number and the value is the stored data. 
The following section will explain how to use DODB for basic cases including the few added mechanisms to speed-up searches. Also, the file-system representation of the data will be presented since it enables easy off-application searches. -. + +The presented code is in Crystal, like the DODB library itself. +Keep in mind that this document is about the method more than the current implementation. +Anyone could implement the exact same library in almost every other language. . .SS Before starting: the example database First things first, the following code is the structure used in the rest of the document to present the different aspects of DODB. @@ -193,7 +203,6 @@ class Car end .SOURCE . -. .SS DODB basic usage Let's create a DODB database for our cars. .SOURCE Ruby ps=9 vs=10 @@ -482,13 +491,18 @@ Also, this can be as easily hidden in a very nice user-friendly command. . .SSS Side note about triggers DODB presents a few possible triggers (basic indexes, partitions and tags) which respond to an obvious need for fast searches and retrevial. -Though, their implementation via the creation of symlinks is the result of a certain vision about how a database should behave in order to provide a practical way for users to play with the entries outside the application. -The implementation can be completely changed. +Though, the implementation involving a heavy use of the file-system via the creation of symlinks comes from a certain vision about how a database could behave to provide a practical way for users to query the database +.UL "outside the application" . Other kinds of triggers could .B easily be implemented in addition of those presented. -The new triggers may have completely different objectives than providing a file-system representation of the data. +These new triggers may have completely different objectives\*[*], methods and performance. 
+.FOOTNOTE1 +Providing a file-system representation of the data is a fun experiment; +sysadmins can have a playful relationship with the database thanks to an unconventional representation of the data. +On the other hand, new triggers could seek to improve performance by any means necessary, including the gazillion ways which already exist. +.FOOTNOTE2 The following sections will precisely cover this aspect. . . @@ -512,7 +526,7 @@ A cached database has the same API as the other DODB databases and keeps a copy # Create a cached database database = DODB::Storage::Cached(Car).new "path/to/db-cars" .SOURCE -All operations of the +All the operations of the .CLASS Storage::Uncached class are available for .CLASS Storage::Cached . @@ -540,34 +554,37 @@ This strategy is implemented in the .CLASS DODB::Storage::Common database and this section will explain how it works. -Common database implements a simple strategy to keep only relevant values in memory: -caching +The Common database implements a +.I "Least Recently Used" +(LRU) cache eviction policy. +The strategy is simple, keeping only the most .I "recently used" -values. -Any value that is requested or added to the database is considered +values in memory. +Added, requested or modified values are considered .I recent . +When a new value is added to the cache and the number of entries exceeds the cache size, the least recently used value is evicted from the cache, along with its related data. -.B "How this works" . -Each time a value is added in the database, its key is put as the first element of a list. -In this list, -.B "values are unique" . -Adding a value that is already present in the list is considered as +.B "The Least Recently Used algorithm" . +Each time a value is added to the database, its key is put as the first element of a +.I set +structure (each value is unique). +This set is ordered, the first element being the most recently used. 
+Adding a value that is already present in the set is considered as .I "using the value" , -thus it is moved at the start of the list. -In case the number of entries exceeds what is allowed, -the least recently used value (the last list entry) is removed, -along with its related data from the cache. +thus it is moved to the start of the set. +When the number of entries exceeds what is allowed, the value to evict (the least recently used) is therefore the last element of the set. .B "Implementation details" . +The LRU strategy is simple and can be implemented efficiently with a double-linked list and a hash table. The implementation is time-efficient; -the duration of adding a value is almost constant, it doesn't change much with the number of entries. +the time spent adding a value is almost constant: it doesn't change much with the number of entries. This efficiency is a memory tradeoff. All the entries are added to a .B "double-linked list" (to keep track of the order of the added keys) .UL and to a -.B hash +.B "hash table" to perform efficient searches of the keys in the list. Thus, all the nodes are added twice, once in the list, once in the hash. This way, adding, removing and searching for an entry in the list is fast, @@ -592,27 +609,34 @@ Databases are built around the objective to actually .I store data. But sometimes the data has only the same lifetime as the application. -Stop the application and the data itself become irrelevant, which happens in several occasions, for instance when the application keeps track of the connected users. -This case is not covered by traditional databases; this is out-of-scope, short-lived data only is handled within the application. -Yet, since DODB is a library and not a separate application (read: DODB is incredibly faster), this usage of the database can be relevant. -Having the same API to handle both long and short-lived data can be useful. 
-Moreover, the previously mentioned indexes (basic indexes, partitions and tags) would also work the same way for these short-lived data. -Of course, in this case, the file-system representation may be completely irrelevant. -And for all these reasons, the -.I RAM-only -DODB database and -.I RAM-only -indexes were created. +Stop the application and the data becomes irrelevant. +This happens on several occasions, for example when the application keeps track of the connected users. +This case is not covered by traditional databases; this is out of scope: short-lived data is handled only +.UL within +the application. -Let's recap the advantages of the RAM-only DODB database. +Since DODB is a library and not a separate application, providing a way to handle this usage of the database can be relevant. +Having the same API to handle both long-lived and short-lived data can be useful. +Moreover, the previously mentioned triggers (basic indexes, partitions and tags) would also work the same way for this short-lived data. +Of course, in this case, the file-system representation may be completely irrelevant. +Therefore, the +.I RAM-only +database and the +.I RAM-only +triggers were created. + +Let's recap the advantages of the RAM-only database. The DODB API is the same for short-lived (read: temporary) and long-lived data. -This includes the same indexes too, so a file-system representation of the current state of the application is possible. -RAM-only also means incredible performances since DODB only is a +This includes the same triggers too, so a file-system representation of the current state of the application is possible. +.I RAM-only +also means incredible performance since DODB is only a .I very small layer over a hash table. +. +. .SS RAM-only database -Instanciate a RAM-only database is as simple as the other options. -Moreover, this database has exactly the same API as the others, thus changing from one to another is painless. 
+To create a RAM-only database is as simple as the other options since the API is identical to other DODB databases. +Thus, changing from one to another is painless. .QP .SOURCE Ruby ps=9 vs=10 # RAM-only database creation @@ -624,8 +648,10 @@ Yes, the path still is required which may be seen as a quirk but the rationale\* A path is still required despite the database being only in memory because indexes can still be instanciated for the database, and those indexes will require this directory. Also, I worked enough already, leave me alone. .FOOTNOTE2 -.SS RAM-only indexes -Indexes have their RAM-only version. +. +. +.SS RAM-only triggers +Triggers have their RAM-only version. .QP .SOURCE Ruby ps=9 vs=10 # RAM-only basic indexes. @@ -668,11 +694,11 @@ database = DODB::Storage::Uncached(Car).new "path/to/db-cars" .SOURCE .QE -.B "Uncached indexes" . -Cached indexes do not require a large amount of memory since the only stored data is an integer (the +.B "Uncached triggers" . +Caching an index shouldn't require a large amount of memory since the only stored data is an integer (the .I key -of the data). -For that reason, indexes are cached by default. +of the data) and a string, which is also arguably true for partitions and tags (setting aside exceptions). +For that reason, these triggers are cached by default. But for highly memory-constrained environments, the cache can be removed. .QP .SOURCE Ruby ps=9 vs=10 @@ -691,69 +717,6 @@ is exactly the same as the others. .QE . . -.SECTION Limits of DODB -DODB provides basic database operations such as storing, searching, modifying and removing data. -Though, SQL databases have a few -.I properties -enabling a more standardized behavior and may create some expectations towards databases from a general public standpoint. -These properties are called "ACID": atomicity, consistency, isolation and durability. -DODB doesn't fully handle ACID properties. - -DODB doesn't provide -.I atomicity . 
-Instructions cannot be chained and rollback if one of them fails. - -DODB doesn't handle -.I consistency . -There is currently no mechanism to prevent adding invalid values. - -.I Isolation -is partially taken into account with a locking mechanism preventing race conditions. -Though, parallelism is mostly required to respond to a large number of clients at the same time. -Also, SQL databases require a communication with an inherent latency between the application and the database, slowing down the requests despite the fast algorithms to search for a value within the database. -Parallelism is required for SQL databases because of this latency (at least partially), which doesn't exist with DODB\*[*]. -.FOOTNOTE1 -FYI, the service -.I netlib.re -uses DODB and since the database is fast enough, parallelism isn't required despite enabling more than a thousand requests per second. -.FOOTNOTE2 -With a cache, data is retrieved five hundred times quicker than with a SQL database. -Thus, parallelism is probably not needed but a locking mechanism is provided anyway, just in case; this may be overly simplistic but -.SHINE "good enough" -for most applications. - -.I Durability -is taken into account. -Data is written on disk each time it changes. -Again, this is basic but -.SHINE "good enough" -for most applications. - -.B "Discussion on ACID properties" . -The author of this document sees these database properties as a sort of "fail-safe". -Always nice to have, but not entirely necessary; at least not for every single application. -DODB will provide some form of atomicity and consistency at some point, but nothing fancy nor too advanced. -The whole point of the DODB project is to keep the code simple (almost -.B "stupidly" -simple). -Not handling these properties isn't a limitation of the DODB approach but a choice for this project\*[*]. -.FOOTNOTE1 -Which results from a lack of time, mostly. 
-.FOOTNOTE2 - -Not handling all the ACID properties within the DODB library doesn't mean they cannot be achieved. -Applications can have these properties, often with just a few lines of code. -They just don't come -.I "by default" -with the library\*[*]. -.FOOTNOTE1 -As a side note, the -.I consistency -property is often taken care of within the application despite being handled by the database, for various reasons. -.FOOTNOTE2 -. -. -. .SECTION Experimental scenario .LP The following experiment shows the performance of DODB based on querying durations. @@ -993,6 +956,78 @@ As a side note, let's keep in mind that requesting several thousand entries in D with SQL (varies from 0.1 to 2 ms on my machine for a single value without a search, just the first available entry). This should help put things into perspective. . +.SECTION Limits of DODB +DODB provides basic database operations such as storing, retrieving, modifying and removing data. +However, DODB doesn't fully handle ACID properties\*[*]: atomicity, consistency, isolation and durability. +This section presents the limits of +.UL "the current implementation" +of DODB. +.FOOTNOTE1 +Traditional SQL databases handle ACID properties and may have created some "expectations" towards databases from a general public standpoint. +.FOOTNOTE2 + +.STARTBULLET +.BULLET +.B Atomicity +isn't handled in DODB. +Instructions cannot be chained and rolled back if one of them fails. + +.BULLET +.B Consistency +isn't handled in DODB. +No mechanism prevents invalid values from being added. + +.BULLET +.B Isolation +is partially taken into account with a locking mechanism preventing race conditions when modifying a value. + +This property is inherently related to parallelism, which is mostly required to respond to a large number of clients at the same time. 
+SQL databases require a communication with an inherent latency between the application and the database, slowing down the requests despite the fast algorithms to search for a value within the database. +Parallelism is required for SQL databases because of this latency (at least partially), which doesn't exist with DODB\*[*]. +.FOOTNOTE1 +FYI, the service +.I netlib.re +uses DODB and since the database is fast enough, parallelism isn't required despite enabling more than a thousand requests per second. +.FOOTNOTE2 +With a cache, data is retrieved five hundred times quicker than with a SQL database. +Thus, parallelism is probably not needed but a locking mechanism is provided anyway, just in case; this may be overly simplistic but +.SHINE "good enough" +for most applications. + +.BULLET +.B Durability +is taken into account. +Data is written on disk each time it changes. +Again, this is basic but +.SHINE "good enough" +for most applications. +.ENDBULLET + +.B "Discussion on ACID properties" . +The author of this document sees these database properties as a sort of "fail-safe". +Always nice to have, but not entirely necessary; at least not for every single application. +DODB will provide some form of atomicity and consistency at some point, but nothing fancy nor too advanced. +The whole point of the DODB project is to keep the code simple (almost +.B "stupidly" +simple). +Not handling these properties isn't a limitation of the DODB approach but a choice for this project\*[*]. +.FOOTNOTE1 +Which results from a lack of time, mostly. +.FOOTNOTE2 + +Not handling all the ACID properties within the DODB library doesn't mean they cannot be achieved. +Applications can have these properties, often with just a few lines of code. +They just don't come +.I "by default" +with the library\*[*]. +.FOOTNOTE1 +As a side note, the +.I consistency +property is often taken care of within the application despite being handled by the database, for various reasons. +.FOOTNOTE2 +. +. +. 
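The claim that applications can get these properties "with just a few lines of code" can be illustrated with a minimal sketch of application-level atomicity. It is written in plain Ruby, and everything in it is an assumption made for the illustration: the helper name atomically is hypothetical, a plain Hash stands in for the database, and none of it is part of the DODB API. The idea is to take a snapshot before a group of instructions and to restore it if any of them fails.

```ruby
# Hypothetical helper, not part of DODB: `store` is a plain Hash
# standing in for a database.
def atomically(store)
  snapshot = store.dup # shallow copy taken before the changes
  done = false
  begin
    yield store
    done = true
  ensure
    unless done
      # One instruction failed: put back the content saved before the block.
      store.clear
      store.merge!(snapshot)
    end
  end
end

# Made-up entries, for illustration only.
cars = { 0 => "car A", 1 => "car B" }

begin
  atomically(cars) do |db|
    db[2] = "car C"                   # first instruction succeeds
    raise "second instruction failed" # second instruction fails
  end
rescue RuntimeError
  # The failure is still reported to the caller, after the rollback.
end
```

After the failed block, cars holds its two original entries again. The copy is shallow, so values modified in place would not be restored; a real application would have to copy the values too, or record the changes in a log.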
.SECTION Alternatives Other approaches have been used to store data over the years, including but not limited to SQL and key-value stores. This section briefly presents some of them and their difference from DODB.
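Looking back at the Common database hunk above, the LRU policy it describes can be sketched in a few lines of plain Ruby. This is only an illustration under stated assumptions: the class name LRUSketch and its methods are hypothetical and are not DODB's implementation. Ruby's insertion-ordered Hash is swapped in for the double-linked list plus hash table described in the text, so the most recently used key sits at the end of the structure rather than at the start.

```ruby
# Hypothetical illustration of the LRU policy; not DODB's actual code.
class LRUSketch
  def initialize(max_entries)
    @max_entries = max_entries
    @entries = {} # insertion-ordered: least recent first, most recent last
  end

  # Adding or re-adding a key counts as "using" it.
  # Returns the evicted key, or nil when nothing was evicted.
  def touch(key, value)
    @entries.delete(key)  # keys are unique: remove any older position
    @entries[key] = value # reinsert as the most recently used entry
    return nil if @entries.size <= @max_entries

    evicted, _value = @entries.first # least recently used entry
    @entries.delete(evicted)
    evicted
  end

  # Keys ordered from least to most recently used.
  def keys
    @entries.keys
  end
end

cache = LRUSketch.new(2)
cache.touch(1, "a")
cache.touch(2, "b")
cache.touch(1, "a")           # key 1 becomes the most recent
evicted = cache.touch(3, "c") # cache is full: key 2 is evicted
```

Here the Hash gives both the order and the fast lookup at once, which is why the list and the hash table of the described implementation collapse into one structure in this sketch.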