comments + paper

parent c01ec614ae
commit e152bc0ee7

@@ -488,6 +488,56 @@ class are available for
Since indexes do not require nearly as much memory as caching the entire database, they are cached by default.
.
.
.
.SECTION Common database: caching only recently used data
Storing the entire data-set in memory is an effective way to make requests fast, as the
.I "cached database"
presented in the previous section demonstrates.
However, not all data-sets fit in memory, so this approach cannot always be used.
Thus, a tradeoff has to be found to enable fast retrieval of data without requiring too much memory.
Caching only a part of the data-set can already provide a massive speed-up, even in memory-constrained environments.
The most effective caching strategy may differ from one application to another, and providing a generic algorithm that works for all possible constraints is a hazardous endeavor.
However, caching only the most recently requested values is a simple policy which may be efficient in many cases.
This strategy is implemented in
.I "common database"
and this section explains how it works.

Common database implements a simple strategy to keep only relevant values in memory:
caching
.I "recently used"
values.
Any value that is requested or added to the database is considered
.I recent .

.B "How this works" .
|
||||
Each time a value is added in the database, its key is put as the first element of a list.
|
||||
In this list,
|
||||
.B "values are unique" .
|
||||
Adding a value that is already present in the list is considered as "using the value",
|
||||
thus it is moved at the start of the list.
|
||||
In case the number of entries exceeds what is allowed,
|
||||
the least recently used value (the last list entry) is removed,
|
||||
along with its related data from the cache.
|
||||
|
||||
.B "Implementation details" .
|
||||
The implementation is time-efficient;
|
||||
the duration of adding a value is constant, it doesn't change with the number of entries.
|
||||
This efficiency is a memory tradeoff.
|
||||
All the entries are added to a
|
||||
.B "double-linked list"
|
||||
(to keep track of the order of the added keys) and to a
|
||||
.B hash
|
||||
to perform efficient searches of the keys in the list.
|
||||
Thus, all the nodes are added twice, once in the list, once in the hash.
|
||||
This way, adding, removing and searching for an entry in the list is fast,
|
||||
no matter the size of the list.
|
||||
|
||||
Moreover,
.I "common database"
allows adjusting the number of stored entries.
.
.
.SECTION RAM-only database for short-lived data
Databases are built around the objective to actually
.I store

@@ -911,13 +961,6 @@ Caching the value enables a massive performance gain, data can be retrieved seve
.SECTION Future work
This section presents all the features I want to see in a future version of the DODB library.
.
.SS Cached database and indexes with selective memory
Right now, both cached database and cached indexes will store any cached value indefinitely.
Giving the cache the ability to select the values to keep in memory would enable a massive speed-up even in memory-constrained environments.
The policy could be as simple as keeping in memory only the most recently requested values.

These new versions of cached database and indexes will become the standard, default DODB behavior.
.
.SS Pagination via the indexes: offset and limit
Right now, browsing the entire database by requesting a limited number of entries at a time is possible, thanks to some functions accepting an
.I offset

@@ -12,6 +12,7 @@ require "./db-cars.cr"
# ENV["REPORT_DIR"] rescue "results"
# ENV["NBRUN"] rescue 100
# ENV["MAXINDEXES"] rescue 5_000
# ENV["FIFO_SIZE"] rescue 10_000

class Context
  class_property report_dir = "results"

@@ -20,6 +21,7 @@ class Context
  class_property from = 1_000
  class_property to = 50_000
  class_property incr = 1_000
  class_property fifo_size = 10_000
end

# To simplify the creation of graphs, it's better to have fake data for
@@ -101,7 +103,7 @@ end
def bench_searches()
  cars_ram = SPECDB::RAMOnly(Car).new
  cars_cached = SPECDB::Cached(Car).new
  cars_fifo = SPECDB::FIFO(Car).new "", 5000 # With only 5_000 entries
  cars_fifo = SPECDB::Common(Car).new "-#{Context.fifo_size}", Context.fifo_size
  cars_semi = SPECDB::Uncached(Car).new "-semi"
  cars_uncached = SPECDB::Uncached(Car).new

@@ -134,7 +136,7 @@ end
def bench_add()
  cars_ram = SPECDB::RAMOnly(Car).new
  cars_cached = SPECDB::Cached(Car).new
  cars_fifo = SPECDB::FIFO(Car).new "", 5_000
  cars_fifo = SPECDB::Common(Car).new "-#{Context.fifo_size}", Context.fifo_size
  cars_semi = SPECDB::Uncached(Car).new "-semi"
  cars_uncached = SPECDB::Uncached(Car).new

@@ -166,9 +168,9 @@ def bench_add()
end

def bench_50_shades_of_fifo()
  cars_fifo1 = SPECDB::FIFO(Car).new "", 1_000
  cars_fifo5 = SPECDB::FIFO(Car).new "", 5_000
  cars_fifo10 = SPECDB::FIFO(Car).new "", 10_000
  cars_fifo1 = SPECDB::Common(Car).new "-1k", 1_000
  cars_fifo5 = SPECDB::Common(Car).new "-5k", 5_000
  cars_fifo10 = SPECDB::Common(Car).new "-10k", 10_000

  fifo_Sby_name1, fifo_Sby_color1, fifo_Sby_keywords1 = cached_indexes cars_fifo1
  fifo_Sby_name5, fifo_Sby_color5, fifo_Sby_keywords5 = cached_indexes cars_fifo5

@@ -189,6 +191,7 @@ ENV["NBRUN"]?.try { |it| Context.nb_run = it.to_i }
ENV["DBSIZE"]?.try { |it| Context.to = it.to_i }
ENV["DBSIZE_START"]?.try { |it| Context.from = it.to_i }
ENV["DBSIZE_INCREMENT"]?.try { |it| Context.incr = it.to_i }
ENV["FIFO_SIZE"]?.try { |it| Context.fifo_size = it.to_i }

pp! Context.nb_run
pp! Context.from

@@ -1,8 +1,8 @@
# Common database: only recently requested entries are kept in memory.
# Common database: only **recently added or requested** entries are kept in memory.
#
# Least recently used entries may be removed from the cache in order to keep the amount of memory used reasonable.
#
# The number of entries to keep in memory is configurable.
# The number of entries to keep in memory is **configurable**.
#
# This database is relevant for high-demand applications,
# which means both a high number of entries (data cannot fit entirely in RAM),
@@ -33,9 +33,9 @@
#
# NOTE: fast for frequently requested data and requires a stable (and configurable) amount of memory.
class DODB::Storage::Common(V) < DODB::Storage::Cached(V)
  # The *fifo* a simple FIFO instance where the key of the requested data is pushed.
  # The *fifo* is an `EfficientFIFO` instance where the key of the requested data is pushed.
  # In case the number of stored entries exceeds what is allowed, the least recently used entry is removed.
  property fifo : FIFO(Int32)
  property fifo : EfficientFIFO(Int32)

  # Initializes the `DODB::Storage::Common` database with a maximum number of entries in the cache.
  def initialize(@directory_name : String, max_entries : UInt32)

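For context, a minimal usage sketch of the class documented above. Only the constructor signature comes from this diff; the `dodb` require path, the `Car` value type, and the `<<`/`[]?` accessors are assumptions borrowed from the other DODB storage classes.

```crystal
require "json"
require "dodb" # assumed shard entry point

# Illustrative value type, assumed JSON-serializable as in the DODB examples.
struct Car
  include JSON::Serializable
  property name : String

  def initialize(@name : String)
  end
end

# Persist values on disk under "db-cars", but keep at most
# 10_000 recently used entries cached in memory.
cars = DODB::Storage::Common(Car).new "db-cars", 10_000_u32

cars << Car.new("Corvet") # adding a value marks it as recently used
pp! cars[0]?              # requesting a value refreshes it in the cache
```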
src/fifo.cr
@@ -1,12 +1,14 @@
require "./list.cr"

# This class enables to keep track of used data.
# This class is a simpler implementation of `EfficientFIFO`, used to implement an eviction policy for the data cache
# of `DODB::Storage::Common`.
# It enables keeping track of recently used data.
#
# Each time a value is added, it is put in a FIFO structure.
# Adding a value several times is considered as "using the value",
# so it is pushed back at the entry of the FIFO (as a new value).
# In case the number of entries exceeds what is allowed,
# the least recently used value is removed.
# **How this works**.
# Each time a value is added to the database, its key is put in this "FIFO" structure.
# In this structure, **values are unique** and adding a value several times is considered as "using the value",
# so it is pushed back to the front of the FIFO structure, as a new value.
# In case the number of entries exceeds what is allowed, the least recently used value is removed.
# ```
# fifo = FIFO(Int32).new 3 # Only 3 allowed entries.
#
@@ -20,7 +22,8 @@ require "./list.cr"
# ```
#
# The number of entries in the FIFO structure is configurable.
# WARNING: this implementation becomes slow very fast, but doesn't cost much memory.
# WARNING: this implementation becomes slow as the number of entries grows (O(n) complexity), but doesn't cost much memory.
# WARNING: this *FIFO* class doesn't allow the same value multiple times.
class FIFO(V)
  # This array is used as the *fifo structure*.
  property data : Array(V)
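To illustrate why this array-backed version is O(n), here is a hedged sketch of what its push operation could look like. It is a reconstruction for illustration only: the real method body is not part of this diff, and `@max_entries` is an assumed instance variable.

```crystal
class FIFO(V)
  # Sketch only: push a value to the front and return the evicted value on
  # overflow (mirroring the documented `pp! fifo << 5 # -> 3` example), else nil.
  def <<(value : V) : V?
    @data.delete value    # O(n): drop the value if already present ("using" it)
    @data.unshift value   # O(n): put it back at the front as the most recent entry
    if @data.size > @max_entries
      @data.pop           # evict the least recently used value (the last element)
    end
  end
end
```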
@@ -47,19 +50,14 @@ class FIFO(V)
  end
end

# This class enables to keep track of used data.
# This class is used to implement a cache policy for `DODB::Storage::Common`.
# It enables keeping track of recently used data.
#
# **Implementation details.**
# Contrary to the `FIFO` class, this implementation is time-efficient.
# However, this efficiency is a memory tradeoff: all the entries are added to a double-linked list to keep
# track of the order **and** to a hash to perform efficient searches of the values in the double-linked list.
# Thus, all the nodes are added twice, once in the list, once in the hash.
#
# Each time a value is added, it is put in a FIFO structure.
# Adding a value several times is considered as "using the value",
# so it is pushed back at the entry of the FIFO (as a new value).
# In case the number of entries exceeds what is allowed,
# the least recently used value is removed.
# **How this works**.
# Each time a value is added to the database, its key is put in this "FIFO" structure.
# In this structure, **values are unique** and adding a value several times is considered as "using the value",
# so it is pushed back to the front of the FIFO structure, as a new value.
# In case the number of entries exceeds what is allowed, the least recently used value is removed.
# ```
# fifo = EfficientFIFO(Int32).new 3 # Only 3 allowed entries.
#
@@ -72,10 +70,17 @@ end
# pp! fifo << 5 # -> 3 (least recently used data)
# ```
#
# **Implementation details.**
# Contrary to the `FIFO` class, this implementation is time-efficient.
# However, this efficiency is a memory tradeoff: all the entries are added to a double-linked list to keep
# track of the order **and** to a hash to perform efficient searches of the values in the double-linked list.
# Thus, all the nodes are added twice, once in the list, once in the hash.
#
# The number of entries in the FIFO structure is configurable.
# NOTE: this implementation is time-efficient, but costs some memory.
class EfficientFIFO(V)
  # This array is used as the *fifo structure*.
  # Both this list and the hash are used as the *fifo structures*.
  # The list preserves the *order* of the entries while the *hash* enables fast retrieval of entries in the list.
  property list : DoubleLinkedList(V)
  property hash : Hash(V, DoubleLinkedList::Node(V))
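To make the list-plus-hash bookkeeping concrete, here is a hedged sketch of what the push operation could look like given the two properties above. The method names on `DoubleLinkedList` (`unshift`, `delete`, `pop`, `size`) and the `@max_entries` variable are assumptions; the actual implementation is not part of this diff.

```crystal
class EfficientFIFO(V)
  # Sketch only: O(1) push thanks to the hash lookup into the linked list.
  # Returns the evicted value when the structure overflows, else nil.
  def <<(value : V) : V?
    if node = @hash[value]?
      @list.delete node                 # value already present: unlink its node
    end
    @hash[value] = @list.unshift value  # (re)insert at the front, remember its node

    return nil if @list.size <= @max_entries
    evicted = @list.pop                 # drop the least recently used value (the tail)
    @hash.delete evicted
    evicted
  end
end
```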