From c04104dce106b9350cd06b515dc7891630cbd13e Mon Sep 17 00:00:00 2001 From: Philippe Pittoli Date: Sat, 8 Feb 2025 06:18:35 +0100 Subject: [PATCH] Introduction++ --- paper/paper.ms | 198 ++++++++++++++++++++++++++++--------------------- 1 file changed, 114 insertions(+), 84 deletions(-) diff --git a/paper/paper.ms b/paper/paper.ms index e35ff33..878d001 100644 --- a/paper/paper.ms +++ b/paper/paper.ms @@ -81,9 +81,9 @@ Although fairly advanced, this document lacks a few reviews, a bit of discussion A database consists in managing data, enabling queries to add, to retrieve, to modify and to delete a piece of information. These actions are grouped under the acronym CRUD: creation, retrieval, update and deletion. CRUD operations are the foundation for the most basic databases. -Yet, almost every single database engine goes far beyond this minimalistic set of features. +Of course, almost every single database engine goes far beyond this minimalistic set of features. -Although everyone using the filesystem of their computer as some sort of database (based on previous definition) by storing raw data (files) in a hierarchical manner (directories), computer science classes introduce a particularly convoluted way of managing data. +Although everyone is using the filesystem of their computer as some sort of database (based on previous definition) by storing raw data (files) in a hierarchical manner (directories), computer science classes introduce a particularly convoluted way of managing data. Universities all around the world teach about Structured Query Language (SQL) and relational databases. These two concepts are closely interlinked and require a brief explanation. @@ -144,12 +144,13 @@ And so on. For many reasons, SQL is not a silver bullet to .I solve the database problem. -The encountered difficulties mentioned above and the original objectives of SQL not being universal\*[*], other database designs were created\*[*]. 
+Given the difficulties mentioned above, and since the original objectives of SQL are not universal\*[*], .FOOTNOTE1 To say the least! Not everyone needs to let users access the database without going through the application. For instance, writing a \f[I]blog\f[] for a small event or to share small stories about your life doesn't require manual operations on the database, fortunately. .FOOTNOTE2 +other database designs were created\*[*]. .FOOTNOTE1 A lot of designs won't be mentioned here. The actual history of databases is often quite unclear since the categories of databases are sometimes vague, underspecified. @@ -171,22 +172,45 @@ And that's exactly what is being done in Document Oriented DataBase (DODB). .UL "The stated goal of DODB" is to provide a simple and easy-to-use -.UL library -for developers to perform CRUD operations on documents (undescribed data structures). -DODB aims basic to medium-sized projects, up to a few million entries\*[*]. +.UL library \*[*] +for developers to store documents (undescribed data structures). .FOOTNOTE1 +Or as people might call it: +.dq "serverless architecture" . +.FOOTNOTE2 + +.STARTBULLET +.KS +.BULLET +.B Simple , +because the approach is indeed trivial: the database entries are written as simple files in a directory. +This simplicity has a snowballing effect: it only requires a few dozen lines of code. +DODB is implemented in only a thousand lines of code in total, despite including optional features and optimized alternative implementations to make the library efficient and cover most cases. +.KE + +DODB doesn't strive to be minimalistic, but it avoids any intermediary language and low-level optimizations. +Storing data is writing a file. +Indexing data is making symbolic links. +It is that simple. + +.KS +.BULLET +.B Easy-to-use , +because the API is high-level and doesn't take any superfluous parameter. +Creating a database only requires a path; updating an entry only requires the new version of the entry, and so on. 
+Everything is designed to be enjoyable for the developers. +.KE +.ENDBULLET + +DODB aims for small and medium-sized projects\*[*], up to a few hundred million entries with commodity hardware. +.FOOTNOTE1 +There are no real hard limits other than those of the underlying filesystem; DODB can accept billions of entries. +.br See the section .dq "Limits of DODB" . .FOOTNOTE2 -Code simplicity implies hackability. -Traditional SQL relational databases have a snowballing effect on code complexity, including for applications with basic requirements. -However, DODB may be a great starting point to implement more sophisticated features for creative minds. - -.UL "The non-goals of DODB" -are: -.STARTBULLET -.BULLET to provide a generic library w -.ENDBULLET +Its simplicity (approach and code) makes any modification for specific needs trivial. +DODB may be a great starting point to implement more sophisticated features for creative minds. .UL "Contrary to SQL" , DODB has a very narrow scope: to provide a library enabling to store, to retrieve, to modify and to delete data. @@ -195,8 +219,11 @@ DODB doesn't provide an interactive shell, there is no request language to perfo Instead, DODB reduces the complexity of the infrastructure, stores data in plain files and enables simple manual scripting with widespread unix tools. Simplicity is key. +Traditional SQL relational databases have a snowballing effect on code complexity, even for applications with basic requirements. +Furthermore, describing data in tables and relations is not intuitive, contrary to storing whole documents, which simply amounts to serializing the structures used in the code. + .UL "Contrary to other NoSQL databases" , -DODB doesn't provide an application but a library, nothing else. +DODB isn't an application but a library. The idea is to help developers to store their data themselves, not depending on . I yet-another-all-in-one massive tool. 
@@ -205,7 +232,7 @@ The library writes (and removes) data on a storage device, has a few retrieval a The lack of features .I is the feature. -Even with that motto, the tool still is expected to be convenient for most applications. +Yet, the tool is expected to be convenient for most applications. .FOOTNOTE2 Section 2 provides an extensive documentation on how DODB works and how to use it. @@ -224,8 +251,8 @@ Finally, section 12 provides a conclusion. .SECTION How DODB works and basic usage DODB is a hash table. The key of the hash is an auto-incremented number and the value is the stored data. -The following section will explain how to use DODB for basic cases including the few added mechanisms to speed-up searches. -Also, the filesystem representation of the data will be presented since it enables easy off-application searches. +This section explains how to use DODB for basic cases, including the few added mechanisms to speed up searches. +Also, the filesystem representation of the data is presented since it enables easy off-application searches. The presented code is in Crystal such as the DODB library. Keep in mind that this document is all about the method more than the current implementation. @@ -1235,6 +1262,72 @@ In case this space isn't used for metadata, some filesystems enables to use it f .FOOTNOTE2 . .KS +.SSS "Exotic filesystems" +Filesystems have been developed over the years for various reasons. +Let's browse for a moment to provide an overview of what is possible. +.KE + +.B Kernel-related . +A whole class of filesystems is dedicated to providing an interface to the kernel, such as +.I procfs +(information about running processes), +.I sysfs +(to tweak a few device parameters) or even +.I debugfs +(to provide debug info from the kernel to user-space). +Providing information about the running system and enabling its modification through simple files and directories is a direct +.dq "everything is a file" +UNIX legacy. 
+Data cannot be freely written: files are directly related to specific structures which only accept a finite set of possible values; consistency is preserved with verifications written in the drivers. + +.B "Network-related" . +Many filesystems were designed specifically to be remotely mounted, either to be shared amongst many people in a company, or to be part of a giant cluster to provide a high-availability storage solution for tech giants with peculiar requirements or just to stack ever more commodity computers together and provide a gigantic storage space. +Filesystems can also be distributed with some replication in order to provide fault-tolerant storage with ordinary computers sharing unused space. + +.KS +.B "UnionFS" . +UnionFS (and its variants) is a filesystem enabling several filesystems to be mounted on the same mount-point and to show superposed contents, enabling a read-only base image to be used together with persistent data for a specific instance. +This way, a +.dq "live-cd image" +for an operating system can become persistent by storing modifications on a USB stick. +.KE + +UnionFS is a copy-on-write snapshotting filesystem on top of other filesystems. +Docker relies on this kind of union filesystem to save space. +Docker provides ready-to-run software as lightweight containers, which are used much like small virtual machines. +To preserve storage space, a base OS image is shared amongst all instances and each instance only stores its own specific files (binaries, configuration and dependencies) written in a separate storage volume. + +.KS +.B "Archivemount" . +Archivemount mounts a compressed archive as a filesystem, so day-to-day tools can search for a file in the archive without the need to uncompress it. +.KE + +.KS +.B "RAM-based filesystems" . +For temporary data, intensive read and write operations on a small storage volume or for filesystem development, a chunk of the computer memory can be used as a filesystem thanks to +.B tmpfs +and variants\*[*] (ramdisk and ramfs). 
+.KE +.FOOTNOTE1 +.B ramdisk +creates a block device backed by a chunk of RAM that needs to be formatted then mounted like any partition. +.B ramfs +directly mounts a RAM-based filesystem, without the need to format a fake partition. +Finally, +.B tmpfs +is the most flexible one: it is used like ramfs but can be resized and only uses the amount of RAM needed at a given point (memory is freed once a file is removed). +.FOOTNOTE2 + +.KS +.B "Semantic (tag-based) filesystems" . +Some filesystems (such as tagsistant) store data based on tags for each file, which enables indexing a file based on many attributes rather than a single path. +As a side effect, searching for a file in this context can be done by computing the intersection of different tags\*[*]. +.KE +.FOOTNOTE1 +Well well well… doesn't that sound like the DODB tag triggers? +As if databases and filesystems were intertwined somehow… +.FOOTNOTE2 +.KS .SSS "Quick comparison between DBMSs and filesystems" The following table shows the proximity between famous database systems and ordinary filesystems, both sharing a lot of features despite very different approaches. .ds OK \[OK] @@ -1256,7 +1349,7 @@ CRUD operations : SQL :files & directories Atomicity : \*[OK] :T{ locking mechanism based on files T} -Consistency : \*[OK] : \*[NOK] +Consistency : \*[OK] : \*[NOK] except for very specific filesystems Isolation : \*[OK] :T{ .dq "new file then mv" technique\*[*] @@ -1297,67 +1390,6 @@ The main difference between DBMSs and filesystems is the property. Filesystems are almost exclusively built to store undefined streams of data with a very wide range of different shapes (plain text, multimedia, documents, etc.) and sizes (from empty to multiple terabytes and more), thus no consistency verification can be reasonably implemented. . -.KS -.SSS "Exotic filesystems" -Filesystems have been developed over the years for various reasons. -Let's browse for a moment to provide an overview of what is possible. - -.B Kernel-related . 
-A whole class of filesystems is dedicated to provide an interface to the kernel, such as -.I procfs -(information about running processes), -.I sysfs -(to tweak a few device parameters) or even -.I debugfs -(to provide debug info from the kernel to user-space). -Providing information about the running system and enabling its modification through simple files and directories is a direct -.dq "everything is a file" -UNIX legacy. - -.B "Network-related" . -Many filesystems were designed specifically to be remotely mounted, either to be shared amongst many people in a company, or to be part of a giant cluster to provide a high-availability storage solution for tech giants with peculiar requirements or just to stack ever more commodity computers together and provide a gigantic storage space. -Filesystems can also be distributed with some replication in order to provide a fault-tolerant storage with ordinary computers sharing unused space. -.KE - -.KS -.B "UnionFS" . -UnionFS (and its variants) is a filesystem enabling several filesystems to be mounted on the same mount-point and to show superposed contents, enabling a read-only base image to be used together with persistent data for a specific instance. -This way, a -.dq "live-cd image" -for an operating system can become persistent by storing modifications on an usb stick. -UnionFS is a copy-on-write snapshotting filesystem on top of other filesystems. -.KE - -.KS -.B "Archivemount" . -Mounting a compressed archive, enabling to use day-to-day tools to search for a file in an archive without the need to uncompress it. -.KE - -.KS -.B "RAM-based filesystems" . -For temporary data, intensive read and write operations on a small storage volume or for filesystem development, a chunk of the computer memory can be used as a filesystem thanks to -.B tmpfs -and variants\*[*] (ramdisk and ramfs). -.KE -.FOOTNOTE1 -.B ramdisk -creates a block file based on a chunk of RAM that needs to be formated then mounted as any partition. 
-.B ramfs -mounts directly a RAM-based filesystem, without the need to format a fake partition. -Finally, -.B tmpfs -is the more flexible one, it is used as ramfs but can be resized and only uses a necessary amount of RAM at a given point (memory is free'd once a file is removed). -.FOOTNOTE2 - -.KS -.B "Semantic (tag-based) filesystems" . -Some filesystems (such as tagsistant) store data based on tags for each file which enables to index a file based on many attributes and not a single path. -As a side effect, searching for a file in this context can be done by computing the intersection of different tags\*[*]. -.KE -.FOOTNOTE1 -Well well well… doesn't that sound like the DODB tag triggers? -As if databases and filesystems were intertwined somehow… -.FOOTNOTE2 . .KS .SSS "Conclusion on filesystems" @@ -1380,8 +1412,6 @@ Having a directory with a few million entries is fine on modern filesystems. The first file access is slow (a few ms) then the kernel .B automatically caches the file, making it reachable in about a few dozen µs which is virtually nothing. - -TODO: des systèmes de fichiers dédiés . . .SECTION Alternatives @@ -1408,7 +1438,7 @@ These applications are inherently complex for different reasons. MadiaDB has 2.3 million lines of code (MLOC) and 1.7 MLOC for Postgres. Other mentioned DBMSs aren't open-source software, but it seems reasonable to consider their number of LOC to be in the same ballpark. .br -Just to put things into perspective, DODB is less than 1300 lines of code. +Just to put things into perspective, DODB is just a thousand lines of code. Sure, DODB doesn't have the same features, but are they worth multiplying the codebase by 1700? .FOOTNOTE2