Design document

pull/10/head
Dennis Schwerdel 2017-04-01 19:42:52 +02:00 committed by Dennis Schwerdel
parent 4c07e6d5d6
commit 70e6026e5a
1 changed file with 91 additions and 0 deletions

docs/design.md Normal file

@@ -0,0 +1,91 @@
% Design Document

# Design Document

## Project Goals

The main goal of zVault is to provide a backup solution that is both reliable
and efficient.

Backups should be stored in such a way that they can be restored **reliably**. A
backup that cannot be restored is worthless. Backups should be stored in a
**robust** fashion that tolerates minor damage to remote backup files as well as
the loss of the local cache. There should be a way to **verify the integrity**
of backups.

Backups should support **remote storage**. Remote backup files should be stored
on a mounted remote storage (e.g. via `rclone mount`). To support this use case,
remote backup files should be accessed using only common file operations so that
even dumb remote filesystems can be supported.

The backup process should be **fast**, especially in the common case where only
small changes have happened since the last backup. This means that zVault should
be able to find an existing backup to use as a reference and detect the
differences against it.

The backups should be stored in a **space-efficient and deduplicated** way to
save storage space, especially in the common case where only small changes have
happened since the last backup. The individual backups should be independent of
each other, so that single backups can be removed based on age as part of a
phase-out scheme.

## Backup process

The main idea of zVault is to split the data into **chunks** which are stored
remotely. The chunks are combined into **bundles** that are compressed and
encrypted as a whole to improve the compression ratio and performance.

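As a rough sketch, bundle building could look like the following (all names and
sizes are illustrative, not the actual zVault implementation, and the
compression and encryption calls are placeholders):

```rust
/// Illustrative bundle writer: chunks are appended until the bundle is large
/// enough, then the whole payload is compressed and encrypted at once and
/// written out as a single remote file.
struct BundleWriter {
    data: Vec<u8>,           // concatenated chunk contents
    chunk_sizes: Vec<usize>, // per-chunk lengths, so chunks can be located again
}

impl BundleWriter {
    /// Illustrative size limit; the real value is configurable.
    const MAX_BUNDLE_SIZE: usize = 25 * 1024 * 1024;

    fn new() -> Self {
        BundleWriter { data: Vec::new(), chunk_sizes: Vec::new() }
    }

    /// Appends a chunk and returns its position (index) inside the bundle.
    fn add_chunk(&mut self, chunk: &[u8]) -> usize {
        self.data.extend_from_slice(chunk);
        self.chunk_sizes.push(chunk.len());
        self.chunk_sizes.len() - 1
    }

    fn is_full(&self) -> bool {
        self.data.len() >= Self::MAX_BUNDLE_SIZE
    }

    /// Compressing the concatenated chunks as one unit gives a much better
    /// ratio than compressing every small chunk on its own.
    fn finish(self) -> Vec<u8> {
        encrypt(&compress(&self.data))
    }
}

// Placeholders for a real compression codec and an authenticated cipher.
fn compress(data: &[u8]) -> Vec<u8> { data.to_vec() }
fn encrypt(data: &[u8]) -> Vec<u8> { data.to_vec() }
```
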
An **index** stores **hashes** of all chunks together with their bundle id and
position in that bundle, so that chunks are only stored once and can be reused
by later backups. The index is implemented as a memory-mapped file to maximize
the backup performance.

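A simplified picture of that mapping, using a plain in-memory `HashMap` instead
of the memory-mapped file and illustrative type names:

```rust
use std::collections::HashMap;

/// Chunk identifier; the real index stores a fixed-size hash of the chunk data.
type ChunkHash = [u8; 16];

/// Where a chunk lives: which bundle contains it and at which position.
#[derive(Clone, Copy)]
struct Location {
    bundle_id: u32,
    chunk_index: u32,
}

struct Index {
    entries: HashMap<ChunkHash, Location>,
}

impl Index {
    /// Returns the known location if the chunk was stored before (so it can be
    /// reused), otherwise records the new location and returns None.
    fn get_or_insert(&mut self, hash: ChunkHash, new_location: Location) -> Option<Location> {
        match self.entries.get(&hash) {
            Some(existing) => Some(*existing),
            None => {
                self.entries.insert(hash, new_location);
                None
            }
        }
    }
}
```
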
To split the data into chunks, a so-called **chunker** is used. The main goal of
the chunker is to produce as many identical chunks as possible when only a few
changes happened in a file. This is especially tricky when bytes are inserted or
deleted so that the rest of the data is shifted. The chunker uses
content-dependent methods to split the data in order to handle those cases.

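The following toy chunker illustrates the idea with a simple rolling hash; the
parameters and the hash function are assumptions for illustration only, not the
actual zVault chunker:

```rust
/// Toy content-defined chunker: cut points depend only on the local data, so
/// inserting or deleting bytes early in a file does not shift the chunk
/// boundaries of the unmodified remainder.
fn split_into_chunks(data: &[u8]) -> Vec<&[u8]> {
    const BITS: u32 = 13;         // aim for an average chunk size around 8 KiB
    const MIN_SIZE: usize = 1024; // avoid pathologically small chunks

    // Stand-in for the usual 256-entry table of random 64-bit values.
    fn gear(byte: u8) -> u64 {
        (byte as u64 + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15)
    }

    let mut chunks = Vec::new();
    let mut start = 0;
    let mut hash: u64 = 0;
    for i in 0..data.len() {
        hash = (hash << 1).wrapping_add(gear(data[i]));
        // Cut when the top BITS bits of the rolling hash are all zero; the
        // decision depends on the recently seen bytes, not on file offsets.
        if i + 1 - start >= MIN_SIZE && hash >> (64 - BITS) == 0 {
            chunks.push(&data[start..=i]);
            start = i + 1;
            hash = 0;
        }
    }
    if start < data.len() {
        chunks.push(&data[start..]); // final partial chunk
    }
    chunks
}
```
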
By splitting data into chunks and storing those chunks remotely as well as in
the index, any stream of data (e.g. file contents) can be represented by a list
of chunk identifiers. This is how the contents of a file are represented and
stored in the file's metadata. This metadata is in turn encoded as a data
stream and again represented as a chunk list. Directories contain their children
(files and other directories) by referring to each child's metadata as a chunk
list. So finally, the whole directory tree of a backup can be represented by the
chunk list of the root directory, which is then stored in a separate backup
file.

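The recursion can be pictured with a few simplified types (illustrative only,
not the actual zVault metadata format):

```rust
/// A chunk is identified by its hash; a stored data stream is just an ordered
/// list of chunk identifiers.
type ChunkHash = [u8; 16];
type ChunkList = Vec<ChunkHash>;

/// Simplified inode metadata.
struct Inode {
    name: String,
    contents: InodeContents,
    // ... ownership, permissions, timestamps, ...
}

enum InodeContents {
    /// A file's data is the chunk list of its contents.
    File(ChunkList),
    /// A directory maps each child name to the chunk list of that child's
    /// encoded metadata, which closes the recursion.
    Directory(Vec<(String, ChunkList)>),
}

/// A backup then only has to reference the chunk list of the encoded root
/// directory inode (plus some bookkeeping such as a timestamp).
struct Backup {
    root: ChunkList,
    timestamp: i64,
}
```
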
## Saving space

The design of zVault offers multiple ways in which storage space can be saved.
The most important is deduplication, which makes sure that chunks are only
stored once. If only few changes happened since the last backup, almost all
chunks are already present in the index and do not have to be written to remote
storage. Depending on how little data has changed since the last backup, this
can save close to 100% of the storage space.

Deduplication also works within the same backup. Depending on the data,
deduplication can save about 10% to 20% even on new data due to repetitions in
that data.

If multiple systems use the same remote storage, they can benefit from each
other's backups and reuse their chunks for deduplication. This is especially
helpful in the case of whole-system backups where all systems run the same
operating system.

Finally, zVault stores the bundles using a powerful compression algorithm that
reduces their size by about one third in common cases.

In total, a whole series of backups is often significantly smaller than the data
contained in any of the individual backups.

## Vacuum process

As backups are removed, some chunks become unused and could be removed to free
storage space. However, since chunks are combined into bundles, they cannot be
removed individually, and all remaining backups must be checked to make sure
that the chunks are truly unused.

zVault provides an analysis method that scans all backups and identifies unused
chunks in bundles. The vacuum process can then be used to reclaim the space
occupied by those chunks by rewriting the affected bundles. Since all chunks of
a bundle that are still in use have to be copied into new bundles, while the
reclaimed space depends on the amount of unused chunks, only bundles with a high
ratio of unused chunks should be rewritten.

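A sketch of that selection step, assuming per-bundle usage counts have already
been gathered by the analysis (names and the threshold are illustrative):

```rust
/// Usage statistics gathered by scanning all backups.
struct BundleUsage {
    bundle_id: u32,
    used_chunks: usize,
    total_chunks: usize,
}

/// Selects the bundles worth rewriting: those where the share of unused chunks
/// exceeds the given threshold (e.g. 0.5), so that the space gained justifies
/// copying the still-used chunks into new bundles.
fn bundles_to_rewrite(usage: &[BundleUsage], threshold: f64) -> Vec<u32> {
    usage
        .iter()
        .filter(|b| {
            if b.total_chunks == 0 {
                return false;
            }
            let unused = b.total_chunks.saturating_sub(b.used_chunks) as f64;
            unused / b.total_chunks as f64 >= threshold
        })
        .map(|b| b.bundle_id)
        .collect()
}
```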