Design document

2017-04-01 19:42:52 +02:00 · 2017-04-01 19:42:52 +02:00 · 70e6026e5a
parent 4c07e6d5d6
commit 70e6026e5a
1 changed files with 91 additions and 0 deletions
--- a/docs/design.md
+++ b/docs/design.md
@ -0,0 +1,91 @@
+% Design Document
+# Design Document
+
+## Project Goals
+The main goal of zVault is to provide a backup solution that is both reliable
+and efficient.
+
+Backups should be stored in a way so that they can be restored **reliably**. A
+backup that can not be restored is worthless. Backups should be stored in a
+**robust** fashion that allows minor changes to remote backup files or losses in
+local cache. There should be a way to **verify the integrity** of backups.
+
+Backups should support **remote storage**. Remote backup files should be stored
+on a mounted remote storage (e.g. via `rclone mount`). To support this use case,
+remote backup files should be handled with only common file operations so that
+dumb remote filesystems can be supported.
+
+The backup process should be **fast**, especially in the common case where only
+small changes happened since the last backup. This means that zVault should be
+able to find an existing backup for reference and use it to detect differences.
+
+The backups should be stored in a **space-efficient and deduplicating** way, to
+save storage space, especially in the common case where only small changes
+happened since the last backup. The individual backups should be independent of
+each other to allow the removal of single backups based on age in a phase-out
+scheme.
+
+
+## Backup process
+The main idea of zVault is to split the data into **chunks** which are stored
+remotely. The chunks are combined in **bundles** and compressed and encrypted as
+a whole to increase the compression ratio and performance.
+
+An **index** stores **hashes** of all chunks together with their bundle id and
+position in that bundle, so that chunks are only stored once and can be reused
+by later backups. The index is implemented as a memory-mapped file to maximize
+the backup performance.
+
+To split the data into chunks a so-called **chunker** is used. The main goal of
+the chunker is to create a maximal amount of same chunks when only a few changes
+happened in a file. This is especially tricky when bytes are inserted or deleted
+so that the rest of the data is shifted. The chunker uses content-dependent
+methods to split the data in order to handle those cases.
+
+By splitting data into chunks and storing those chunks remotely as well as in
+the index, any stream of data (e.g. file contents) can be represented by a list
+of chunk identifiers. This method is used to represent the contents of a file
+and store it in the file metadata. This metadata is then encoded as a data
+stream and again represented as a chunk list. Directories contain their children
+(e.g. files and other directories) by referring to their metadata as a chunk
+list. So finally, the whole directory tree of a backup can be represented as the
+chunk list of the root directory which is then stored in a separate backup file.
+
+
+## Saving space
+The design of zVault contains multiple ways in which storage space can be saved.
+
+The most important is deduplication which makes sure that chunks are only stored
+once. If only few changes happened since the last backup, almost all chunks are
+already present in the index and do not have to be written to remote storage.
+Depending on how little data has changed since the last backup, this can save up
+to 100% of the storage space.
+
+But deduplication also works within the same backup. Depending on data,
+deduplication can save about 10%-20% even on new data due to repetitions in the
+data.
+
+If multiple systems use the same remote storage, they can benefit from backups
+of other machines and use their chunks for deduplication. This is especially
+helpful in the case of whole system backups where all systems use the same
+operating system.
+
+Finally zVault uses a powerfull compression that achieves about 1/3 space
+reduction in common cases to store the bundles.
+
+In total, a whole series of backups is often significantly smaller than the data
+contained in any of the individual backups.
+
+
+## Vacuum process
+As backups are removed, some chunks become unused and could be removed to free
+storage space. However, as chunks are combined in bundles, they can not be
+removed individually and all other backups must also be checked in order to make
+sure the chunks are truly unused.
+
+zVault provides an analysis method that scans all backups and identifies unused
+chunks in bundles. The vacuum process can then be used to reclaim the space used
+by those chunks by rewriting the effected bundles. Since all used chunks in the
+bundle need to be written into new bundles and the reclaimed space depends on
+the amount of unused chunks, only bundles with a high ratio of unused chunks
+should be rewritten.