diff --git a/docs/design.md b/docs/design.md
new file mode 100644
index 0000000..2dbcefd
--- /dev/null
+++ b/docs/design.md
@@ -0,0 +1,91 @@
% Design Document
# Design Document

## Project Goals
The main goal of zVault is to provide a backup solution that is both reliable
and efficient.

Backups should be stored so that they can be restored **reliably**. A backup
that cannot be restored is worthless. Backups should be stored in a **robust**
fashion that tolerates minor damage to remote backup files as well as losses in
the local cache. There should be a way to **verify the integrity** of backups.

Backups should support **remote storage**. Remote backup files should be stored
on mounted remote storage (e.g. via `rclone mount`). To support this use case,
remote backup files should be handled with only common file operations, so that
even dumb remote filesystems work.

The backup process should be **fast**, especially in the common case where only
small changes have happened since the last backup. This means that zVault should
be able to find an existing backup for reference and use it to detect
differences.

Backups should be stored in a **space-efficient and deduplicating** way,
especially in the common case where only small changes have happened since the
last backup. Individual backups should be independent of each other to allow
the removal of single backups, e.g. based on age in a phase-out scheme.


## Backup process
The main idea of zVault is to split the data into **chunks** which are stored
remotely. The chunks are combined into **bundles**, which are compressed and
encrypted as a whole to improve the compression ratio and performance.

An **index** stores the **hashes** of all chunks together with their bundle id
and position in that bundle, so that chunks are only stored once and can be
reused by later backups. The index is implemented as a memory-mapped file to
maximize backup performance.

To split the data into chunks, a so-called **chunker** is used. The main goal of
the chunker is to produce as many identical chunks as possible when only a few
changes have happened in a file. This is especially tricky when bytes are
inserted or deleted, so that the rest of the data is shifted. To handle those
cases, the chunker splits the data at content-dependent boundaries instead of
fixed offsets.

By splitting data into chunks and storing those chunks remotely as well as in
the index, any stream of data (e.g. file contents) can be represented by a list
of chunk identifiers. This method is used to represent the contents of a file
and store it in the file metadata. The metadata is in turn encoded as a data
stream and again represented as a chunk list. Directories reference their
children (files and other directories) via the chunk lists of their metadata.
So finally, the whole directory tree of a backup can be represented as the
chunk list of the root directory, which is then stored in a separate backup
file.


## Saving space
The design of zVault contains multiple ways in which storage space can be saved.

The most important is deduplication, which makes sure that each chunk is only
stored once. If only a few changes have happened since the last backup, almost
all chunks are already present in the index and do not have to be written to
remote storage again. Depending on how little data has changed, this can save
close to 100% of the storage space of a backup.
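The following sketch illustrates how chunking, the index, and deduplication fit
together during a backup. It is a simplified model under assumed names, not the
actual zVault code: `ChunkHash`, `BundleId`, `Location`, `Index`, and the
closures passed to `store_stream` (the chunker, the hash function, and the
bundle writer) are hypothetical stand-ins, and the real index is a memory-mapped
hash table rather than an in-memory `HashMap`.

```rust
use std::collections::HashMap;

// Hypothetical identifiers -- not the actual zVault types.
type ChunkHash = [u8; 16];
type BundleId = u32;

/// Location of a stored chunk: which bundle it lives in and where.
#[derive(Clone, Copy)]
struct Location {
    bundle: BundleId,
    offset: usize,
    len: usize,
}

/// Simplified in-memory stand-in for the memory-mapped index.
#[derive(Default)]
struct Index {
    entries: HashMap<ChunkHash, Location>,
}

impl Index {
    fn get(&self, hash: &ChunkHash) -> Option<Location> {
        self.entries.get(hash).copied()
    }
    fn insert(&mut self, hash: ChunkHash, loc: Location) {
        self.entries.insert(hash, loc);
    }
}

/// Store a data stream: split it into chunks, deduplicate against the index,
/// and return the chunk list that represents the stream.
fn store_stream(
    data: &[u8],
    index: &mut Index,
    chunker: impl Fn(&[u8]) -> Vec<&[u8]>,          // content-defined chunker (assumed)
    hash: impl Fn(&[u8]) -> ChunkHash,               // chunk hash function (assumed)
    mut write_chunk: impl FnMut(&[u8]) -> Location,  // appends to the current bundle
) -> Vec<(ChunkHash, usize)> {
    let mut chunk_list = Vec::new();
    for chunk in chunker(data) {
        let h = hash(chunk);
        // Deduplication: only write the chunk if the index does not know it yet.
        if index.get(&h).is_none() {
            let loc = write_chunk(chunk);
            index.insert(h, loc);
        }
        chunk_list.push((h, chunk.len()));
    }
    chunk_list
}
```

Restoring a stream would then amount to walking its chunk list, looking each
hash up in the index, and reading the corresponding ranges from the bundles.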
Deduplication also works within a single backup: depending on the data, it can
save about 10%-20% even on new data due to repetitions within that data.

If multiple systems use the same remote storage, they can benefit from the
backups of other machines and use their chunks for deduplication. This is
especially helpful for whole-system backups where all systems run the same
operating system.

Finally, zVault compresses the bundles with a powerful compression algorithm
that reduces their size by about one third in common cases.

In total, a whole series of backups is often significantly smaller than the data
contained in any single one of those backups.


## Vacuum process
As backups are removed, some chunks become unused and could be removed to free
storage space. However, since chunks are combined into bundles, they cannot be
removed individually, and all remaining backups must be checked to make sure
the chunks are truly unused.

zVault provides an analysis method that scans all backups and identifies unused
chunks in bundles. The vacuum process can then reclaim the space occupied by
those chunks by rewriting the affected bundles. Since all used chunks of a
rewritten bundle have to be copied into new bundles while only the space of the
unused chunks is reclaimed, only bundles with a high ratio of unused chunks
should be rewritten.
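The selection step can be pictured as follows. This is a sketch under assumed
names (`select_bundles_to_rewrite`, `BundleStats`, the `threshold` parameter),
not the actual zVault implementation; it assumes the set of chunks referenced by
the remaining backups has already been collected, and it counts chunks rather
than bytes, which would be a natural refinement.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical identifiers -- not the actual zVault types.
type ChunkHash = [u8; 16];
type BundleId = u32;

/// Per-bundle statistics gathered during the analysis.
#[derive(Default)]
struct BundleStats {
    total_chunks: usize,
    unused_chunks: usize,
}

/// Given the set of chunks referenced by any remaining backup and the contents
/// of all bundles, return the bundles whose ratio of unused chunks exceeds
/// `threshold` and which are therefore worth rewriting.
fn select_bundles_to_rewrite(
    used_chunks: &HashSet<ChunkHash>,
    bundles: &HashMap<BundleId, Vec<ChunkHash>>,
    threshold: f64,
) -> Vec<BundleId> {
    let mut candidates = Vec::new();
    for (&id, chunks) in bundles {
        let mut stats = BundleStats::default();
        for hash in chunks {
            stats.total_chunks += 1;
            if !used_chunks.contains(hash) {
                stats.unused_chunks += 1;
            }
        }
        // Rewriting copies all still-used chunks into new bundles, so it only
        // pays off when a large fraction of the bundle is unused.
        let ratio = stats.unused_chunks as f64 / stats.total_chunks.max(1) as f64;
        if ratio >= threshold {
            candidates.push(id);
        }
    }
    candidates
}
```

Restricting the rewrite to bundles above such a threshold keeps the write
amplification of the vacuum process bounded.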