zvault/README.md

131 lines
3.2 KiB
Markdown
Raw Normal View History

2016-09-21 08:25:33 +00:00
# ZVault Backup solution
2017-03-10 11:43:32 +00:00
## Goals / Features
### Space-efficient storage with deduplication
The backup data is split into chunks. Fingerprints make sure that each chunk is
only stored once. The chunking algorithm is designed so that small changes to a
file only change a few chunks and leave most chunks unchanged.
Multiple backups of the same data set will only take up the space of one copy.
The chunks are combined into bundles. Each bundle holds chunks up to a maximum
data size and is compressed as a whole to save space ("solid archive").
### Independent backups
All backups share common data in form of chunks but are independent on a higher
2017-03-22 16:28:45 +00:00
level. Backups can be deleted and chunks that are not used by any backup can be
2017-03-10 11:43:32 +00:00
removed.
Other backup solutions use differential backups organized in chains. This makes
those backups dependent on previous backups in the chain, so that those backups
can not be deleted. Also, restoring chained backups is much less efficient.
### Fast backup runs
* Only adding changed files
* In-Memory Hashtable
### Backup verification
* Bundles verification
* Index verification
* File structure verification
## Configuration options
There are several configuration options with trade-offs attached so these are
exposed to users.
### Chunker algorithm
The chunker algorithm is responsible for splitting files into chunks in a way
that survives small changes to the file so that small changes still yield
many matching chunks. The quality of the algorithm affects the deduplication
rate and its speed affects the backup speed.
There are 3 algorithms to choose from:
The **Rabin chunker** is a very common algorithm with a good quality but a
mediocre speed (about 350 MB/s).
The **AE chunker** is a novel approach that can reach very high speeds
(over 750 MB/s) but at a cost of quality.
The **FastCDC** algorithm has a slightly higher quality than the Rabin chunker
and is quite fast (about 550 MB/s).
The recommendation is **FastCDC**.
### Chunk size
The chunk size determines the memory usage during backup runs. For every chunk
in the backup repository, 24 bytes of memory are needed. That means that for
every GiB stored in the repository the following amount of memory is needed:
- 8 KiB chunks => 3 MiB / GiB
- 16 KiB chunks => 1.5 MiB / GiB
- 32 KiB chunks => 750 KiB / GiB
- 64 KiB chunks => 375 KiB / GiB
On the other hand, bigger chunks reduce the deduplication efficiency. Even small
changes of only one byte will result in at least one complete chunk changing.
### Hash algorithm
Blake2
Murmur3
Recommended: Blake2
### Bundle size
10 M
25 M
100 M
Recommended: 25 MiB
### Compression
Recommended: Brotli/2-7
2016-09-21 08:25:33 +00:00
## Design
## TODO
2017-03-22 16:28:45 +00:00
### Core functionality
- Recompress & combine bundles
- Allow to use tar files for backup and restore (--tar, http://alexcrichton.com/tar-rs/tar/index.html)
2017-03-22 18:58:56 +00:00
- File attributes
- xattrs https://crates.io/crates/xattr
2017-03-22 16:28:45 +00:00
2017-04-02 16:55:53 +00:00
### Formats
- Bundles
- Encrypted bundle header
- Random bundle name
- Metadata
- Arbitrarily nested chunk lists
2017-03-22 16:28:45 +00:00
### CLI functionality
- list --tree
### Other
- Stability
- Tests & benchmarks
- Chunker
- Index
- BundleDB
- Bundle map
- Config files
- Backup files
- Backup
- Prune
- Vacuum
- Documentation
- All file formats
- Design