zvault(1) -- Deduplicating backup solution
==========================================

## SYNOPSIS

`zvault <SUBCOMMAND>`



## DESCRIPTION

ZVault is a deduplicating backup solution. It creates backups from data read
from the filesystem or a tar file, deduplicates it, optionally compresses and
encrypts the data and stores the data in bundles at a potentially remote storage
location.



## OPTIONS

  * `-h`, `--help`:

    Prints help information


  * `-V`, `--version`:

    Prints version information



## SUBCOMMANDS


### Main Commands

  * `init`          Initialize a new repository, _zvault-init(1)_
  * `import`        Reconstruct a repository from the remote storage, _zvault-import(1)_
  * `backup`        Create a new backup, _zvault-backup(1)_
  * `restore`       Restore a backup or subtree, _zvault-restore(1)_
  * `check`         Check the repository, a backup or a backup subtree, _zvault-check(1)_
  * `list`          List backups or backup contents, _zvault-list(1)_
  * `info`          Display information on a repository, a backup or a subtree, _zvault-info(1)_
  * `mount`         Mount the repository, a backup or a subtree, _zvault-mount(1)_
  * `remove`        Remove a backup or a subtree, _zvault-remove(1)_
  * `prune`         Remove backups based on age, _zvault-prune(1)_
  * `vacuum`        Reclaim space by rewriting bundles, _zvault-vacuum(1)_


### Other Commands

  * `addkey`        Add a key pair to the repository, _zvault-addkey(1)_
  * `algotest`      Test a specific algorithm combination, _zvault-algotest(1)_
  * `analyze`       Analyze the used and reclaimable space of bundles, _zvault-analyze(1)_
  * `bundleinfo`    Display information on a bundle, _zvault-bundleinfo(1)_
  * `bundlelist`    List bundles in a repository, _zvault-bundlelist(1)_
  * `config`        Display or change the configuration, _zvault-config(1)_
  * `diff`          Display differences between two backup versions, _zvault-diff(1)_
  * `genkey`        Generate a new key pair, _zvault-genkey(1)_
  * `versions`      Find different versions of a file in all backups, _zvault-versions(1)_


## USAGE

### Path syntax

Most subcommands work with a repository that has to be specified as a parameter.
If this repository is specified as `::`, the default repository in `~/.zvault`
will be used instead.

Some subcommands need to reference a specific backup in the repository. This is
done via the syntax `repository::backup_name` where `repository` is the path to
the repository and `backup_name` is the name of the backup in that repository
as listed by `zvault list`. The `repository` part can be omitted, shortening
the syntax to `::backup_name`, in which case the default repository is used.

Some subcommands need to reference a specific subtree inside a backup. This is
done via the syntax `repository::backup_name::subtree` where
`repository::backup_name` specifies a backup as described before and `subtree`
is the path to the subtree of the backup. Again, `repository` can be omitted,
yielding the shortened syntax `::backup_name::subtree`.

Some subcommands can take either a repository, a backup or a backup subtree. In
this case it is important to note that if a path component is empty, it is
regarded as not set at all.

Examples:

- `~/.zvault` references the repository in `~/.zvault` and is identical to
  `::`.
- `::backup1` references the backup `backup1` in the default repository.
- `::backup1::/` references the root folder of the backup `backup1` in the
  default repository.
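
The path syntax described above can be illustrated with a small Python sketch.
This is not zVault's own parser: the names `PathRef` and `parse_path` are
hypothetical, and treating everything after the second `::` as the subtree is
an assumption made for this example.

```python
from typing import NamedTuple, Optional

DEFAULT_REPOSITORY = "~/.zvault"  # default repository named in the text


class PathRef(NamedTuple):
    repository: str
    backup: Optional[str]
    subtree: Optional[str]


def parse_path(spec: str) -> PathRef:
    """Split `repository::backup_name::subtree` into its components.

    Empty components are regarded as not set, as described above.
    """
    parts = spec.split("::", 2)
    parts += [""] * (3 - len(parts))  # pad missing components
    repository, backup, subtree = (part or None for part in parts)
    return PathRef(repository or DEFAULT_REPOSITORY, backup, subtree)


if __name__ == "__main__":
    for spec in ("::", "~/.zvault", "::backup1", "::backup1::/"):
        print(spec, "->", parse_path(spec))
```

In this sketch an unset repository falls back to the default one, while an
unset backup or subtree is reported as `None`.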


## CONFIGURATION OPTIONS
ZVault offers some configuration options that affect the backup speed, storage
space, security and RAM usage. Users should select them carefully for their
scenario. The performance of different combinations can be compared using
_zvault-algotest(1)_.


### Bundle size
The target bundle size affects how big bundles written to the remote storage
will become. The configured size is not a hard maximum, as headers and some
effects of compression can cause bundles to become slightly larger than this
size. Also, since bundles are closed at the end of a backup run, some bundles
can be smaller than this size. However, most bundles will end up with
approximately the specified size.

The configured value for the bundle size has some practical consequences.
Since each bundle is compressed as a whole (a so-called *solid archive*),
the compression ratio is impacted negatively if bundles are small. Also, the
remote storage could become inefficient if too many small bundle files are
stored. On the other hand, since the whole bundle has to be fetched and
decompressed to read a single chunk from that bundle, bigger bundles increase
the overhead of reading the data.

The recommended bundle size is 25 MiB, but values between 5 MiB and 100 MiB
should also be feasible.


### Chunker
The chunker is the component that splits the input data into so-called *chunks*.
The main goal of the chunker is to produce as many identical chunks as possible
when only small parts of the data changed since the last backup. The identical
chunks do not have to be stored again, thus the input data is deduplicated.
To achieve this goal, the chunker splits the input data based on the data
itself, so that identical parts can be detected even when their position
changed.
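
The idea of splitting on the data itself can be sketched in Python. The example
below uses a simple gear-style rolling hash and is only an illustration: it is
none of the **rabin**, **ae** or **fastcdc** implementations in zVault, and the
names `GEAR` and `split_chunks` as well as the seeds are assumptions made for
this sketch.

```python
import random

MASK64 = (1 << 64) - 1

# Table of pseudo-random 64-bit values, one per byte value (fixed seed so the
# sketch is reproducible).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def split_chunks(data: bytes, avg_bits: int = 10) -> list:
    """Cut after every byte where the top `avg_bits` bits of the hash are zero.

    The hash only depends on the last 64 bytes, so cut points follow the
    content and not the absolute position. The average chunk size is roughly
    2**avg_bits bytes.
    """
    chunks = []
    start = 0
    h = 0
    shift = 64 - avg_bits
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        if h >> shift == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # remaining tail without a cut point
    return chunks


if __name__ == "__main__":
    data_rng = random.Random(1)
    original = bytes(data_rng.randrange(256) for _ in range(50_000))
    shifted = b"X" + original  # one byte inserted at the very beginning
    before = split_chunks(original)
    after = split_chunks(shifted)
    unchanged = len(set(before) & set(after))
    print(f"{unchanged} of {len(before)} chunks are unchanged")
```

With fixed-size blocks, inserting a single byte at the beginning would shift
every block boundary and no block would match. With content-defined cut points,
only the chunks near the insertion differ and the remaining chunks can be
deduplicated.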

ZVault offers different chunker algorithms with different properties to choose
from:

- The **rabin** chunker is a very common algorithm with a good quality but a
  mediocre speed (about 350 MB/s).
- The **ae** chunker is a novel approach that can reach very high speeds
  (over 750 MB/s) at the cost of a lower deduplication rate.
- The **fastcdc** algorithm reaches a deduplication rate similar to that of the
  rabin chunker but is faster (about 550 MB/s).

The recommended chunker is **fastcdc**.

Besides the chunker algorithm, an important setting is the target chunk size,
i.e. the planned average chunk size. Since the chunker splits the data on
data-dependent criteria, it will not achieve the configured size exactly.
The chunk size has a number of practical implications. Since deduplication works
by identifying identical chunks, smaller chunk sizes will be able to find more
identical chunks and thereby reduce the overall storage space.

On the other hand, the index needs to store 24 bytes per chunk, so many small
chunks will take more space than few big chunks. Since the index of all chunks
in the repository needs to be loaded into memory during the backup, huge
repositories can run into problems with memory usage. Since the index could be
only 40% filled and the chunker could yield smaller chunks than configured,
100 bytes per chunk should be a safe value to calculate with.

The configured value for chunk size needs to be a power of 2. Here is a
selection of chunk sizes and their estimated RAM usage:

- Chunk size 4 KiB => ~40 GiB data stored in 1 GiB RAM
- Chunk size 8 KiB => ~80 GiB data stored in 1 GiB RAM
- Chunk size 16 KiB => ~160 GiB data stored in 1 GiB RAM
- Chunk size 32 KiB => ~325 GiB data stored in 1 GiB RAM
- Chunk size 64 KiB => ~650 GiB data stored in 1 GiB RAM
- Chunk size 128 KiB => ~1.3 TiB data stored in 1 GiB RAM
- Chunk size 256 KiB => ~2.5 TiB data stored in 1 GiB RAM
- Chunk size 512 KiB => ~5 TiB data stored in 1 GiB RAM
- Chunk size 1024 KiB => ~10 TiB data stored in 1 GiB RAM
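
The estimates in this list follow from the 100 bytes per chunk mentioned above.
A minimal Python sketch of the calculation (the function name
`data_per_gib_ram` is chosen for this example only):

```python
BYTES_PER_CHUNK = 100  # safe per-chunk index estimate from the text
GIB = 1024 ** 3


def data_per_gib_ram(chunk_size_kib: int) -> float:
    """Return the amount of data (in GiB) that 1 GiB of index RAM can cover."""
    chunks = GIB / BYTES_PER_CHUNK            # chunks that fit into 1 GiB
    return chunks * chunk_size_kib * 1024 / GIB


if __name__ == "__main__":
    for size in (4, 8, 16, 32, 64, 128, 256, 512, 1024):
        print(f"Chunk size {size} KiB => ~{data_per_gib_ram(size):.0f} GiB data per GiB RAM")
```

The values in the list are rounded versions of these results, for example
about 164 GiB for 16 KiB chunks and about 10.2 TiB for 1024 KiB chunks.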

The recommended chunk size for normal computers is 16 KiB. Servers with lots of
data might want to use 128 KiB or 1024 KiB instead.

The chunker algorithm and chunk size are configured together in the format
`algorithm/size` where algorithm is one of `rabin`, `ae` and `fastcdc` and size
is the size in KiB, e.g. `16`. So the recommended configuration is `fastcdc/16`.
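
The rules for the `algorithm/size` format can be summarized in a small Python
check. This is a hypothetical helper (`parse_chunker`) written for this page
and not the parser used by zVault; it only encodes the constraints stated
above.

```python
CHUNKER_ALGORITHMS = ("rabin", "ae", "fastcdc")


def parse_chunker(spec: str) -> tuple:
    """Validate an `algorithm/size` string and return (algorithm, size in KiB)."""
    algorithm, separator, size_text = spec.partition("/")
    if not separator:
        raise ValueError("expected the format algorithm/size")
    if algorithm not in CHUNKER_ALGORITHMS:
        raise ValueError(f"unknown chunker algorithm: {algorithm}")
    size_kib = int(size_text)  # raises ValueError if not a number
    # A power of 2 has exactly one bit set.
    if size_kib <= 0 or size_kib & (size_kib - 1) != 0:
        raise ValueError("chunk size needs to be a power of 2")
    return algorithm, size_kib


if __name__ == "__main__":
    print(parse_chunker("fastcdc/16"))
```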

Please note that since the chunker algorithm and chunk size affect the chunks
created from the input data, any change to those values will make existing
chunks inaccessible for deduplication purposes. The old data is still readable
but new backups will have to store all data again.


### Compression
ZVault offers different compression algorithms that can be used to compress the
stored data after deduplication. The compression ratio that can be achieved
mostly depends on the input data (text data can be compressed well, while media
data like music and videos is already compressed and cannot be compressed
significantly).

Using a compression algorithm is a trade-off between backup speed and storage
space. Higher compression takes longer and saves more space while low
compression is faster but needs more space.

ZVault supports the following compression methods:

- **deflate** (the algorithm used by *zlib* and *gzip*) is the most common
  algorithm today, which makes it very likely that backups can still be
  decompressed far in the future. Its speed and compression ratio are
  acceptable but other algorithms are better. This is a rather conservative
  choice. This algorithm supports the levels 1 (fastest) to 9 (best).
- **lz4** is a very fast compression algorithm that does not impact backup speed
  very much. It does not compress as well as other algorithms but is much faster
  than all other algorithms. This algorithm supports levels 1 (fastest) to 14
  (best) but levels above 7 are significantly slower and not recommended.
- **brotli** is a modern compression algorithm that is both faster and
  compresses better than deflate. It offers a big range of compression ratios
  and speeds via its levels. This algorithm supports levels 1 (fastest) to 10
  (best).
- **lzma** is among the algorithms with the best compression available today.
  That comes at the cost of speed. LZMA is rather slow at all levels so it can
  slow down the backup significantly. This algorithm supports levels 1
  (fastest) to 9 (best).

The recommended combinations are:

- Focus on speed: lz4 with levels between 1 and 7
- Balanced focus: brotli with levels between 1 and 10
- Focus on storage space: lzma with levels between 1 and 9

The compression algorithm and level are configured together via the syntax
`algorithm/level` where `algorithm` is either `deflate`, `lz4`, `brotli` or
`lzma` and `level` is a number.
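
The level ranges listed above can be collected in a small Python check for the
`algorithm/level` syntax. This is a hypothetical helper (`parse_compression`)
and not zVault's own parser; it only covers the four algorithms and level
ranges named in this section.

```python
# Level ranges (lowest, highest) as listed in this section.
COMPRESSION_LEVELS = {
    "deflate": (1, 9),
    "lz4": (1, 14),
    "brotli": (1, 10),
    "lzma": (1, 9),
}


def parse_compression(spec: str) -> tuple:
    """Validate an `algorithm/level` string and return (algorithm, level)."""
    algorithm, separator, level_text = spec.partition("/")
    if not separator:
        raise ValueError("expected the format algorithm/level")
    if algorithm not in COMPRESSION_LEVELS:
        raise ValueError(f"unknown compression algorithm: {algorithm}")
    level = int(level_text)  # raises ValueError if not a number
    lowest, highest = COMPRESSION_LEVELS[algorithm]
    if not lowest <= level <= highest:
        raise ValueError(f"{algorithm} supports levels {lowest} to {highest}")
    return algorithm, level


if __name__ == "__main__":
    print(parse_compression("brotli/3"))
```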

The default compression setting is **brotli/3**.

Since the compression ratio and speed hugely depend on the input data,
_zvault-algotest(1)_ should be used to compare algorithms with actual input
data.



### Encryption
When enabled, zVault uses modern encryption provided by *libsodium* to encrypt
the bundles that are stored remotely. This makes it impossible for anyone with
access to the remote bundles to read their contents or to modify them.

zVault uses asymmetric encryption, which means that encryption uses a so-called
*public key* and decryption uses a different *secret key*. This makes it
possible to set up a backup configuration where the machine can only create
backups but not read them. Since lots of subcommands need to read the backups,
this setup is not recommended in general.

The key pairs used by zVault can be created by _zvault-genkey(1)_ and added to a
repository via _zvault-addkey(1)_ or upon creation via the `--encryption` flag
in _zvault-init(1)_.

**Important: The key pair is needed to read and restore any encrypted backup.
Losing the secret key means that all data in the backups is lost forever.
There is no backdoor; even the developers of zVault cannot recover a lost key
pair. So it is important to store the key pair in a safe location. The key pair
is small enough to be printed on paper, for example.**


### Hash method
ZVault uses hash fingerprints to identify chunks. It is critically important
that no two chunks have the same hash value (a so-called hash collision) as this
would cause one chunk to overwrite the other chunk. For this purpose zVault uses
128-bit hashes, which have a collision probability of less than 1.5e-15 even for
1 trillion stored chunks (about 15,000 TiB stored data in 16 KiB chunks).
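
The numbers above can be reproduced with the usual birthday approximation,
which estimates the collision probability for `n` random hashes of `b` bits as
roughly `n^2 / 2^(b+1)`. A minimal Python sketch (the function names are chosen
for this example only):

```python
def collision_probability(chunks: int, hash_bits: int = 128) -> float:
    """Birthday approximation: n^2 / 2^(bits + 1), valid while the result is small."""
    return chunks * chunks / 2 ** (hash_bits + 1)


def stored_data_tib(chunks: int, chunk_size_kib: int = 16) -> float:
    """Amount of data in TiB covered by the given number of chunks."""
    return chunks * chunk_size_kib / 1024 ** 3  # KiB -> TiB


if __name__ == "__main__":
    chunks = 10 ** 12  # 1 trillion chunks
    print(f"collision probability: {collision_probability(chunks):.3e}")
    print(f"stored data: {stored_data_tib(chunks):.0f} TiB")
```

This estimate only applies to random collisions. It says nothing about
collisions that are constructed on purpose, which is why the choice of hash
algorithm below matters.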

ZVault offers two different hash algorithms: **blake2** and **murmur3**.

Murmur3 is blazingly fast but is not cryptographically secure. That means that
while random hash collisions are negligible, an attacker with access to files
could manipulate a file so that it causes a hash collision and affects other
data in the repository. **This hash should only be used when the security
implications of this are fully understood.**

Blake2 is slower than murmur3 but still pretty fast, and it is
cryptographically secure, i.e. even an attacker cannot cause hash collisions.

The recommended hash algorithm is **blake2**.



## EXAMPLES

This command will initialize a repository in the default location with
encryption enabled:

    $> zvault init :: -e --remote /mnt/remote/backups

Before using this repository, the key pair located at `~/.zvault/keys` should be
backed up in a safe location (e.g. printed to paper).

This command will create a backup of the whole system tagged by date:

    $> zvault backup / ::system/$(date +%F)

If the home folders are mounted on /home, the following command can be used to
back them up separately (zVault will not back up mounted folders by default):

    $> zvault backup /home ::homes/$(date +%F)

The backups can be listed by this command:

    $> zvault list ::

and inspected by this command (the date needs to be adapted):

    $> zvault info ::homes/2017-04-06

To restore some files from a backup, the following command can be used:

    $> zvault restore ::homes/2017-04-06::bob/file.dat /tmp

Alternatively, the backup can be mounted with this command:

    $> zvault mount ::homes/2017-04-06 /mnt/tmp

A single backup can be removed with this command:

    $> zvault remove ::homes/2017-04-06

Multiple backups can be removed based on their date with the following command
(add `-f` to actually remove backups):

    $> zvault prune :: --prefix system --daily 7 --weekly 5 --monthly 12

To reclaim storage space after removing some backups, `vacuum` needs to be run
(add `-f` to actually remove bundles):

    $> zvault vacuum ::



## COPYRIGHT

Copyright (C) 2017  Dennis Schwerdel
This software is licensed under GPL-3 or newer (see LICENSE.md)