zvault/docs/bundle_format.md

3.4 KiB

% Bundle file format

Bundle file format

The bundle file format consists of 4 parts:

  • A magic header with version
  • An encoded header structure
  • An encoded chunk list
  • The chunk data

The main reason for having those multiple parts is that it is expected that the smaller front parts can be read much faster than the the whole file. So information that is needed more frequently is put into earlier parts and the data that is need the least frequent is put into the latter part so that it does not slow down reading the front parts. Keeping those parts in separate files was also considered but rejected to increase the reliability of the storage.

Magic header with version

The first part of a bundle file contains an 8 byte magic header with version information.

The first 6 bytes of the header consist of the fixed string "zvault", followed by one byte with the fixed value 0x01. Those 7 bytes make up the magic header of the file and serve to identify the file type as a zvault bundle file.

The 8th byte of the first file part is the version of the file format. This value is currently 0x01 and is expected to be increased for any breaking changes in the file format.

Encoded header structure

The encoded header structure is the second part of the bundle file format and follows directly after the 8 bytes of the magic header.

The header structure is defined in bundle.rs as BundleInfo and contains general information on the bundle's contents and on how to decode the other two parts of the bundle file.

This header structure is encoded using the MsgPack format. It is neither compressed (since its size is pretty small) nor encrypted (since it only contains general information and no user data) in any way.

Encoded chunk list

The chunk list is the third part of the bundle file and follows directly after the encoded header structure.

The chunk list contains hashes and sizes of all chunks stored in this bundle in the order they are stored. The list is encoded efficiently as 20 bytes per chunk (16 for the hash and 4 for the size) as defined in ../util/chunk.rs.

Since the chunk list contains confidential information (the chunk hashes and sized can be used to identify files) the encoded chunk list is encrypted using the encryption method specified in the header structure. The header structure also contains the full size of the encoded and encrypted chunk list which is needed since the encryption could add some bytes for a nonce or an authentication code.

The chunk list is not compressed since the hashes have a very high entropy and do not compress significantly.

The chunk list is not stored in the header structure because it contains confidential data and the encryption method is stored in the header. Also the chunk list can be pretty big compared to the header which needs to be read more often.

Chunk data

The chunk data is the final part of a bundle file and follows after the encoded chunk list. The starting position can be obtained from the header as the encoded size of the chunk list is stored there.

The chunk data part consists of the content data of the chunks contained in this bundle simply concatenated without any separator. The actual size (and by summing up the sizes also the starting position) of each chunk can be obtained from the chunk list.

The chunk data is compressed as whole (solid archive) and encrypted with the methods specified in the bundle header structure.