Content-Defined Chunking

5 February 2019

While investigating Restic the other day for my personal backups, I came across the concept of content-defined chunking (aka sliding-block chunking, content-based slicing, and a bunch of other names). It’s a simple idea, but I thought it was pretty cool.

Background

People typically back their files up frequently (maybe daily or weekly), but most of their files change very rarely. This means backup systems can save a lot of storage and transfer if they avoid copying those unchanged files to the backup location more than once.

Many systems do this by hashing each file to be backed up, and storing each file in the backup storage using its hash as the key / filename. That way, they can easily check whether a file is already present in the remote location by checking if its hash exists there. (This does mean we need some extra housekeeping to restore the original filenames, but that is easy enough to add on top.)
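As an illustration, here is a minimal sketch of this scheme in Python, using SHA-256 and a local directory standing in for the remote backup store (the function names and the store layout are made up for the example; real tools like Restic do a lot more, such as encryption and manifests):

```python
import hashlib
from pathlib import Path

def file_hash(path: Path) -> str:
    """Stream the file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()

def backup(files: list[Path], store: Path) -> None:
    """Copy each file into `store` under its hash, skipping hashes already present."""
    store.mkdir(parents=True, exist_ok=True)
    for path in files:
        digest = file_hash(path)
        target = store / digest   # the hash is the key / filename
        if target.exists():       # already backed up, or a duplicate of another file
            continue
        target.write_bytes(path.read_bytes())
```

A real system would also record a manifest mapping original paths to hashes (the housekeeping mentioned above), so that files can be restored under their original names.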

Consider this example, where we already have 2 of 4 files backed up and are running a new backup:

Yes, I was hungry when making this graph

Notice how we’re gaining on two fronts:

  1. we don’t have to upload file1.txt or file2.txt again after the first backup (duplication in time),
  2. even though we’ve added two new files, file4.txt is a duplicate of a preexisting file (file1.txt), so we don’t have to upload it either: we only pay for uploading and storing 3 files in the backup store.

It’s important to use a cryptographic hash function for this, as the risk of collision must remain negligible: two different files hashing to the same value would silently corrupt the backup. Cryptographic hashes give us that guarantee at the expense of being slower, but typically the bottleneck is uploading, not hashing.

Content-defined chunking

This is all great, but what if you have large files that change just a little? Obviously the solution above won’t work well, as it only handles files that are exactly identical.

A simple solution would be to chunk your file into fixed-size blocks. Instead of deduplicating files, we would then deduplicate blocks. This works well enough in some cases, for instance when data is modified in place or appended at the end (a short sketch of this approach follows the figures below):

A happy case for fixed chunking

but it’s useless if we insert or delete data in the middle (or worse, at the beginning) of the file, because every block after the change will be shifted:

A not-so-good case for fixed chunking
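As a minimal sketch (in Python, with an arbitrary block size picked for the example), fixed-size chunking just walks the file in constant-size steps and hashes each block:

```python
import hashlib

BLOCK_SIZE = 4096  # arbitrary block size for this sketch

def fixed_chunks(data: bytes):
    """Split data into fixed-size blocks, yielding (hash, block) pairs."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        yield hashlib.sha256(block).hexdigest(), block
```

Inserting a single byte near the start of `data` shifts every later block by one position, so almost all block hashes change even though the content barely did.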

Instead, what if we could be smarter about detecting which parts haven’t changed? The core idea of content-defined chunking is to place block boundaries based on the content of the file itself. For instance, we could decide to create a block boundary every time we see the letter “a” in the file. Even if content is inserted in the middle of a block, the boundaries after it will most likely stay attached to the same content. This means that although the block containing the insertion can no longer be reused, the blocks that come after it should remain unaffected:

The same as above, with CDC.

We can now reuse a lot more blocks, even though the change in the middle shifted everything after it.
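A toy implementation of this “boundary after every a” rule (again in Python, purely for illustration) could look like this:

```python
import hashlib

def chunks_on_a(data: bytes):
    """Toy content-defined chunking: end a block right after every 'a' byte."""
    start = 0
    for i, byte in enumerate(data):
        if byte == ord("a"):
            block = data[start:i + 1]
            yield hashlib.sha256(block).hexdigest(), block
            start = i + 1
    if start < len(data):  # trailing block with no 'a' in it
        block = data[start:]
        yield hashlib.sha256(block).hexdigest(), block
```

For example, b"banana bread" splits into "ba", "na", "na", " brea", "d", while b"bananas and bread" splits into "ba", "na", "na", "s a", "nd brea", "d": the insertion only affects the blocks around it, and the shared blocks deduplicate as before.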

One consequence of content-defined chunking (or CDC), as shown above, is that the number and size of blocks can vary. This can be a problem for some use cases, but in the context of backup systems it generally isn’t, since we deduplicate using each block’s hash, not its index.

There are still a few points to settle. What if ‘a’s occur only infrequently, or are very irregularly distributed? We would risk ending up with a single giant block, or with a mix of very large and very small blocks.

The solution to this issue is to tweak the condition for creating a block boundary. Instead, we keep a rolling hash of the last few bytes we’ve seen (windows of 32 or 64 bytes are popular choices) as we go through the file, and we create a block boundary every time the hash is equal to (for instance) 0. If we pick a good hashing algorithm, its output should be close enough to uniformly random that we can expect 0s to show up at roughly regular intervals on average.

(Note that this is not fully true: a file that just repeats the same byte, for instance, will still result in a constant rolling hash value, and will either never create a block boundary, or create a boundary at every byte. We can fix this by including the byte index in the rolling hash.)
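Putting the rolling hash idea into code, here is a sketch with a toy polynomial rolling hash over a 64-byte window (the base and modulus are arbitrary choices for the example, not what real implementations like Restic’s chunker use; with this modulus, blocks come out around 8 KiB on average):

```python
import hashlib

WINDOW = 64  # number of trailing bytes covered by the rolling hash
BASE = 257   # polynomial base (toy choice)
MOD = 8191   # prime modulus: the hash is 0 roughly once every 8191 bytes

def cdc_chunks(data: bytes):
    """Content-defined chunking with a toy polynomial rolling hash."""
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    h = 0
    start = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that falls out of the window...
            h = (h - data[i - WINDOW] * top) % MOD
        # ...then shift the window by one and mix in the new byte.
        h = (h * BASE + byte) % MOD
        if h == 0:  # the boundary condition described above
            block = data[start:i + 1]
            yield hashlib.sha256(block).hexdigest(), block
            start = i + 1
    if start < len(data):  # whatever is left after the last boundary
        block = data[start:]
        yield hashlib.sha256(block).hexdigest(), block
```

Real chunkers typically also enforce minimum and maximum block sizes, so that pathological inputs can’t produce absurdly small or absurdly large blocks.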

We can also select the average size of blocks by trimming the hash before checking whether it is 0. For instance, if we want blocks to be $2^{20}$ bytes = 1MB on average, and our hashes are 32 bits long, we can just drop the top 12 bits to end up with a 20-bit value. Assuming the hashes are approximately uniformly distributed, every byte in the input has a probability of $2^{-20}$ of the trimmed hash being 0 at its position. Using the expectation of the geometric distribution, we get that on average a boundary will be created every $2^{20}$ bytes. This means that blocks will be 1MB on average.
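Concretely, “trimming” the hash is just masking off its high bits. Assuming a 32-bit rolling hash (unlike the small toy modulus in the sketch above), the boundary check could look like this:

```python
MASK = (1 << 20) - 1  # keep only the low 20 bits of the 32-bit rolling hash

def is_boundary(h: int) -> bool:
    # The trimmed hash is 0 with probability 2**-20 per byte, so block
    # boundaries appear every 2**20 bytes = 1MB on average.
    return h & MASK == 0
```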
