Brotli Decompression in Rust
Once the files have been uploaded, they need to be durably persisted as long as the user wishes, and at a moment’s notice they may need to be restored to their original bits exactly in a repeatable, secure way.
For Dropbox, any decompressor must exhibit three properties:
1. it must be safe and secure, even against bytes crafted by modified or hostile clients,
2. it must be deterministic—the same bytes must result in the same output,
3. it must be fast.
With these properties we can accept any arbitrary bytes from a client and have full knowledge that those bytes factually represent the file data.
Unfortunately, the compressor supplied by the Brotli project only has the third property: it is very fast. Since the Brotli decompressor consists of a substantial amount of C code written by human beings, it is possibly neither deterministic nor safe and secure against carefully crafted hostile data. It could be both secure and deterministic, but there is simply too much code to reason through a mathematical proof of this hypothesis.
Operating at Dropbox scale, we need to guarantee the security of our data, so our approach was to break down the problem into components. By writing a new Brotli decompressor in a language that is safe and deterministic, we only needed to analyze the language, not all the code written in it. This is because such a language would prevent us from executing unsafe code (eg. array out of bounds access) or nondeterministic code (eg reading uninitialized memory), so therefore we can trust the code to repeatably produce the same output without any security risks.
The Rust programing language fits the bill perfectly: it’s a language that promises memory safety without garbage collection, concurrency without data races, and abstractions without overhead. It also has sufficient performance for our needs. That means that code written in Rust has the same memory requirements as the equivalent code written in C. At Dropbox, many of our services are actually memory bound, so this is a key advantage over a garbage collected language.