S3 Backup Encryption

Block cipher

We are using AES-256 as the block cipher.

The Advanced Encryption Standard (AES) is a symmetric-key encryption standard adopted by the U.S. government. AES was announced by National Institute of Standards and Technology (NIST) as [a standard] after a 5-year standardization process in which fifteen competing designs were presented and evaluated before Rijndael was selected as the most suitable. [...]

In June 2003, the U.S. Government announced that AES may be used to protect classified information: The design and strength of all key lengths of the AES algorithm (i.e., 128, 192 and 256) are sufficient to protect classified information up to the SECRET level. TOP SECRET information will require use of either the 192 or 256 key lengths.

We use AES with 256-bit keys (this is true for all versions of S3 Backup), but the security of encrypted data depends not only on the cipher algorithm but also on its mode of operation and the strength of the key itself. 1.0.2 will bring a significant improvement for both. Up to now we were using the ECB (electronic codebook) mode of operation and trivial padding scheme which is unfortunately not the strongest there is. From now on we are switching to CFB (cipher feedback) which is much stronger and requires no padding.

Key strengthening

Before I explain the next level of defence I need to mention what the threat is. Assuming someone wants your data, what he will need first is to get the encrypted data itself. That would be possible by stealing your AWS credentials, by hacking into Amazons datacenter or by getting your data released via a subpoena. You might be interested in the “Overview of Security Processes” whitepaper by Amazon and their security center.

The next step is to get the data decrypted, which is only possible by using the same key that was used for encryption. This key is 256-bit long and there are so many possible keys that it’s simply unfeasible to try them all.

Another thing is that the AES decryption process can be applied to any stream of bytes with any key, but the resulting data are garbage for any key except the right one, and that means that when the legitimate user tries to decrypt the data, the key needs to be somehow verified first. To do that, the key signature is calculated and stored along with the encrypted data. When decrypting, only the key with the correct signature will be accepted. The signature is calculated using a cryptographic hash function and is irreversible — knowing the signature does not allow one to just calculate the inverse and get the original key.

The attacker could use that signature to verify the attempted keys just like the app would on behalf of the legitimate user. However he needs to do it for every key he tries and there’s still a question of which keys to try — trying them all is impossible. There are a few things to consider here, one is a dictionary attack which uses password-guessing to limit the space of keys to search, another is precomputed hash maps which provide a space-time tradeoff and allow the attacker to crack a lot of keys faster, if they aren’t strong enough.

Key derivation functions

But if the key can be any 256-bit block of data, how can it be strong or weak? That is because the keys are not just random data but have to be derived from a user-provided password (or AWS secret key, if no password was given). So what happens is that we have a KDF (key derivation function) that takes a password and gives us a key useable for the AES.

A naive KDF would be vulnereble to precomputed hash maps mentioned above. The way to remedy that is to use a KDF with a random “salt” and store that salt alongside the key signature.

A lot of software then uses a regular crypto hash like SHA-256 on the salted password as the KDF and for calculating the key signature as well. The dictionary attack is still a threat in such a case though and thus we use a different KDF. Dictionary attacks involve trying different passwords, applying the KDF to them, calculating the signature and comparing it to the expected one. This takes a long time either way, but if the KDF is chosen in such a way that it intentionally takes a long time to compute, finding the password becomes infeasible, because the search becomes simply too slow.

Key derivation improvements

The function we use for this purpose is bcrypt-10, a Blowfish-based future-adaptable password scheme originating from the OpenBSD project. It has a configurable speed (number of iterations) and can be slowed down to any intended degree. An optimized implementation (such as one we’re using) takes about 1/10th of a second to run just once on a reasonably modern computer, so even a very dedicated and well-resourced attacker would have trouble making progress at that pace. As the computers get faster in the future, we will be able to ramp up the complexity by switching to bcrypt-12, bcrypt-14 etc, each successive increment being four times as demanding.

This is a huge improvement over regular hash functions which can be computed millions of times per second on the same computer. However, such a slow KDF would slow down the file transfers adding those 100ms to every upload and download, if computed for every file. To avoid this we’ve changed how the salt is generated, and instead of generating a new one for every upload we generate just one per session and use the same encryption key for all uploads in that session. The reverse lookup is optimized as well, so when downloading the files the penalty for finding a password to decrypt them is incurred only once per password-salt combination. A new initialization vector is generated for every upload.