Everybody agrees that it’s important to keep our customers’ data safe and secure from attackers. A breach would entail not only possible legal ramifications, but also a loss of reputation and of user trust. No company is impenetrable, however, and regardless of how great their security is, breaches happen. Just a few months ago, Heartbleed tore the internet apart, and no company could have prepared themselves against it.
It’s important to guarantee that an attacker who gains access to your database doesn’t actually gain access to any sensitive information, and that doesn’t just mean encrypting it! Securely storing data is more difficult than you’d expect, and in this post we’re going to discuss some of the methods that exist for protecting data and what ‘gotchas’ you might need to be aware of.
In writing this post, we’ve tried to be opinionated about our recommendations and we’ve assumed that cryptography is not the core purpose of your application. If you’re writing a secure backup service like Tarsnap, or a payment processor like Square, you are likely not the target audience for this post, and should seek the advice of a professional crypto-implementer. Given tradeoffs between theoretical security and ease of implementation, we’ve opted to recommend whichever option is harder to get wrong, while still providing good security and using proven algorithms. We always recommend using libraries like libsodium over working with cryptographic primitives directly.
Take a moment to think about passwords. Intuitively, if I want to verify that somebody knows a given password, I can ask them what that password is. This works, but it requires that I know the password myself, and in the case of databases, that means storing it somewhere. However, if a breach were to occur, all of these stored passwords would then be stolen. What we need is a way of comparing two pieces of data without knowing what either of those pieces is. It’s not an easy problem to solve, but fortunately, you don’t need to – mathematicians solved it decades ago.
For cases like this, you need a cryptographic hash function. Hash functions take a piece of data as input, and return a string known as a digest, or a hash. The most important property of hashes is that it is effectively impossible to recover the original data used to produce a given hash. This means that given the hash of a password, an attacker can’t compute the password itself. The concept of hashing doesn’t only apply to passwords: it can be used for any data you want to compare against and don’t mind irrecoverably destroying (while preserving the digest). Examples include password reset tokens, two-factor backup codes, and anything else that can be used to gain access to an account.
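As a minimal sketch of the idea, the snippet below stores only the digest of a password reset token and later checks a presented token against it. The function names are hypothetical, and using plain SHA-256 here assumes the token is long and randomly generated; it is not meant as a recipe for hashing user-chosen passwords.

```python
import hashlib
import hmac
import secrets


def new_reset_token() -> tuple[str, str]:
    """Return (token to send to the user, digest to store in the database)."""
    token = secrets.token_urlsafe(32)  # random, high-entropy token
    digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
    return token, digest


def token_matches(presented_token: str, stored_digest: str) -> bool:
    """Hash the presented token and compare the two digests in constant time."""
    presented_digest = hashlib.sha256(presented_token.encode("utf-8")).hexdigest()
    return hmac.compare_digest(presented_digest, stored_digest)


# Only `digest` is written to the database; `token` goes to the user.
token, digest = new_reset_token()
print(token_matches(token, digest))
```

The database never holds the token itself, so a stolen copy of the table does not reveal a usable reset link.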
For passwords specifically, the functions to use belong to a family known as key derivation functions: they use hashing to produce a digest which is suitable for use as a password hash. Furthermore, they make use of something called a salt, meant to protect against rainbow table attacks – an attack wherein a cracker uses a gigantic precomputed table of common passwords and their corresponding hashes to find a password from a given hash. It’s a very effective attack, with top password-cracking teams making heavy use of it to crack upwards of 50,000 passwords over 48 hours at the Crack Me If You Can contest in 2013.
Salts work by ensuring that a single password can hash to multiple outputs, preventing an attacker from precomputing a table of all the possible passwords and their hashes. For this reason, it is important that each password has a long, unique salt. It is equally important to understand that the salt is not a secret: the salt is required to reconstruct a user’s password hash, and so must be stored in a manner such that it can be retrieved by the server every time the user attempts to log in.
It sounds like a lot to keep track of, but most of the work is already handled by algorithms like PBKDF2, and their respective implementations in most programming languages. There are three main choices for key-derivation functions: scrypt, bcrypt, and PBKDF2. The differences between the three are subtle enough that we recommend you use whichever has the strongest library support in your language.
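To make the salt-and-hash flow concrete, here is a rough sketch using PBKDF2 from Python’s standard `hashlib` module. The helper names, the 16-byte salt length, and the iteration count are illustrative assumptions rather than recommendations from this post; the count in particular should be tuned for your own hardware.

```python
import hashlib
import hmac
import secrets

# Assumed work factor for illustration only; tune it for your own servers.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, digest); both are stored alongside the user record."""
    salt = secrets.token_bytes(16)  # unique, random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return hmac.compare_digest(candidate, stored_digest)


salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))
```

Because every call to `hash_password` draws a fresh salt, two users with the same password end up with different digests, which is what defeats a precomputed table.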
Sometimes, you’ll find yourself storing private information which you’ll later need to fully extract in order to make use of; effectively, you need this information to be recoverable. For example, if you are storing PDF documents that your customers have uploaded, you may need to display the documents back to your customers at a later point. In this case, you’ll need to use encryption rather than hashing.
Encryption is fundamentally different from hashing in that it is reversible. The data is encrypted using secret keys, which can then be later used to decrypt the data. It is important that the encryption keys are stored separately from the data itself. This way, even if an attacker gains access to your database, they won’t have access to the keys needed to decrypt it. We recommend storing the encryption keys in environment variables, which are never checked into version control.
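One possible way to read a key from the environment is sketched below. The variable name `APP_ENCRYPTION_KEY`, the base64 encoding, and the 32-byte length check are all assumptions made for this example, not requirements of any particular library.

```python
import base64
import os


def load_encryption_key(var_name: str = "APP_ENCRYPTION_KEY") -> bytes:
    """Read a base64-encoded key from the environment and validate its length."""
    encoded = os.environ.get(var_name)
    if encoded is None:
        # Fail loudly instead of falling back to a hard-coded key.
        raise RuntimeError(f"{var_name} is not set")
    key = base64.b64decode(encoded, validate=True)
    if len(key) != 32:
        raise ValueError(f"{var_name} must decode to exactly 32 bytes")
    return key
```

Failing at startup when the variable is missing keeps a misconfigured deployment from quietly running with no key, or with a default one.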
If you’re storing something in a database with the intent of retrieving it again later, you need symmetric encryption, the most common example of which is provided by AES. AES is hard to get right, with different modes like ECB, CBC, CTR, and EAX making it nearly impossible for anybody to use it properly. The easiest way to avoid misusing AES is to not use AES. Instead, use the symmetric encryption functionality of libsodium. It uses XSalsa20 (a variant of Salsa20) with Poly1305 authentication as its algorithm, but all of that is abstracted away from the user. There are bindings for nearly every platform, and for the vast majority of use-cases, this library should handle everything for you. One important thing to note is that each row of the database should be encrypted using a different key: this will prevent the whole database from being compromised if one key is leaked. You can do this similarly to the way you handled password storage – generate some secret key, and associate a unique salt with each row of your database. Now you can use a key-derivation function with that secret key and the row’s salt to derive a new key specific to that row.
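The per-row key derivation described above might look something like the following sketch. It uses PBKDF2 from Python’s standard library as the key-derivation function; the function names, salt length, and iteration count are assumptions for illustration, and the encryption step itself (which would be handed to libsodium) is deliberately left out.

```python
import hashlib
import secrets

KEY_BYTES = 32  # the key length used by libsodium's secretbox
# Illustrative value only: the secret key is already random, so this is
# not a tuned password-hashing work factor.
ROW_KDF_ITERATIONS = 1_000


def new_row_salt() -> bytes:
    """Generate the unique salt stored next to each encrypted row."""
    return secrets.token_bytes(16)


def derive_row_key(secret_key: bytes, row_salt: bytes) -> bytes:
    """Derive a row-specific key from the application secret and the row's salt."""
    return hashlib.pbkdf2_hmac(
        "sha256", secret_key, row_salt, ROW_KDF_ITERATIONS, dklen=KEY_BYTES
    )


secret_key = secrets.token_bytes(32)  # in practice, loaded from outside the database
salt_a, salt_b = new_row_salt(), new_row_salt()
key_a = derive_row_key(secret_key, salt_a)
key_b = derive_row_key(secret_key, salt_b)
print(key_a != key_b)
```

Only the salt is stored with the row; the row key is recomputed on demand from the secret key, so it never needs to be written anywhere.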
If you have the fortune of being able to separate producers and consumers of your sensitive data, you can use a technique called asymmetric encryption, or public-key cryptography: producers encrypt with a public key, and only a consumer holding the matching private key can decrypt. The two most common algorithms for asymmetric encryption are RSA and ECC, but again, there are enough variables at play that we don’t recommend handling the cryptography on your own. Instead, go with libsodium. If you’re noticing a trend of us recommending that you avoid writing your own crypto, it’s because you should never write your own crypto. Seriously.
Depending on what you need to do with your sensitive data, you have a handful of tools available to secure it. If you need to destroy the original data but check against it for correctness later, like for a user’s password, a hash works well. If you need to retrieve the plaintext data later (as is the case for sensitive documents or personally identifiable information) you can encrypt it.
There is a lot of misinformation out there, so we put together this post as a set of guidelines on choosing the right tool for the job. Don’t use hashing when you need encryption, and vice versa. With each of these methods, it’s worth pointing out again that it’s absolutely paramount to use a standardized and proven library. Cryptography is hard – so hard, in fact, that even though we’re a security company, we would still never implement our own cryptography code, and neither should you.
As always, we’re happy to help you when you’re making these decisions, so feel free to contact us.