Ask any security guy or gal how best to mitigate cross-site scripting (XSS), and what answer do you get? Some variation on validating input. Look at my own writing on this topic, and what will you find? Variations on the input validation theme. Input validation is a great solution for new applications, but it’s a horrible choice for existing ones.
Why this change of heart? Well, this is something that’s been coming for quite a while. I’ve become more and more disillusioned with input validation. Let’s start with some basics.
The first few reasons are well documented. Blacklists, whether syntactic or semantic, suffer the usual problem of blacklists: they can only look for known bad data. Worse, they often reject good data as well.
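To make both failure modes concrete, here is a minimal Python sketch. The pattern and the `passes_blacklist` helper are made up for illustration and not taken from any real validation framework.

```python
import re

# Hypothetical blacklist: reject anything containing a literal "<script".
BLACKLIST = re.compile(r"<script", re.IGNORECASE)

def passes_blacklist(value: str) -> bool:
    """Return True when the value contains none of the known-bad patterns."""
    return BLACKLIST.search(value) is None

# An attack the blacklist has never heard of gets through.
missed_attack = '<img src=x onerror="alert(1)">'

# Legitimate text that merely mentions the pattern is refused.
blocked_good = "How do I close a <script> tag in HTML?"

print(passes_blacklist(missed_attack))  # accepted
print(passes_blacklist(blocked_good))   # rejected
```

Adding more patterns narrows the first gap and widens the second.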
Whitelists are then given as the answer. But I’ve a nagging suspicion that syntactic whitelists are useless except for a small number of highly structured types. My worry comes from writing several language compilers: I always hit cases where I had to write syntax rules that were more lenient than the language allowed and then sort out the problem in semantic analysis. That was a limitation of my parsers, and my parsers were a lot more sophisticated than the regexp parsers in many validation frameworks. So syntactic whitelists have to let “bad data” through, which makes them far from a foolproof solution.
Semantic whitelists? Well, these are great for enumerations, but is every input an enumeration?
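A small sketch of that leniency, assuming a hypothetical name field: the whitelist has to admit the apostrophe for names like O'Brien, and that same character is trouble the moment the value lands unencoded inside a single-quoted attribute. `NAME_PATTERN`, `is_valid_name`, and `render_unencoded` are invented names for this example.

```python
import re

# Hypothetical syntactic whitelist for a name field: letters, spaces,
# hyphens, and the apostrophe that real names require.
NAME_PATTERN = re.compile(r"[A-Za-z' -]+")

def is_valid_name(value: str) -> bool:
    """Return True when the whole value matches the whitelist."""
    return NAME_PATTERN.fullmatch(value) is not None

def render_unencoded(name: str) -> str:
    """Deliberately naive: writes the value into a single-quoted attribute as-is."""
    return f"<input value='{name}'>"

name = "O'Brien"
print(is_valid_name(name))     # the whitelist accepts it, as it must
print(render_unencoded(name))  # the apostrophe ends the attribute early
```

The validator did its job and the markup is still broken, so the fix has to happen where the value is written out.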
Don’t even get me started about GET versus POST.
Where does this leave us? Well, it leaves us with writing guidance that looks like a patchwork quilt. You use semantic whitelists here but not there. You use syntactic whitelists for some kinds of data, but they will let bad data through.
Here’s the nail in the input validation coffin: Let’s say you have a sizable application with lots of fields of different types. What are you going to do? Take a swing at figuring out the right type of input validation for each one? What if you get it wrong? Well, if you get input validation wrong, you just broke the application for some percentage of your existing installed base. How large a percentage, you don’t know.
Why not just go with an output encoding solution? Yes, it seems like you’re snipping off the leaves of the tree. Yes, you need to build some way to tag output so that your test team can test that you’ve snipped off all the leaves.
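As a rough sketch of what encoding plus tagging could look like in Python: `html.escape` is the standard-library encoder, while the `Encoded` marker type and the `encode_html` and `render` helpers are hypothetical names made up here. The idea is that any field written out without being tagged fails loudly, which gives the test team something to check.

```python
import html

class Encoded(str):
    """Hypothetical marker type: a string that has already been HTML-encoded."""

def encode_html(value: str) -> Encoded:
    # html.escape replaces &, < and >; with quote=True it also replaces " and '.
    return Encoded(html.escape(value, quote=True))

def render(template: str, **fields: str) -> str:
    # Refuse any field that was not tagged, so a missed "leaf" fails in testing.
    for name, value in fields.items():
        if not isinstance(value, Encoded):
            raise TypeError(f"field {name!r} was not output-encoded")
    return template.format(**fields)

page = render(
    "<p>Hello, {user}</p>",
    user=encode_html("<script>alert(1)</script>"),
)
print(page)
```

This only covers HTML element and attribute content. Values written into JavaScript, CSS, or URLs need encoders for those contexts, which this sketch does not attempt.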
When I think about this problem as managing risk, am I more worried about the N% of the XSS bugs I’ve not fixed or the M% of the input fields I’ve now broken? I dunno about you, but I’d rather not break something that worked and then have to “roll back” that fix, thus [re-]opening an XSS vulnerability.