Or: The ugly baby phenomenon and why you should not focus on false positives
Dr. Markus Schumacher has served as CEO and Co-Founder of Virtual Forge GmbH since 2006. The company specializes in the security of SAP applications. Dr. Schumacher was previously a representative of the Fraunhofer Institute for Secure Information Technology (SIT) and worked at SAP as a Security Product Manager (NetWeaver). His focus topics were secure development, security testing, security response, product certification (Common Criteria), and awareness events for the development crew. Before SAP, Dr. Schumacher was a member of the scientific staff at the IT Transfer Office (ITO), Department of Computer Science, Darmstadt University of Technology, where he managed projects for customers such as T-Systems Nova, Siemens AG, SAP Corporate Research, and Fujitsu Laboratories. Dr. Schumacher earned his doctorate in computer science. He has published numerous articles and books (most recently: Secure ABAP Programming at SAP Press) and speaks regularly at international conferences.
Markus met with Gary during his latest stay in Germany. After talking about software security in certain nice places in Heidelberg, the idea came up to capture some insights about software security testing in an interview. Here’s the interview as recorded on Wednesday, April 6, 2011.
Markus: Gary, we talk about software security today, in particular about finding bugs by thorough security testing. How should tests be conducted? Manually or with a tool? And which approach is better?
Gary: This is a little bit like comparing apples and oranges because both approaches can be very useful. But generally speaking, if you can automate a particular test that means that you’ll be able to apply that test consistently in the future—maybe even across your entire code base. So I’m a big fan of automating as much of testing as you can automate.
Security testing is good, but if people treat it as a ‘security meter’ that can lead to real problems. That is, confused people sometimes think that if they run automated tests and don’t find any problems, the software is free of bugs. But we both know that a result like this just means that you haven’t found anything interesting during a given test. You have to be very careful when you apply automated testing that you know what you are doing and that in the end you know what the results mean.
Does that make sense to you?
Markus: That makes perfect sense. You have observed the software security tools market for many years. We see black box scanning tools and code scanning tools out there today; what are the trends that you have observed?
Gary: When it comes to software security, there are basically two kinds of automation. One kind does black box testing and requires that your software be run. We call that dynamic testing. The idea is to test your program automatically while it’s running by providing input and seeing whether you can maliciously break it. There are such tools aimed at Web applications, IBM AppScan for example. The second type of tool is a code scanning tool that does a static analysis. A code scanning tool looks at your code instead of running your program. That is, it looks for bugs that are observable in the code itself.
These two types of tools are the biggest sellers in the software security space today. What happened over the last few years is that the code scanning tools became a lot better and they’ve begun to find widespread use. In fact they accelerated past the black box testing tools in terms of adoption a year or two ago. The reason for that is that black box tools only work for Web applications (they work only over HTTP) while the source code tools work for any kind of software. As you know, there are even specialty white box tools that look over particular languages, like the languages that are built into highly popular systems like SAP.
The ABAP tool that you guys have built is a way of looking at your ABAP code to find bugs and produce security results. In my view it’s really better to focus as early in the lifecycle as you can to find bugs, and any static analysis tool can really help to do that. Bottom line: there are many advantages to using such static tools over and above dynamic tools.
Markus: Because you build security in from day one.
Gary: That’s part of the idea. Of course you have to think about your design as well. But we haven’t figured out how to automate looking for design flaws yet!
Markus: Don’t write software – then you are good.
Gary: Sadly, that’s true.
Markus: Code analysis tools are obviously a good choice. But what are their limitations?
Gary: There are a couple of things that are problematic. One is that people think that the tool will find all possible bugs and then fix the bugs for you. That can be an issue. The thing about these tools is that mostly they help you find possible vulnerabilities, and then you have to be smart about determining whether what has been found is a real problem or not. But even more important than that, thinking about how to fix vulnerabilities is a serious problem. If current tools have a limitation it’s that they don’t fix the code, and certainly not automatically! So they are great at finding bugs but it’s up to you to fix them. If you just use these tools to find bugs, pile them up somewhere, and don’t fix them, that doesn’t help security at all.
The other problem with these tools is that because they are doing static analysis they do have the tendency to sometimes find false positives—things that the tool thinks are a problem but it turns out when you think about data flow more carefully (or whatever) they are not a problem. Plenty of people worry about the false positives problem, but I have seen the number of false positives that static tools produce over the last 5 or 7 years drop dramatically. It’s in an acceptable range now I believe.
Markus: I have talked to different clients about false positives. One of them said, ‘tools find issues – some might be false positives, others not – we review them and fix the bugs.’ Others say, ‘for many reasons – I can’t have any false positives even if the tool is sometimes finding real bugs.’ For them it’s better to not see a real bug in favor of a low false positive rate. What would you say to the latter?
Gary: I’m with the first guys. It’s much better to have a few false positives and find all of your security problems than it is to have no false positives and miss real security problems. This is because security problems are serious and they need to get fixed!
The notion of a code scanning tool sprang from a whole bunch of experience with manual code reviews, that is, digging through code by hand and looking for security bugs. We were doing a lot of that in 1998 and 1999 and we began to figure out ways to automate parts of that. We created the first code scanning tool for security, called ITS4. Things have come a very long way since then, but remember that ITS4 was just using grep-like technology looking for very simple patterns, and sometimes you can get simple patterns completely wrong.
Things have improved a huge amount since those days. I think when people talk about false positives in some sense they are using thinking that is about 10 years old (from the ITS4 days). Today the false positive rate has dropped enough that using these tools is something you really just have to do.
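To make the point about grep-like matching concrete, here is a minimal Python sketch of a purely textual scanner. It is not ITS4's code or rule set; the pattern list and the sample C snippet are invented for illustration. It shows how a simple pattern flags a real problem but also fires on a comment and a string literal.

```python
import re

# Hypothetical list of "risky" C calls; a real tool's rule set would be far larger.
RISKY_CALLS = re.compile(r"\b(strcpy|gets|sprintf)\s*\(")


def naive_scan(source: str) -> list[tuple[int, str]]:
    """Return (line_number, function_name) for every textual match."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for match in RISKY_CALLS.finditer(line):
            hits.append((number, match.group(1)))
    return hits


# Invented sample: one real issue (line 2), one mention in a comment (line 3),
# and one mention inside a string literal (line 4).
SAMPLE = '''\
char buf[16];
strcpy(buf, user_input);      /* real problem: unbounded copy */
/* TODO: never call gets() here */
puts("use strcpy(dst, src) with care");
'''

hits = naive_scan(SAMPLE)
for number, name in hits:
    print(f"line {number}: possible unsafe call to {name}")
```

Only the finding on line 2 is a real bug. The other two are false positives of the kind that pure text matching cannot avoid, because it does not distinguish code from comments or strings.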
Markus: Our strategy for lowering the false positive rate is to apply data-flow analysis consistently, doing many sanity checks like type checking, looking for authority checks, etc. That way we classify the findings – there are certain findings that we are pretty sure are always real bugs, while others are less certain and get a lower rating …
Gary: I think that’s a very good idea.
Markus: … Is this approach a good strategy? That is, starting with the findings that have a very high rating first?
Gary: Everyone has a limited amount of time to fix their code. The most important thing is not finding the bugs, but fixing them, as I have told you before. If you have a way of helping people prioritize the fixing so that they are fixing stuff that really needs to be fixed, that’s fantastic!
What we see in the field is a lot of people find a lot of bugs but not enough people do enough to fix the bugs. There’s not enough remediation going on. Let’s be clear: it does no good to find bugs if you are not going to fix them. And so I think a focus on telling people ‘this is a bug for sure, and you should fix this one because you won’t waste any of your valuable time’ is a very, very clever strategy.
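The prioritization idea discussed here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Finding` fields, the three-level rating scale, the check names, and the effort estimates are all hypothetical and do not describe how any particular tool rates or exports its findings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    check: str
    location: str
    rating: int        # hypothetical scale: 3 = very high confidence, 1 = low
    fix_hours: float   # rough, assumed estimate of the effort to fix


def plan_fixes(findings, budget_hours):
    """Pick findings in descending rating order until the time budget is used up."""
    plan, remaining = [], budget_hours
    # Highest rating first; among equal ratings, cheapest fix first.
    for f in sorted(findings, key=lambda f: (-f.rating, f.fix_hours)):
        if f.fix_hours <= remaining:
            plan.append(f)
            remaining -= f.fix_hours
    return plan


findings = [
    Finding("sql_injection", "report_a:120", rating=3, fix_hours=2.0),
    Finding("missing_authority_check", "func_b:45", rating=2, fix_hours=1.0),
    Finding("directory_traversal", "report_c:9", rating=3, fix_hours=1.0),
    Finding("generic_input_warning", "func_d:300", rating=1, fix_hours=4.0),
]
plan = plan_fixes(findings, budget_hours=4.0)
for f in plan:
    print(f.rating, f.check, f.location)
```

With a four-hour budget the plan contains the two very-high findings and the medium one. The low-rated finding is left for later, which matches the advice to fix the sure bugs first.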
Markus: We know people who say that such ‘very high’ findings are very likely true positives and consider all others with a lower rating as a false positive because they need to invest too much time on finding out whether they are bugs or not. Accordingly they claim that the false positive rate is too high and a tool might be useless because it doesn’t deliver 100% hits only. Why is it not a good idea to shoot at this false positive thing only?
Gary: If these people are fixing all of the bugs that you are telling them are bugs for sure and have extra time left over, then they can worry about that problem! But so far I haven’t seen anybody who has the luxury of that much time. That means their whole point is sort of a moot point. The answer should be: fix the ones that you know are a problem, and when you are done with that we’ll talk.
Markus: Good answer, next question.
Many people get frustrated when they start security testing because of the large number of findings that initial scans produce. How should people approach this?
Gary: The best way to do this is to turn the things that you are looking for on and off inside the tool. When you try to get people to adopt a tool for the first time, it’s better to have the tool looking for certain categories of bugs (I recommend this be as few as possible). The idea is to make sure that the tool doesn’t just overwhelm the user with a big ‘red screen of death.’
There are a couple of clever ways of doing this. We help many companies adopt such tools wisely throughout their whole development team. One very good trick is to tie the tools to code that the users want to use already. Say you have a middleware framework and you want people to use it; then you build some enforcement rules about the use of that particular code, and you focus on that instead of looking for all bugs at all times throughout the entire code base.
Another way of putting this notion is: tighten the focus of the tool so that it isn’t overwhelming at first, and then loosen that focus up, add more rules, add more kinds of bugs you are looking for over time. Start small. As the code base improves and people get better at using the tool, do more.
Markus: We have a customer following a similar strategy. They did an initial scan with all checks turned on. Then they identified all checks that lead to no findings and made those tests mandatory. Meaning: they are good in this area and they won’t get worse. And then they tightened the focus as you have described it. Like it?
Gary: That’s a good idea, because it’s sort of a belt-and-suspenders approach (so to speak). The idea of looking for certain categories of bugs should also be complemented by understanding your code base. If you run a bunch of static analyses, you should amass enough data to determine what your number one bug is. Note that your number one bug may be different from somebody else’s number one bug! Then you can set out on a bug eradication mission based on real data from a tool run over your code base, and that’s a very helpful thing.
Remember, if you are finding bugs in your code that means somebody is typing in those bugs; somebody actually wrote that bug. The best thing is to get to that person and teach them not to do it that way. The closer you can get this to the developer’s head (and fingers) the better off you’ll be in my experience.
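Both ideas from this exchange, making the checks that already come back clean mandatory and finding your own number one bug from scan data, can be sketched as follows. The check names and the flat list of findings are hypothetical; a real tool would export richer records.

```python
from collections import Counter


def summarize(findings, all_checks):
    """findings: iterable of check names, one entry per finding in a scan."""
    counts = Counter(findings)
    # The most frequent category is the candidate for the "number one bug".
    top_check, top_count = counts.most_common(1)[0] if counts else (None, 0)
    # Checks that produced no findings can be made mandatory right away.
    clean_checks = sorted(c for c in all_checks if counts[c] == 0)
    return top_check, top_count, clean_checks


# Hypothetical check names and scan results.
all_checks = {
    "sql_injection",
    "directory_traversal",
    "missing_authority_check",
    "hardcoded_user",
}
scan = [
    "sql_injection",
    "missing_authority_check",
    "sql_injection",
    "sql_injection",
    "missing_authority_check",
]

top_check, top_count, clean_checks = summarize(scan, all_checks)
print("number one bug:", top_check, f"({top_count} findings)")
print("mandatory from now on:", clean_checks)
```

In this made-up scan the number one bug is SQL injection, so that is where training and eradication would start, while the two clean checks become a gate that keeps the code base from getting worse.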
Markus: But that could be the reason for the resistance. Somebody blames the bug writer for their bad code, their (broken) piece of work. And probably companies do not have a well-developed way of dealing with accidental mistakes.
Gary: That’s right. One problem in security that we have is that developers like their code and treat it like it’s their baby. Then you come along and say, ‘That’s the ugliest baby I’ve ever seen!’ And that makes the developers angry. You really shouldn’t call somebody’s baby ugly, but in security we run around doing that all the time.
We have to understand that people are very sensitive about their code, and we have to be gentle about security problems and teach them that it’s in everybody’s best interest to find and fix these things. The good news is that most developers actually really want to build good stuff. If you say, ‘This is for helping you build better stuff. It’s not something to smack you around and make you look like an idiot, in fact it makes you build better code,’ that fits into the development culture way better.
Markus: Stay away from the ugly-baby guys and support the better developer. I like that.
Another thought on false positives. Sometimes people say that a certain finding is a false positive because there’s no data path to the vulnerability or the code touches non-critical data only. Think of a SQL injection in code that handles temporary data only. A tool cannot make a good decision here. What’s your view on this?
Gary: The answer is a bit convoluted. Because of code reuse and because people will repurpose code in surprising ways, it’s always better to fix those problems. Even if you think that in a particular situation a particular vulnerability might not lead to a security issue. Because odds are high that someone will just cut-and-paste it and use it somewhere else. And then it will be a real problem.
Markus: Cut-and-paste is one thing, another is code that is part of an API, function, or report, that might be used by someone else in a different context.
Gary: Absolutely right. That happens an awful lot.
It’s the same as putting a watchdog in code. I have seen people put a watchdog way at the beginning of code looking for certain kinds of input because there’s a vulnerability way down low in the code and they say, ‘if we strip the input so it never gets down there everything will be fine.’ But then later somebody comes along and creates a new execution path to the same vulnerability, with the watchdog so far up there that the new flow never passes through it. Then you’re screwed. That’s sort of the same idea. Bottom line: if you have a bug in your code, you should fix it.
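The watchdog scenario can be shown with a small, self-contained Python sketch that uses the standard `sqlite3` module. The table, the function names, and the two entry points are invented for illustration; the example only demonstrates the pattern, not any specific system.

```python
import sqlite3


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (name TEXT, role TEXT)")
    db.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "admin"), ("bob", "user")],
    )
    return db


def lookup_vulnerable(db, name):
    # The bug: input is concatenated into the SQL statement.
    return db.execute(
        "SELECT name FROM users WHERE name = '" + name + "'"
    ).fetchall()


def web_entry(db, name):
    # The "watchdog": strips quotes far away from the vulnerable code.
    return lookup_vulnerable(db, name.replace("'", ""))


def batch_entry(db, name):
    # A new execution path added later; it never passes the watchdog.
    return lookup_vulnerable(db, name)


def lookup_fixed(db, name):
    # The fix at the bug itself: a bound parameter, safe on every path.
    return db.execute(
        "SELECT name FROM users WHERE name = ?", (name,)
    ).fetchall()


db = make_db()
attack = "x' OR '1'='1"
guarded = web_entry(db, attack)     # watchdog strips the quotes
bypassed = batch_entry(db, attack)  # injection succeeds on the new path
fixed = lookup_fixed(db, attack)    # treated as a plain, non-matching name
print(guarded, bypassed, fixed)
```

The guarded path returns nothing, but the later path leaks every row because the vulnerable statement is still there. Fixing the statement itself protects all callers, including ones nobody has written yet.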
Markus: Period – nothing to add here, just fix it.
Final question. You’re currently working on BSIMM3. What can we expect in the new version?
Gary: We have continued to grow the size of the BSIMM study. We now have 33 firms in the study and we have done 60 measurements.
What happened last year was kind of surprising. Many of the firms that were already participating in the BSIMM asked us to measure their major divisions. For example we did six measurements inside of Bank of America. If you know that the Bank of America includes Merrill Lynch, Countrywide, and a bunch of other large financial organizations, that’s not such a big surprise. That meant we spent an awful lot of time doing BSIMM analysis inside firms that were already in the BSIMM.
So we have grown the dataset considerably—doubled it, in fact, since BSIMM2.
The other thing that we have started doing is re-measuring firms that we have already measured in the past. We have already re-measured 10 firms. So now we have data that show what happens to a software security initiative over time, and we can talk about what changed between the first and the second measurements. That’s incredibly cool, very powerful data.
Our plan for BSIMM3 is to try to get up to 40 firms and then release the longitudinal data (that is, the data over time) and the new data set with 40 firms all at the same time. I’m hoping to do that in the early summer.
Markus: Is there hope? Are things getting better?
Gary: Things are getting better. 15 years ago nobody really cared about software security. When Viega and I wrote Building Secure Software everybody thought we were crazy. A lot has changed since then. Now, developers are beginning to understand that what they do does have a clear impact on security. And a lot of firms are realizing that their customers are expecting the code to be secure. Customers may not really be explicitly saying ‘this has to be secure’ but they do (implicitly) believe that it already is secure! So it’s really important that firms meet the implicit security expectations of their customers. A lot of firms are realizing that.
As a field we have made a huge amount of progress. The other thing that happened in the past 10 years is the rise of static analysis tools that actually work and can be adopted in large enterprises. And finally the BSIMM project is a relatively new venture; we have only been doing the study for a couple of years. The BSIMM is a scientific approach that relies on effective measurement of a firm and its peer group. That way you can compare and track what many diverse firms are doing. That’s a very, very powerful thing. So we built a community of like-minded firms who are all working very hard and building up software security and are making great progress. We figured out a way to measure that progress and show it in no uncertain terms. That’s pretty cool.
Markus: Agreed. And we continue supporting BSIMM by translating it to German.
Next time we will talk about our joint invention, the NoMoRed (No More Red traffic lights) tool that deletes all bugs by just clicking a button. I’m looking forward to that. Thank you for your time today.
Transcribed in Heidelberg on April 6, 2011.
Cast (in order of appearance)