Audited vs. automated: What your automated open source tool isn't seeing

Authored by Don Mulrenan, Susan Miller, Rich Kosinski

Nov 21, 2023 / 5 min read

Table of Contents

Getting an inventory of your code versus an audit
Scan vs. audit
Package manager scanning
Signature matching
String searches
Manual inspection
Auditor findings by the numbers
Conclusion

Getting an inventory of your code versus an audit

Black Duck® introduced the concept of managing open source, and the licensing and security risks that come with it, back in 2002. The process and the products have matured over the last two decades. Open source management has now become nearly as commonplace as source code control, whether development shops are using tools such as Black Duck or simply maintaining a spreadsheet of what is in their code.

Automation in software composition analysis (SCA) has made it possible for organizations to stay on top of the ever-increasing amount of open source software being introduced into their products throughout the development life cycle. (Well more than half of a typical product code base is open source.) Including SCA as part of a continuous build and integration process provides managers with direct insight into the license compliance and security profiles of any open source component being incorporated into the build. These tools generally access information found within files that are necessary to build the software and that contain an index or inventory of open source components required by the application. This is a very effective method for open source discovery, as long as the code being analyzed contains these index or inventory files.

In tech merger and acquisition (M&A) transactions where the code is much of the target’s value, acquirers want to ensure that the components used are properly licensed. If they are not, the buyer might be exposed to legal issues that it will need to address. An automated tool helps a seller maintain a good inventory of the components in use. Such tools can also keep an organization from violating licensing terms and provide an idea of the vulnerabilities in the components. But in high-risk scenarios like M&A transactions, an open source audit is also called for, as it provides a thorough deep dive into the code, using automated tools as well as human investigation and verification of results to provide the most accurate snapshot view possible.

Scan vs. audit

A scan is a fully automated, push-button approach to SCA. It’s the best approach for an organization to manage its use of open source over time because it’s inexpensive in terms of developer time and is designed to avoid friction in the development process.

A Black Duck open source audit uses a range of best-in-class tools to evaluate the software assets on a one-time basis, providing a snapshot to inform an M&A transaction (or other one-time use cases). The process includes a combination of automated and forensic tools that human auditors employ to achieve the highest-quality results. In the end, expert judgment is required to get the most comprehensive, accurate view possible of software composition.

Package manager scanning

One popular technique for scanning is to look for and report on the open source components listed in build-related files, based upon the language or technology being leveraged. This approach works reasonably well with modern, strictly governed development methodologies.

This type of scanning is relatively straightforward and fairly accurate, but it is not comprehensive. What if index files are not present? What if package managers aren’t being used or developers don’t always stick to process? What if the parts of the code base being acquired predate any of these current development methodologies? Then other techniques are required for scans, and certainly for audits.
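As a rough illustration of this technique, the following Python sketch reads two common kinds of build-related files and lists the components they declare. The function names are hypothetical and the parsing is deliberately simplified (only `==` pins are read from requirements-style files); it is not how any particular SCA product works.

```python
import json
import re

# Hypothetical, simplified pattern: a package name, optionally pinned with "==".
_REQ_LINE = re.compile(r"^\s*([A-Za-z0-9._-]+)\s*(?:==\s*([^\s;#]+))?")


def parse_requirements(text):
    """Return (name, pinned-version-or-None) pairs from requirements.txt-style text."""
    components = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines, comments, and pip options such as "-r other.txt".
        if not stripped or stripped.startswith(("#", "-")):
            continue
        match = _REQ_LINE.match(stripped)
        if match:
            components.append((match.group(1), match.group(2)))
    return components


def parse_package_json(text):
    """Return (name, version-range) pairs declared in a package.json document."""
    data = json.loads(text)
    components = []
    for section in ("dependencies", "devDependencies"):
        # Sort within each section so the output is stable.
        components.extend(sorted(data.get(section, {}).items()))
    return components
```

A library that was copied into the source tree without a manifest entry never shows up in either list, which is the gap the questions above point to.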

Signature matching

Along with analyzing build-related package manager files, automated tools also offer functionality called signature matching. This is a combination of file checksum comparison and other heuristic analysis of file content and directory structures that allows SCA tools to identify source code files and snippets (copied portions of open source code) within them. Snippets can be introduced by human developers and by generative AI (GAI) tools such as GitHub Copilot and OpenAI’s ChatGPT.

Signature matching can be effective if it matches against a comprehensive knowledgebase (like the Black Duck KnowledgeBase™, which includes signatures for multiple versions of more than 5M components). But, in part because open source reuses a lot of other open source, the matching can lead to unintentional misidentifications. So whereas package manager scanning tends to under-report, this technique errs in the other direction, and a human auditor is required to verify and correct the results.
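The checksum-comparison part of this idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the index structure and function names are hypothetical, SHA-256 is an arbitrary choice of hash, and the heuristic and snippet-level analysis mentioned above is not covered at all.

```python
import hashlib
from collections import defaultdict


def digest(content: bytes) -> str:
    """Checksum used as the file signature (SHA-256 is an assumption here)."""
    return hashlib.sha256(content).hexdigest()


def build_index(known_files):
    """Map each file checksum to every (component, version) that ships that file.

    known_files: iterable of (component, version, content_bytes) tuples.
    """
    index = defaultdict(set)
    for component, version, content in known_files:
        index[digest(content)].add((component, version))
    return index


def match_files(index, scanned_files):
    """Return {path: sorted candidate components} for files whose checksum is known.

    scanned_files: dict mapping a file path to its content as bytes.
    """
    results = {}
    for path, content in scanned_files.items():
        candidates = index.get(digest(content))
        if candidates:
            results[path] = sorted(candidates)
    return results
```

When the same file ships in several open source projects, a single scanned file comes back with several candidate components. That ambiguity is one reason the results need human verification.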

String searches

Even with both of these techniques, some open source will fall through the cracks. Sophisticated string searching is another method that helps ensure completeness of the open source inventory. It operates on source code, applying predefined search terms and logic. Human auditors look at the files flagged by this process and identify potential matches to the open source packages from which the code was taken.

And it’s not just open source code that this process uncovers. There are almost 2,800 documented open source licenses, and our auditors are uncovering new, nonstandard licenses all the time (as well as third-party commercial code licenses). Sometimes they are well-known licenses that have had their language modified, which impacts the requirements and obligations of the license. Others are completely new licenses created by the copyright holder. Frequently, they are just one- or two-line statements that tell you it’s OK to use or not use the code in a commercial product. Trained auditors find and assess all of these cases during audits.
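A string search of this kind could look roughly like the Python sketch below. The search terms are hypothetical examples chosen for illustration; an actual audit would rely on a much larger, tuned set, and the output is only a list of lines for a human auditor to review, not an identification.

```python
import re

# Hypothetical search terms, for illustration only.
SEARCH_TERMS = {
    "copyright": re.compile(r"copyright\s+(?:\(c\)|©)?\s*\d{4}", re.IGNORECASE),
    "license_mention": re.compile(r"\blicen[sc]ed?\s+under\b", re.IGNORECASE),
    "usage_restriction": re.compile(r"\b(?:non-?commercial|commercial\s+use)\b", re.IGNORECASE),
    "spdx_tag": re.compile(r"SPDX-License-Identifier:\s*(\S+)"),
}


def flag_for_review(files):
    """Return (path, line_number, term, line) for every line that hits a search term.

    files: dict mapping a file path to its text content.
    """
    findings = []
    for path, text in files.items():
        for line_no, line in enumerate(text.splitlines(), start=1):
            for term, pattern in SEARCH_TERMS.items():
                if pattern.search(line):
                    findings.append((path, line_no, term, line.strip()))
    return findings
```

A one-line statement such as "Free for non-commercial use only" in a file header would be flagged here even though it matches no standard license text, leaving the interpretation to the auditor.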

Manual inspection

Our tools would not work without a comprehensive knowledgebase with over 5 million open source components and details surrounding 200,000+ security vulnerabilities. But the open source ecosystem is dynamic and constantly changing. New components, new licenses, new vulnerabilities, and ongoing changes to existing components are always being introduced, so a knowledgebase is constantly playing catch-up. An auditor reviews the components during the audit and often has to make modifications to reflect the actual details of the component. Over time, web pages go stale or disappear. Open source developers or developer communities flourish, die out, or are rolled into other entities, so the task is often to provide a more up-to-date and detailed identification, including input from the auditors themselves. All this is geared toward providing the customer with information to aid in making informed business decisions.

Auditor findings by the numbers

Here are some statistics on the kinds of unique findings that auditors make during audits in a typical year.

Nearly 200,000 component identifications were made by way of string search discoveries.
800+ nonstandard licenses were discovered and reported by our auditors.
Around 27,000 components were found by Black Duck auditors and recommended to be added to the Black Duck KnowledgeBase.
Over 1,000 component identifications made by automated tools were manually modified by auditors to align with the discoveries they made via research.

Conclusion

Automation can keep an organization on the right path to compliance. An audit will find things that automation cannot, and it is called for when completeness and accuracy of the results are required, as in high-risk situations like M&A transactions.
