Skip to content
Philippe Ombredanne edited this page Nov 6, 2018 · 11 revisions

ScanCode is a tool to scan code and detect licenses, copyrights and more

Why ScanCode?

Discovering the origin and license for a software component is important, but it is often much harder to accomplish than it should be because:

  • A typical software project may reuse tens or hundreds of third-party software components,
  • Software authors do not always provide copyright and license information, and
  • Copyright and license information that is provided may be hard to find and interpret.

ScanCode tries to address this issue by offering:

  • A comprehensive code scanner that can detect origin or license information inside codebase files,
  • A simple command line approach that runs on Windows, Linux, and Mac,
  • Your choice of JSON or other output formats (HTML, CSV) for integration with other tools,
  • Well-tested, easy to hack, and well-documented code, and
  • Release of the code and reference data under attribution licenses (Apache 2.0 and CC-BY-1.0)

What does ScanCode Toolkit do?

ScanCode finds the provenance information that is in your codebase with a focus on:

  • Copyright and other origin clues (emails, urls, authors,etc), and
  • License notices and license text with reference information about detected licenses.

Using this data you can:

  • Discover the origin and license of the open source and third-party software components that you use,
  • Create a software component Inventory for your codebase, and
  • Use this data to comply with open source license obligations such as attribution and redistribution.

How does it work?

Given a codebase in a directory, ScanCode will:

  • Collect an inventory of the code files and classify the code using file types,
  • Extract files from any archive using a general purpose extractor,
  • Extract texts from binary files if needed,
  • Use an extensible rules engine to detect open source license text and notices,
  • Use a specialized parser to capture copyright statements,
  • Identify packaged code and collect metadata from packages,
  • Report the results in the formats of you choice (JSON, CSV, etc.) for integration with other tools, or
  • Browse the results using the AboutCode Manager companion app from https://github.com/nexB/aboutcode-manager to assist your analysis.

ScanCode should enable you to identify the “easy” cases on your own, but a software development team will probably need to build internal expertise or use outside experts (like nexB) in many cases.

ScanCode is written in Python and also uses other open source packages.

Alternatives ?

There are several utilities that do some of what ScanCode does - e.g. you can grep files for copyright and license text. This may work well for simple cases - e.g. at the single file level, but we created ScanCode for ourselves because this approach does not help you to see the recurring patterns of licenses and other provenance clues.

Or you can consider other tools such as:

  • FOSSology (open source, written in C, Linux only, GPL-licensed)
  • Ninka (open source, written in Perl, GPL-licensed)
  • Commercially-licensed tools, most written in Java

History

ScanCode was originally created by nexB to support our software audit consulting services. We have used and continuously enhanced the underlying toolkit for six years. We decided to release ScanCode as open source software to give software development teams the opportunity to perform as much of the software audit function as they like on their own.

If you have questions or are interested in nexB-provided training or support for ScanCode, please send us a note at [email protected] or visit http://www.nexb.com/.

We are part of nexB Inc. and most of us are located in the San Francisco Bay Area. Our mission is to provide the tools and services that enable and accelerate component-based software development. Reusing software components is essential for the efficient delivery of software products and systems in every industry.

Thank you for giving ScanCode a try!