Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve false positive license detection for license lists #2651

Open
pombredanne opened this issue Aug 20, 2021 · 12 comments
Open

Improve false positive license detection for license lists #2651

pombredanne opened this issue Aug 20, 2021 · 12 comments

Comments

@pombredanne
Copy link
Member

We have lists of license identifiers in code or data files that are being detected and lead to many false positive (FP). These are typically list of SPDX identifiers and are mostly found in license-related tools... or package management tools. But these tools are seen everywhere.

See this attached example code from NuGet: NuGetLicenseData.cs.txt with long lists like this:

{ "AGPL-1.0", new LicenseData(licenseID: "AGPL-1.0", isOsiApproved: false, isDeprecatedLicenseId: true, isFsfLibre: true) },
{ "AGPL-1.0-only", new LicenseData(licenseID: "AGPL-1.0-only", isOsiApproved: false, isDeprecatedLicenseId: false, isFsfLibre: false) },
{ "AGPL-1.0-or-later", new LicenseData(licenseID: "AGPL-1.0-or-later", isOsiApproved: false, isDeprecatedLicenseId: false, isFsfLibre: false) },

For now, adding new "false positive" license detection rules has been the solution to deal with list of license keys such as

The problem is a bit related to

For instance the fact that we are in code literals like in #2502 could be a clue that this an FP

We should find a better way than just adding many new false positive rules like in 5f39252 and #2505 ... some ideas:

  • Heuristics when several licenses are detected in alphabetical order using most license names
  • ML ?
  • list of known packages Purl that have these issues (think about what would happen if you scan ScanCode 🙄 )
@pombredanne
Copy link
Member Author

Another example is https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js
This is NOT detected correctly @sschuberth FYI ;)
But the way this si NOT detected could be improved

@yichong96
Copy link

Hi @pombredanne may I work on this issue ? I am a recent Computer Science graduate at NUS and I would like to contribute to this project :-)

@pombredanne
Copy link
Member Author

@yichong96 Yes and thank you!
This is not a trivial problem and it needs some design; for this the best start would be to outline and document your approach here (or in a separate document) so we can help and guide you.

@pombredanne
Copy link
Member Author

https://github.com/spdx/license-list-data/tree/master/json is another source of false positive, e.g. lists of licenses

Here the key issue stems from sequences of license matches that are typically:

  • only to short license references and tags (eg is_license_reference or is_license_tag rules)
  • matched rules are usually for a single license key and not full expressions
  • usually one match per line, with few or no line in between
  • or one match after the other on the same line and that match to the same license key
  • usually where the matched text are sorted in alphabetical order

The starting point is a sequences of license matches for a file. I would likely start with something that would walk over a list of matches, taking 5 or 6 at a time, and check the properties above.

@yichong96
Copy link

Hi @pombredanne thank you for your guidance. Am still trying to understand the code base. I would like to ask a question regarding license matching. In the match_query function, there are 3 types of matching that are run; Hashing, specifically matching spdx license and automaton matching. However, the flag to run spdx_license matching as_expression is set to False. Why is that the case ?

Regarding Hash matching of licenses, am I right to say that the only way whereby there is a match will be the case when the rule and the query file have identical words ?

@pombredanne
Copy link
Member Author

In the match_query function, there are 3 types of matching that are run; Hashing, specifically matching spdx license and automaton matching. However, the flag to run spdx_license matching as_expression is set to False. Why is that the case ?

The main entry point is Index.match() not Index.match_query() ... see in https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L781

But in all case, the as_expression flags means that the whole file or string is treated as a license expression. This is a special case that's used for matching only license expression, for instance in a package manifest.

The actual processing goes rather through these steps:
A. Using the howl file (e.g., whole query)

  1. hash match on the whole query e.g. whole file and yes its means having identical words with a rule, ignoring case, punctuation and spacing: https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L865
  2. automaton matches and then "SPDX license indentifier" matches: https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L887

B. Typically also do approximate matching in https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L894

  1. first on the whole query https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L613
  2. then on the query broken into logical chunks called "runs" https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L670

The approximate matching is a 2 step process: inverted index matching using "sets"/bag or words for ranking and multiple sequence alignment based on this ranking.

With all that, the things you should care here IMHO are a sequence of LicenseMatch https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L109 and create a new filter function like this one https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L1283 that would weed out the identified false matches. There are several examples of these functions.

Then you can add this function in refine_matches() https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L1442

@yichong96
Copy link

@pombredanne Thank you for taking the time to explain :-) Yup agreed that filtering matches is a starting point.

@pombredanne
Copy link
Member Author

@yichong96 re:

Thank you for taking the time to explain

Actually, thank you for taking the time to dig in this and study and understand this mess! ;) That's much appreciated.

@yichong96
Copy link

yichong96 commented Sep 29, 2021

@pombredanne with regard to issue #1032, the automaton matches spdx rule files to these spdx ids. There are many of such spdx rule files such as spdx_license_id_spl-1.0_for_spl-1.0.RULE with only the license ID. Is it possible to only include these spdx rule files when there appears a "spdx identifier : spdx ID" line ? It seems that matching just the spdx license ID to the query file without any context does not really give indication of the license it is using. Im not sure how many false negatives this might introduce though.

Perhaps another way would be to define a certain window of tokens before and after the token positions of the found license and then classify whether this string is code or plain text. If it is code then probably it is not a valid license. I just saw your reply on the thread here https://github.com/nexB/scancode-toolkit/issues/2304#issuecomment-718676722
)
. It seems that there were some considerations on this before.

@yichong96
Copy link

yichong96 commented Sep 30, 2021

Here the key issue stems from sequences of license matches that are typically:

only to short license references and tags (eg is_license_reference or is_license_tag rules)
matched rules are usually for a single license key and not full expressions
usually one match per line, with few or no line in between
or one match after the other on the same line and that match to the same license key
usually where the matched text are sorted in alphabetical order

Hi @pombredanne would like to clarify on some of the points made in this reply.

only to short license references and tags (eg is_license_reference or is_license_tag rules)

I think this example is quite clear to me. It is illustrated in the NugetLicenseData.txt and in the files mentioned in #1032.

matched rules are usually for a single license key and not full expressions

Does "for a single license key" mean that there are multiple sequences of LicenseMatch with the same license_expression? What does not full expression mean?

usually one match per line, with few or no line in between
or one match after the other on the same line and that match to the same license key

I think these 2 points refer to the file in https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js for the var examples portion of the code right ?

usually where the matched text are sorted in alphabetical order

https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js somewhat illustrates this point starting from the "AGPL 3" element in the var examples list ?

Thank you @pombredanne !

@pombredanne
Copy link
Member Author

Does "for a single license key" mean that there are multiple sequences of LicenseMatch with the same license_expression? What does not full expression mean?

This is rather that a LicenseMatch would have to be for a single license, e.g. the rule should be for something like mit but not something for mit OR gpl-2.0.

Also to take into account would be the fact that matches are mostly to rules tagged with is_license_reference or is_license_flag flags... We could also introduce a specific rule flag like is_license_name to tag short rules that are just for a license key or name

I think these 2 points refer to the file in https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js for the var examples portion of the code right ?

yes, that's a typical example.

https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js somewhat illustrates this point starting from the "AGPL 3" element in the var examples list ?

yes :)

@pombredanne
Copy link
Member Author

@sschuberth I know this is something you raised in https://github.com/oss-review-toolkit/ort/wiki/Developer-Meeting

Note that https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js is currently NOT detected correctly in the latest ScanCode... this ticket here is to make this better and more robust

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants