Improve false positive license detection for license lists #2651

pombredanne · 2021-08-20T08:27:21Z

We have lists of license identifiers in code or data files that are being detected and lead to many false positives (FPs). These are typically lists of SPDX identifiers and are mostly found in license-related tools... or package management tools. But these tools are seen everywhere.

See this attached example code from NuGet: NuGetLicenseData.cs.txt with long lists like this:

{ "AGPL-1.0", new LicenseData(licenseID: "AGPL-1.0", isOsiApproved: false, isDeprecatedLicenseId: true, isFsfLibre: true) },
{ "AGPL-1.0-only", new LicenseData(licenseID: "AGPL-1.0-only", isOsiApproved: false, isDeprecatedLicenseId: false, isFsfLibre: false) },
{ "AGPL-1.0-or-later", new LicenseData(licenseID: "AGPL-1.0-or-later", isOsiApproved: false, isDeprecatedLicenseId: false, isFsfLibre: false) },

For now, adding new "false positive" license detection rules has been the solution for dealing with lists of license keys such as

The problem is a bit related to

Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary an/or not on a single line and/or in giberish #2403
... in the sense that the context in which we find a license matters: for bare words, the case, being surrounded by gibberish, or being found in a binary may each be a false positive clue.

For instance, the fact that we are in code literals like in #2502 could be a clue that this is an FP

We should find a better way than just adding many new false positive rules like in 5f39252 and #2505 ... some ideas:

Heuristics when several licenses are detected in alphabetical order using most license names
ML ?
list of known packages Purl that have these issues (think about what would happen if you scan ScanCode 🙄 )


pombredanne · 2021-08-25T08:34:35Z

Another example is https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js
This is NOT detected correctly @sschuberth FYI ;)
But the way this is NOT detected could be improved

yichong96 · 2021-09-21T01:09:27Z

Hi @pombredanne may I work on this issue ? I am a recent Computer Science graduate at NUS and I would like to contribute to this project :-)

pombredanne · 2021-09-21T07:14:55Z

@yichong96 Yes and thank you!
This is not a trivial problem and it needs some design; for this the best start would be to outline and document your approach here (or in a separate document) so we can help and guide you.

pombredanne · 2021-09-23T14:56:11Z

https://github.com/spdx/license-list-data/tree/master/json is another source of false positives, e.g. lists of licenses

Here the key issue stems from sequences of license matches that are typically:

only to short license references and tags (eg is_license_reference or is_license_tag rules)
matched rules are usually for a single license key and not full expressions
usually one match per line, with few or no lines in between
or one match after the other on the same line and that match to the same license key
usually where the matched texts are sorted in alphabetical order

The starting point is a sequence of license matches for a file. I would likely start with something that would walk over a list of matches, taking 5 or 6 at a time, and check the properties above.
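The walk described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Match` is a hypothetical stand-in for a license match (it is not ScanCode's `LicenseMatch` class), and the names `looks_like_license_list`, `flag_license_lists`, `max_line_gap` and `window_size` are invented for the example. For simplicity it treats alphabetical order as required, although the list above only says "usually".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    # Hypothetical stand-in for a license match; not ScanCode's actual class.
    license_expression: str
    start_line: int
    end_line: int
    matched_text: str
    is_reference_or_tag: bool = True


def is_single_key(expression: str) -> bool:
    # "mit" is a single key; "mit OR gpl-2.0" is a full expression.
    return len(expression.split()) == 1


def looks_like_license_list(window: List[Match], max_line_gap: int = 2) -> bool:
    # Every match must be a short reference or tag for a single license key.
    if not all(m.is_reference_or_tag and is_single_key(m.license_expression)
               for m in window):
        return False
    # Matches must sit on the same line or on nearby lines.
    for prev, cur in zip(window, window[1:]):
        if cur.start_line - prev.end_line > max_line_gap:
            return False
    # Matched texts must be in (non-strict) alphabetical order.
    texts = [m.matched_text.lower() for m in window]
    return texts == sorted(texts)


def flag_license_lists(matches: List[Match], window_size: int = 5) -> List[int]:
    # Slide a fixed-size window over the matches and collect the indexes
    # of every match that falls in at least one list-like window.
    flagged = set()
    for i in range(len(matches) - window_size + 1):
        if looks_like_license_list(matches[i:i + window_size]):
            flagged.update(range(i, i + window_size))
    return sorted(flagged)
```

A file with fewer matches than `window_size` is never flagged in this sketch, which seems a reasonable default since a short list gives little evidence either way.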

yichong96 · 2021-09-28T00:23:39Z

Hi @pombredanne thank you for your guidance. Am still trying to understand the code base. I would like to ask a question regarding license matching. In the match_query function, there are 3 types of matching that are run; Hashing, specifically matching spdx license and automaton matching. However, the flag to run spdx_license matching as_expression is set to False. Why is that the case ?

Regarding Hash matching of licenses, am I right to say that the only way whereby there is a match will be the case when the rule and the query file have identical words ?

pombredanne · 2021-09-28T15:44:51Z

In the match_query function, there are 3 types of matching that are run; Hashing, specifically matching spdx license and automaton matching. However, the flag to run spdx_license matching as_expression is set to False. Why is that the case ?

The main entry point is Index.match() not Index.match_query() ... see in https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L781

But in any case, the as_expression flag means that the whole file or string is treated as a license expression. This is a special case that's used for matching only license expressions, for instance in a package manifest.

The actual processing goes rather through these steps:
A. Using the whole file (e.g., whole query)

hash match on the whole query, e.g. the whole file; and yes, it means having identical words with a rule, ignoring case, punctuation and spacing: https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L865
automaton matches and then "SPDX license identifier" matches: https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L887
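To make the whole-query hash match concrete, here is a minimal sketch of the idea: two texts hash to the same value when their words are identical after ignoring case, punctuation and spacing. The tokenizer, the use of SHA-1 and the function names are assumptions made for this illustration; ScanCode's own tokenization and hashing differ in their details.

```python
import hashlib
import re
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    # Lowercase and keep only alphanumeric runs, so case, punctuation
    # and spacing do not affect the result.
    return re.findall(r"[a-z0-9]+", text.lower())


def tokens_hash(tokens: List[str]) -> str:
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()


def build_hash_index(rules: Dict[str, str]) -> Dict[str, str]:
    # Map the hash of each rule's token sequence to the rule identifier.
    return {tokens_hash(tokenize(text)): rule_id for rule_id, text in rules.items()}


def hash_match(query_text: str, index: Dict[str, str]) -> Optional[str]:
    # A hit requires the whole query to have exactly the same words as a rule.
    return index.get(tokens_hash(tokenize(query_text)))
```

For example, with a hypothetical rule `{"mit_tag": "License: MIT"}`, the query `"license:   mit"` matches, while `"License: MIT and more"` does not because it has extra words.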

B. Typically also do approximate matching in https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L894

first on the whole query https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L613
then on the query broken into logical chunks called "runs" https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L670

The approximate matching is a two-step process: inverted index matching using sets and bags of words for ranking, then multiple sequence alignment based on this ranking.
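The two steps can be pictured with a toy example. This is a simplified sketch and not ScanCode's implementation: the ranking score (the share of a rule's words found in the query) is an assumption chosen for clarity, and the alignment step uses Python's standard `difflib.SequenceMatcher` as a swapped-in stand-in for the real sequence alignment. All function names are hypothetical.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_candidates(query_text: str, rules: Dict[str, str], top: int = 3) -> List[str]:
    # Step 1: rank rules by how much of each rule's bag of words
    # is also present in the query's bag of words.
    query_bag = Counter(tokenize(query_text))
    scored = []
    for rule_id, text in rules.items():
        rule_bag = Counter(tokenize(text))
        overlap = sum((query_bag & rule_bag).values())
        if overlap:
            scored.append((overlap / sum(rule_bag.values()), rule_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [rule_id for _, rule_id in scored[:top]]


def aligned_token_count(query_text: str, rule_text: str) -> int:
    # Step 2: align the query tokens against one candidate rule and
    # count the tokens that line up in order.
    matcher = SequenceMatcher(None, tokenize(query_text), tokenize(rule_text),
                              autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())
```

The point of ranking first is to run the more expensive alignment only against the few rules that share enough words with the query.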

With all that, the things you should care about here IMHO are a sequence of LicenseMatch objects https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L109 and creating a new filter function like this one https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L1283 that would weed out the identified false matches. There are several examples of these functions.

Then you can add this function in refine_matches() https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L1442
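As a rough picture of what such a filter and its place in a refinement pipeline might look like, here is a self-contained sketch. It assumes each filter takes a list of matches and returns a `(kept, discarded)` pair. The `Match` tuple, `filter_license_list_runs` and `refine` are hypothetical names for this example and are not the actual ScanCode functions or signatures.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Match(NamedTuple):
    # Hypothetical stand-in for a license match.
    license_expression: str
    start_line: int


MatchFilter = Callable[[List[Match]], Tuple[List[Match], List[Match]]]


def filter_license_list_runs(matches: List[Match], min_run: int = 5):
    # Discard runs of at least min_run single-key matches found on the
    # same or directly consecutive lines; keep everything else.
    kept, discarded, run = [], [], []

    def flush():
        (discarded if len(run) >= min_run else kept).extend(run)
        run.clear()

    for match in sorted(matches, key=lambda m: m.start_line):
        is_single = len(match.license_expression.split()) == 1
        if run and (not is_single or match.start_line - run[-1].start_line > 1):
            flush()
        if is_single:
            run.append(match)
        else:
            kept.append(match)
    flush()
    return kept, discarded


def refine(matches: List[Match], filters: Sequence[MatchFilter]):
    # Apply each filter in turn, collecting everything that was discarded.
    all_discarded = []
    for match_filter in filters:
        matches, discarded = match_filter(matches)
        all_discarded.extend(discarded)
    return matches, all_discarded
```

Returning the discarded matches alongside the kept ones makes it easy to inspect what a filter removed while tuning a threshold such as `min_run`.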

yichong96 · 2021-09-29T02:53:54Z

@pombredanne Thank you for taking the time to explain :-) Yup agreed that filtering matches is a starting point.

pombredanne · 2021-09-29T06:34:44Z

@yichong96 re:

Thank you for taking the time to explain

Actually, thank you for taking the time to dig into this and study and understand this mess! ;) That's much appreciated.

yichong96 · 2021-09-29T06:46:17Z

@pombredanne with regard to issue #1032, the automaton matches SPDX rule files to these SPDX IDs. There are many such SPDX rule files, such as spdx_license_id_spl-1.0_for_spl-1.0.RULE, that contain only the license ID. Is it possible to only include these SPDX rule files when a "spdx identifier : spdx ID" line appears? It seems that matching just the SPDX license ID to the query file without any context does not really give an indication of the license it is using. I'm not sure how many false negatives this might introduce though.

Perhaps another way would be to define a certain window of tokens before and after the token positions of the found license and then classify whether this string is code or plain text. If it is code then probably it is not a valid license. I just saw your reply on the thread here: https://github.com/nexB/scancode-toolkit/issues/2304#issuecomment-718676722. It seems that there were some considerations on this before.
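A very crude version of the window idea might look like the sketch below. It is only an illustration: it works on characters rather than tokens, the "code or plain text" decision is a simple punctuation ratio rather than a real classifier, and the character set, window width and threshold are all untuned assumptions.

```python
# Characters that are common in source code and rare in license prose.
CODE_CHARS = set('{}()[];=<>"\'')


def context_window(text: str, start: int, end: int, width: int = 40) -> str:
    # Return the matched span plus up to `width` characters on each side.
    return text[max(0, start - width):min(len(text), end + width)]


def looks_like_code(snippet: str, threshold: float = 0.08) -> bool:
    # Treat the snippet as code when code-like punctuation makes up
    # at least `threshold` of its non-whitespace characters.
    visible = [c for c in snippet if not c.isspace()]
    if not visible:
        return False
    ratio = sum(c in CODE_CHARS for c in visible) / len(visible)
    return ratio >= threshold
```

On a line like the NuGet example quoted at the top of this issue, the window around "AGPL-1.0" is dense with quotes, braces and parentheses and is classified as code, whereas a sentence such as "This program is licensed under the AGPL-1.0 license." is not.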

yichong96 · 2021-09-30T07:29:55Z

Here the key issue stems from sequences of license matches that are typically:

only to short license references and tags (eg is_license_reference or is_license_tag rules)
matched rules are usually for a single license key and not full expressions
usually one match per line, with few or no lines in between
or one match after the other on the same line and that match to the same license key
usually where the matched texts are sorted in alphabetical order

Hi @pombredanne, I would like to clarify some of the points made in this reply.

only to short license references and tags (eg is_license_reference or is_license_tag rules)

I think this example is quite clear to me. It is illustrated in NuGetLicenseData.cs.txt and in the files mentioned in #1032.

matched rules are usually for a single license key and not full expressions

Does "for a single license key" mean that there are multiple sequences of LicenseMatch with the same license_expression? What does "not full expressions" mean?

usually one match per line, with few or no line in between
or one match after the other on the same line and that match to the same license key

I think these 2 points refer to the file in https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js for the var examples portion of the code right ?

usually where the matched text are sorted in alphabetical order

https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js somewhat illustrates this point starting from the "AGPL 3" element in the var examples list ?

Thank you @pombredanne !

pombredanne · 2021-10-08T07:25:05Z

Does "for a single license key" mean that there are multiple sequences of LicenseMatch with the same license_expression? What does not full expression mean?

This is rather that a LicenseMatch would have to be for a single license, e.g. the rule should be for something like mit but not something for mit OR gpl-2.0.

Also to take into account would be the fact that matches are mostly to rules tagged with the is_license_reference or is_license_tag flags... We could also introduce a specific rule flag like is_license_name to tag short rules that are just for a license key or name

I think these 2 points refer to the file in https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js for the var examples portion of the code right ?

yes, that's a typical example.

https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js somewhat illustrates this point starting from the "AGPL 3" element in the var examples list ?

yes :)

pombredanne · 2021-10-08T07:26:44Z

@sschuberth I know this is something you raised in https://github.com/oss-review-toolkit/ort/wiki/Developer-Meeting

Note that https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js is currently NOT detected correctly in the latest ScanCode... this ticket here is to make this better and more robust

pombredanne added license scan new feature improve-license-detection labels Aug 20, 2021

pombredanne mentioned this issue Aug 23, 2021

Review license detections of Chromium #2658

Open

pombredanne mentioned this issue Mar 5, 2022

RFC: a plan for false positive license detection #2878

Open
