Improve false positive license detection for license lists #2651
Comments
Another example is https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js |
Hi @pombredanne, may I work on this issue? I am a recent Computer Science graduate at NUS and I would like to contribute to this project :-) |
@yichong96 Yes and thank you! |
https://github.com/spdx/license-list-data/tree/master/json is another source of false positives, e.g. lists of licenses. Here the key issue stems from sequences of license matches that are typically:
The starting point is a sequence of license matches for a file. I would likely start with something that walks over a list of matches, taking 5 or 6 at a time, and checks the properties above. |
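A minimal sketch of the "walk over the matches 5 or 6 at a time" idea follows. Everything in it is an assumption for illustration: `Match` is a hypothetical stand-in and not ScanCode's `LicenseMatch`, and since the properties to check are not spelled out in this thread, `looks_like_list` uses two guessed ones (every match is short, and consecutive matches sit on adjacent lines).

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Set


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for a license match (not ScanCode's LicenseMatch)."""
    license_key: str
    start_line: int
    end_line: int
    matched_length: int  # number of matched tokens


def looks_like_list(window: Sequence[Match], max_length: int = 4, max_gap: int = 1) -> bool:
    """Assumed heuristic: all matches are short and packed closely together."""
    all_short = all(m.matched_length <= max_length for m in window)
    all_close = all(
        nxt.start_line - cur.end_line <= max_gap
        for cur, nxt in zip(window, window[1:])
    )
    return all_short and all_close


def flag_license_lists(
    matches: Sequence[Match],
    size: int = 5,
    predicate: Callable[[Sequence[Match]], bool] = looks_like_list,
) -> Set[int]:
    """Slide a window of `size` matches and collect the indexes of every
    match that belongs to at least one window accepted by `predicate`."""
    flagged: Set[int] = set()
    for i in range(len(matches) - size + 1):
        if predicate(matches[i:i + size]):
            flagged.update(range(i, i + size))
    return flagged
```

Passing the predicate as a parameter keeps the window walk separate from the properties being checked, so the real properties can be swapped in once they are settled.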
Hi @pombredanne thank you for your guidance. Am still trying to understand the code base. I would like to ask a question regarding license matching. In the Regarding Hash matching of licenses, am I right to say that the only way whereby there is a match will be the case when the |
The main entry point is Index.match() not Index.match_query() ... see in https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L781 But in all case, the The actual processing goes rather through these steps:
B. Typically also do approximate matching in https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/index.py#L894
The approximate matching is a two-step process: inverted index matching using "sets"/bags of words for ranking, then multiple sequence alignment based on this ranking. With all that, the things you should care about here IMHO are a sequence of LicenseMatch https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L109 and creating a new filter function like this one https://github.com/nexB/scancode-toolkit/blob/152abdaa73d1b8203a6f3cc6057d6c31c7c49e2b/src/licensedcode/match.py#L1283 that would weed out the identified false matches. There are several examples of these functions. Then you can add this function in |
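To illustrate what such a filter function could look like, here is a hedged sketch. It does not reproduce the functions linked above: `Match`, `filter_license_list_matches`, the `(kept, discarded)` return shape, and the thresholds are all assumptions made for this example. It discards any run of at least `min_run` consecutive short matches and keeps everything else.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for a license match (not ScanCode's LicenseMatch)."""
    license_key: str
    matched_length: int  # number of matched tokens


def filter_license_list_matches(
    matches: Sequence[Match],
    min_run: int = 5,     # assumed: how many consecutive short matches make a "list"
    max_length: int = 4,  # assumed: what counts as a "short" match
) -> Tuple[List[Match], List[Match]]:
    """Split matches into (kept, discarded), discarding long runs of short matches."""
    kept: List[Match] = []
    discarded: List[Match] = []
    # groupby yields maximal runs of consecutive matches sharing the same key.
    for is_short, run_iter in groupby(matches, key=lambda m: m.matched_length <= max_length):
        run = list(run_iter)
        if is_short and len(run) >= min_run:
            discarded.extend(run)
        else:
            kept.extend(run)
    return kept, discarded
```

Returning the discarded matches alongside the kept ones makes it easy to inspect what a filter removed while tuning the thresholds.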
@pombredanne Thank you for taking the time to explain :-) Yup agreed that filtering |
@yichong96 re:
Actually, thank you for taking the time to dig into this and to study and understand this mess! ;) That's much appreciated. |
@pombredanne with regard to issue #1032, the automaton matches spdx rule files to these spdx ids. There are many of such spdx rule files such as Perhaps another way would be to define a certain window of tokens before and after the token positions of the found license and then classify whether this string is |
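The "window of tokens before and after" idea can be sketched as below. This only shows how the surrounding tokens could be extracted; how to classify them is left open, as in the comment above. The function name `context_window` and its parameters are hypothetical, and the tokens are assumed to be a plain list of strings with inclusive match positions.

```python
from typing import List, Sequence, Tuple


def context_window(
    tokens: Sequence[str],
    start: int,
    end: int,
    size: int = 5,
) -> Tuple[List[str], List[str]]:
    """Return up to `size` tokens before and after the matched span.

    `start` and `end` are the inclusive token positions of the match
    (an assumption for this sketch).
    """
    before = list(tokens[max(0, start - size):start])
    after = list(tokens[end + 1:end + 1 + size])
    return before, after


# Example: the match is the single token at position 5.
tokens = ["licenses", "=", "[", "mit", ",", "apache-2.0", ",", "gpl-2.0", "]"]
before, after = context_window(tokens, 5, 5, size=2)
```

A classifier could then look at `before` and `after`, for example to see whether the neighbors are themselves license identifiers.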
Hi @pombredanne would like to clarify on some of the points made in this reply.
I think this example is quite clear to me. It is illustrated in the
Does "for a single license key" mean that there are multiple sequences of
I think these 2 points refer to the file in https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js for the
https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js somewhat illustrates this point starting from the "AGPL 3" element in the Thank you @pombredanne ! |
This is rather that a Also to take into account would be the fact that matches are mostly to rules tagged with
yes, that's a typical example.
yes :) |
@sschuberth I know this is something you raised in https://github.com/oss-review-toolkit/ort/wiki/Developer-Meeting Note that https://raw.githubusercontent.com/jslicense/spdx-correct.js/master/test.js is currently NOT detected correctly in the latest ScanCode... this ticket here is to make this better and more robust |
We have lists of license identifiers in code or data files that are being detected and lead to many false positives (FPs). These are typically lists of SPDX identifiers and are mostly found in license-related tools or package management tools. But these tools are seen everywhere.
See this attached example code from NuGet: NuGetLicenseData.cs.txt with long lists like this:
For now, adding new "false positive" license detection rules has been the solution to deal with list of license keys such as
The problem is a bit related to
... in the sense that the context in which we find a license matters: for bare words, the case, being surrounded by gibberish, or being found in a binary may each be a false positive clue.
For instance, the fact that we are in code literals like in #2502 could be a clue that this is an FP.
We should find a better way than just adding many new false positive rules like in 5f39252 and #2505 ... some ideas: