RFC: a plan for false positive license detection #2878
Comments
Hi Philippe, Following up on our ORT community meeting two weeks ago, where we touched base on false positive license detection in version v30.1.0: Porsche AG OSO has also consolidated a report of false-positive cases. Please find the report attached for your review. CC: @sschuberth |
Thank you for taking action here. Another interesting example is okhttp3, because the false positive found there was actually introduced by yourself, @pombredanne ;-) : square/okhttp#4569. The file in my example is https://github.com/square/okhttp/blob/parent-5.0.0-alpha.3/okhttp/src/main/resources/okhttp3/internal/publicsuffix/NOTICE and the license LicenseRef-scancode-unknown-license-reference is found in line 4: "It is subject to the terms of the Mozilla Public License, v. 2.0:" I don't understand why MPL v. 2.0 is not recognized correctly in this case. |
I created a Python parser that can parse the evaluated-model.json file created by the ORT Reporter. |
@porsche-rishisaxena Thank you ++ for the list of false positives in #2878 (comment) ... this is great and actionable! |
These two look like basic license-related clues, but they are not real license statements. Here is the detection I get:
This is weird, and I got them the same way in both cases. Could it be that ORT handles things differently in these cases?
Oh well.... as the saying goes, "no good deed goes unpunished!" https://raw.githubusercontent.com/square/okhttp/parent-5.0.0-alpha.3/okhttp/src/main/resources/okhttp3/internal/publicsuffix/NOTICE scans this way:
You will note that I am using these command line options: Overall, this looks like a case where we could merge the license intro "subject to the terms" with a following notice. |
This is great! Ideally what I would need is a script that would fetch the code. With that I could run |
@porsche-rishisaxena re: #2878 (comment) The CSV is super useful and I can derive a script to automate rescanning from this too. In your case and in @PatteSI's case, creating this data required a lot of (useful) work. |
ORT is not handling findings in binaries or sources differently per se, and is taking ScanCode findings mostly as-is (except some post-processing to remedy #2873). But it might be that some project-specific path excludes were applied in that particular case. |
For ORT, if false-positives were addressed via package configurations, we could quite easily extract the @fviernau, is that something that could be done from HERE's (probably massive) amount of package configurations? |
If there is something that can be shared, it could be used to fix many of these false positives at once! :) |
@pombredanne I attach the false positives from a bunch of HERE curations, as produced by @PatteSI's script. |
To be frank, I believe having a live call with all reporters of the false positives mentioned here would be overkill. Also, I guess most people don't care too much how their issue is fixed as long as it is fixed. From my side, however, I'd strongly vote against hard-coding just the reported cases as false positives. Instead, we should
|
Be careful to not fall into the "perfect is the enemy of good" trap. If trying to avoid the false positives from happening in the first place significantly complicates the code (making it harder to maintain/change/etc.) then I don't see a problem with hardcoding the false positives. But this depends on how many of the results are false positives. @pombredanne do you have an idea of the scale of false positives? How many results are false positives? 1%? 10%? 0.0000001%? |
That's exactly the point. Based on feedback from ORT users, the false-positive rate for at least the "free-unknown", "unknown-license-reference" and "proprietary-license" license keys seems to be close to 100%. |
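To put a number on the question of scale, the per-key rate could be computed from any reviewed list of findings. The following is only a minimal sketch: the `(license_key, is_false_positive)` pair shape and the function name are assumptions, not the format of any report attached in this thread, and the input is toy data rather than real measurements.

```python
from collections import Counter


def false_positive_rate_by_key(findings):
    """Return {license_key: share of findings reviewed as false positives}.

    `findings` is an iterable of (license_key, is_false_positive) pairs.
    This shape is hypothetical; adapt it to the CSV or JSON at hand.
    """
    totals = Counter()
    false_positives = Counter()
    for license_key, is_false_positive in findings:
        totals[license_key] += 1
        if is_false_positive:
            false_positives[license_key] += 1
    return {key: false_positives[key] / totals[key] for key in totals}


# Toy input for illustration only, not real measurements.
reviewed = [
    ("unknown-license-reference", True),
    ("unknown-license-reference", True),
    ("proprietary-license", True),
    ("proprietary-license", False),
    ("mit", False),
]
rates = false_positive_rate_by_key(reviewed)
```

Running this over the manually reviewed lists shared here would show whether the rate for a given key is closer to 1% or to 100%.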
@sschuberth re:
My hunch and anecdotal evidence is that this is surely not close to 100% by a large margin. But I may be wrong. |
I could not identify a single "true" finding in our projects so that there was an actual hint to an "unknown license". You have 2 big text files here in this thread from 2 different persons with potential false positive findings that are categorized as "unknown" license references. I challenge you to show me a real finding in those files that are actual true hints to an unknown license behind some of those results. |
I guess that there might be an interpretation issue here about what "unknown" means. Is it "not a known open source license" or "a license that ScanCode couldn't determine"? Correct me if I am wrong, but I think that you mean the former. For me it is definitely the latter. Could you clarify? Depending on what is meant, there are different solutions (if there are any). It might be good to look at what other scanners did. A good example is Ninka, which is no longer maintained, but which I used extensively quite a few years ago. The goal of Ninka was not to detect as many licenses as possible, but to detect them with high fidelity. If Ninka wasn't very sure about a license, it would throw its hands up, say "I don't know" and report the license as "unknown". FOSSology, on the other hand, would report a license, but could be completely wrong for those files. So what it comes down to, in my opinion: do you want to have licenses reported with high fidelity, at the cost of a bigger number of "unknown", or do you want to have a license reported with lower fidelity but very few "unknown"? |
Hi @pombredanne NoLicenseDetection-report.xlsx CC: @sschuberth |
@PatteSI Thank you ++ that's super valuable input. |
Thanks. Super useful too. |
It's not about "open source" licenses. I am pretty sure we can detect almost all known open source licenses. It's about "hints" to "unknown" (usually proprietary) licenses. As far as I know, there is no standard for the wording that places a source code file under some arbitrary license. I am not even sure one has to use the word "license" to do so. So if some troll wanted to place certain parts of the code under a proprietary license while the rest of the project is under a different, known license, he could do that with wording or a weird character encoding that hides this section from automatic detection. So I would argue that we always have to make a trade-off when we talk about "unknown" licenses, as it will never be possible to catch 100% of them. Like you said, we need to rely on heuristics that hopefully trigger on the wording someone uses when announcing a proprietary license (not yet available in any database), while keeping high fidelity in such findings. In the end, the end user should be able to decide how many of those "unknown" hints he wants to see. Some projects require very high fidelity on their license usage, while others don't and also lack the capacity to check that many findings. |
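One way to picture such an end-user knob is a post-processing filter with a tunable threshold. This is a hedged sketch only: the match dictionaries with `key` and `matched_length` fields, the function name and the default threshold are all assumptions made for illustration, not ScanCode's or ORT's actual data model. Only the three license keys come from this thread.

```python
# Keys reported in this thread as mostly false positives.
UNKNOWN_LIKE_KEYS = frozenset(
    {"free-unknown", "unknown-license-reference", "proprietary-license"}
)


def split_findings(matches, min_matched_words=4, unknown_keys=UNKNOWN_LIKE_KEYS):
    """Split matches into (kept, hints).

    A match goes to `hints` when its key is one of the unknown-like keys
    AND it matched fewer than `min_matched_words` words. Everything else
    is kept as a regular finding. The field names are hypothetical.
    """
    kept, hints = [], []
    for match in matches:
        is_weak_unknown = (
            match["key"] in unknown_keys
            and match["matched_length"] < min_matched_words
        )
        (hints if is_weak_unknown else kept).append(match)
    return kept, hints


# Toy matches for illustration only.
matches = [
    {"key": "apache-2.0", "matched_length": 2},
    {"key": "unknown-license-reference", "matched_length": 2},
    {"key": "proprietary-license", "matched_length": 40},
]
kept, hints = split_findings(matches)
```

A project that needs very high fidelity could lower `min_matched_words` to see more hints, while a project without the capacity to review them could raise it.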
So the core question really is: what do you think "unknown license" means? Is it "there is a license but scancode doesn't know which one because it is not in its knowledgebase" (whether or not it is open or closed) or "scancode couldn't detect which license it is and threw its hands up"? This is conceptually a big difference. |
I think this discussion deviates a bit from the original problem here. It doesn't matter what anyone thinks "unknown license" means. We are discussing how the heuristics could be improved and how to give end users more options to evaluate findings. I am talking here as an end user of ORT, which uses ScanCode as a scanner. We have started to migrate from NexusIQ to ORT, and we sometimes see hundreds of those "unknown-license" findings in big projects. There are many examples of trivial findings where the heuristics/rules used in ScanCode are just too broad and get triggered by simple comments using the word "license". We are not only discussing the general problem of how to improve the rule-based findings. There are many examples given in the first post. It's not only about "unknown" licenses. |
I have attached a presentation that summarizes the issue: |
@richardfontana I would be interested to get some feedback too |
@sutula may find this of interest |
On the topic of making the current rule set a bit more stringent: Proposal: doing an "automated" retro-fit of all rules to include SPDX identifiers in Some thoughts for this update:
cd src/scancode-toolkit/src/licensedcode/data/rules
for identifier in `tac ~/tmp/spdx_identifier.list`;
do
echo $identifier;
for rule in `egrep -l '([^A-Z]|^)('$identifier')([^A-Z]|$)' *.RULE`;
do
sed -i -E s/\(\[\^A-Z\{\{\\]\|\^\)\($identifier\)\(\[\^A-Z\}\}\]\|\$\)/\\1\{\{$identifier\}\}\\3/g $rule;
done;
done
The one-liner above makes changes to about 6400/31000 rules. |
@alext34ms re:
Sleek! very smart. I like it
I do not think it would have any impact.
These should be left alone. These are rules that can be matched only exactly and that are about licenses but are NOT license notices or texts. They should be used sparingly as a last resort. For instance, this text:
is NOT a GPL-related notice, but some commentary about the GPL license and would be a typical case for a "false positive" rule.
It should, but it may also degrade and miss some matches in a few corner cases. These could be caught separately by the
It will likely make some fail.
I think the approach could be refined using a Python script, since we already have code that handles the RULEs and knows all SPDX licenses, and we could also expand this to a few more things:
Some example scripts to use as a base are in https://github.com/nexB/scancode-toolkit/blob/develop/etc/scripts/licenses/ |
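As a starting point for such a Python refinement, the core of the shell loop can be sketched as a pure function over the rule text. This is a sketch under stated assumptions: the function name is hypothetical, it works on strings rather than on .RULE files, it does not use ScanCode's own rule-handling code, and its boundary checks are stricter than the shell version (which only excludes neighbouring uppercase letters).

```python
import re


def wrap_spdx_identifiers(rule_text, identifiers):
    """Wrap standalone SPDX identifiers in {{ }} as the shell loop does.

    Longer identifiers are tried first so that "GPL-2.0-or-later" wins
    over "GPL-2.0". Identifiers glued to letters, digits or hyphens, or
    already preceded by "{", are left alone, which makes the function
    safe to run twice on the same text.
    """
    if not identifiers:
        return rule_text
    alternatives = "|".join(
        re.escape(identifier)
        for identifier in sorted(identifiers, key=len, reverse=True)
    )
    pattern = re.compile(
        r"(?<![A-Za-z0-9{-])(?:" + alternatives + r")(?![A-Za-z0-9}-]|\.[0-9])"
    )
    return pattern.sub(lambda m: "{{" + m.group(0) + "}}", rule_text)


spdx_ids = ["GPL-2.0", "GPL-2.0-or-later", "MIT"]
wrapped = wrap_spdx_identifiers("Licensed under GPL-2.0-or-later or MIT.", spdx_ids)
```

Doing the substitution in a single pass avoids re-wrapping an identifier that is a prefix of one already wrapped. A real script would still need to read and write the rule files and re-run the license tests afterwards.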
Hi Philippe, all, First things first: thanks for the good work people! You're great!
Some lines (typically {4,5}) are recognised as Would it be helpful to provide a list of false positives / wrong identifications? I'll be happy to provide one if so. |
@borisbaldassari Thanks for your feedback and report. We are working on this and this specific issue of a
This would be extremely helpful, we will use this for testing this new feature extensively, as we are using the other lists contributed here. Thanks a lot! |
Hi @AyanSinhaMahapatra Thanks for the heads-up! I'll wait for the landing and give it a try. :-) Please find below a list of unknown-license false positives found in a few Eclipse projects (Che, JGit, CDT, Tycho). If needed I can analyse more projects -- but since we're using ORT we don't have direct access to the ScanCode output, so I need to run it separately (and manually). |
Please also find attached the Python script used to generate the csv's, if it's useful. |
@borisbaldassari Thank you ++ |
Another short SSPL false positive #2975 |
As @pombredanne also asked in my initial issue for a list of false positives, I just wanted to mention that the ORT community has also started sharing curations for false positives. I guess ScanCode is still one of the most widely used scanning components in ORT, so they might all be relevant for you. Check out their curations and package configurations: https://github.com/oss-review-toolkit/ort-config |
Context
We are reporting too many false positive licenses. We need to fix this!
Problem
There are several kinds of false positives, yet they boil down to these types:
False detection of very short and weak license detection rules detected exactly, such as:
GPL in a binary: Tracing "Start Line" of ScanCode report back to the Binary file. #2874
may not be modified in: False-positive proprietary-license finding in Guava source code #2865
Detection of a license text or notice fragment which is too weak to represent a bona fide license detection alone.
Detection of longer unknown license references such as
Lack of proper detection of a structured license tag found in a package manifest which is returned as an unknown license
When fragments of the same license are detected with only copyrights added in between as in license detection: Add the nunit license #2859
When sequences of SPDX license ids are found in license detection tools
Please add yours!
Solution elements
We could treat and report separately mere clues such as this one: they could be an interesting insight in some cases, but alone they are too weak to be considered a license detection
The upcoming two-step process, where license matches are grouped into a license detection, is another approach to consider. We could detect patterns of license matches that could be resolved into a detection, for instance a license intro followed by a license notice.
The scancode-analyzer heuristics and ML-based detection of false positives is another way
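The first two elements can be sketched together as one grouping pass: an intro is merged into the match that follows it, and an intro with nothing after it is reported only as a clue. This is a rough illustration, not the planned implementation. The match dictionaries, their field names (`key`, `is_license_intro`, `start_line`, `end_line`) and the function name are assumptions, the line numbers in the example are made up, and no check is made on how far apart the two matches are.

```python
def group_matches(matches):
    """Group ordered license matches into (detections, clues).

    A match flagged as a license intro is merged with the next non-intro
    match: the detection takes the license of that match and spans both.
    An intro that is never followed by another match stays a mere clue.
    All field names are hypothetical.
    """
    detections, clues = [], []
    pending_intro = None
    for match in matches:
        if match["is_license_intro"]:
            if pending_intro is not None:
                # Two intros in a row: the first one resolved to nothing.
                clues.append(pending_intro)
            pending_intro = match
            continue
        lines = (match["start_line"], match["end_line"])
        if pending_intro is not None:
            lines = (pending_intro["start_line"], match["end_line"])
            pending_intro = None
        detections.append({"license": match["key"], "lines": lines})
    if pending_intro is not None:
        clues.append(pending_intro)
    return detections, clues


# Made-up matches, loosely modelled on the okhttp NOTICE case above.
notice_matches = [
    {"key": "unknown-license-reference", "is_license_intro": True,
     "start_line": 4, "end_line": 4},
    {"key": "mpl-2.0", "is_license_intro": False,
     "start_line": 5, "end_line": 6},
]
detections, clues = group_matches(notice_matches)
```

With this grouping, an "unknown" intro that is followed by a recognized notice no longer surfaces as a separate finding, while a lone intro is still available as an insight for those who want it.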