Detect unknown licenses #1675 #2592
Signed-off-by: akugarg <[email protected]>
190554b to aecab91
Thanks! See some nitpicks for your review.
Thanks! There are a few tiny nitpicks for your consideration, and then we can merge!
Signed-off-by: akugarg <[email protected]>
962d1f5 to de2e0d0
This should add some tests on licenses that ideally ScanCode would never add (some proprietary/SDK licenses), to check the accuracy of unknown license detection. Linking some of them here:
Rather than using a function argument
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
This is a batch of misc. license detection rules and a new license. These have been mostly found thanks to the upcoming unknown license detection.
Signed-off-by: Philippe Ombredanne <[email protected]>
This makes it available in the CLI.
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
This is cleaner and more composable than old-style interpolation.
Signed-off-by: Philippe Ombredanne <[email protected]>
- Add split_weak_matches() function to pre-filter weak unknown matches.
- Make unknown matches eligible for filter_spurious_matches() and lower minimum density to 0.6=5 for longer matches.
- Move the call to filter_spurious_matches() earlier in the refine pipeline.
- Add new filter_invalid_contained_unknown_matches() function to discard unknown matches found inside the matched query region of larger regular matches.
- Extract get_full_qspan_matched_text() function from get_full_matched_text() to improve reusability. This is designed to be called when crafting new rules based on a match (which is what is done with unknown matches).
- Use format for matched license text highlight. This is cleaner and more composable than old-style interpolation.
- Improve debug tracing of matched texts.
- Apply other minor refactoring and doc improvements.
Signed-off-by: Philippe Ombredanne <[email protected]>
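For reviewers, here is a minimal sketch of the idea behind discarding unknown matches that sit inside the query region of a larger regular match. It is not the code from this PR: the `Match` class and `discard_contained_unknown()` are hypothetical names, and matches are reduced to inclusive start/end token positions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for a license match over query token positions."""
    start: int  # first matched query token position (inclusive)
    end: int  # last matched query token position (inclusive)
    is_unknown: bool = False


def discard_contained_unknown(matches):
    """Return matches without the unknown ones that are fully contained in
    the query region of a regular (non-unknown) match."""
    regular = [m for m in matches if not m.is_unknown]
    kept = []
    for m in matches:
        contained = m.is_unknown and any(
            r.start <= m.start and m.end <= r.end for r in regular
        )
        if not contained:
            kept.append(m)
    return kept


matches = [
    Match(0, 50),  # regular match
    Match(10, 20, is_unknown=True),  # inside the regular match: discarded
    Match(60, 80, is_unknown=True),  # outside any regular match: kept
]
filtered = discard_contained_unknown(matches)
```

The quadratic scan is fine for a sketch; a real pipeline would likely sort matches or reuse existing span containment helpers.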
Create unique rule id based on a checksum of the rule content.
Also improve key phrases parsing for dangling {{ {{ braces.
Signed-off-by: Philippe Ombredanne <[email protected]>
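As an illustration of a checksum-based rule id, here is a small sketch. The function name, the `unknown-` prefix, the whitespace normalization, and the choice of SHA-1 are all assumptions made for the example; the commit does not state which checksum or id format is used.

```python
import hashlib


def unknown_rule_identifier(text, prefix="unknown-"):
    """Build a deterministic id from the rule text (hypothetical helper).

    Whitespace is collapsed first so that the same content always yields
    the same id regardless of line wrapping.
    """
    normalized = " ".join(text.split())
    digest = hashlib.sha1(normalized.encode("utf-8")).hexdigest()
    return f"{prefix}{digest}"


rule_id = unknown_rule_identifier("This software is licensed under terms\nnot yet known.")
```

Deriving the id from the content means two synthetic rules built from the same matched text share one id, which keeps results stable across runs.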
Create proper synthetic Rule and LicenseMatch on match and return a match or None.
Include unknown licenses matching as an option to Index.match.
Add tests.
Use shorter ngrams of length 6 rather than 7 for better sensitivity. This is balanced by the addition of filters:
- Filter weak unknown matches at match time
- Filter out several weak unknown ngrams at index time
Signed-off-by: Philippe Ombredanne <[email protected]>
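To make the n-gram approach concrete, here is a rough sketch of matching with word n-grams of length 6. It is a simplification under stated assumptions, not the implementation in this PR: tokenization is a plain lowercase whitespace split, the index is a plain set, and `ngrams()`, `build_index()`, and `matched_positions()` are hypothetical names. The weak-match and weak-ngram filters mentioned above are left out.

```python
def ngrams(tokens, n=6):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_index(texts, n=6):
    """Collect the n-grams of known license-like texts into a set."""
    index = set()
    for text in texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def matched_positions(query, index, n=6):
    """Return the sorted query token positions covered by an indexed n-gram."""
    tokens = query.lower().split()
    covered = set()
    for i, gram in enumerate(ngrams(tokens, n)):
        if gram in index:
            covered.update(range(i, i + n))
    return sorted(covered)


index = build_index(["permission is hereby granted free of charge to any person"])
positions = matched_positions(
    "foo bar permission is hereby granted free of charge baz", index
)
# positions covers tokens 2 through 8, the run shared with the indexed text
```

Shorter n-grams hit more often, which is why the commit pairs the move from 7 to 6 with extra filtering.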
Signed-off-by: Philippe Ombredanne <[email protected]>
@akugarg @AyanSinhaMahapatra your review is welcome!
Note that you can ignore most of the rule additions; they should have been in another PR, but I could not resist adding them... I forced an unknown license detection on our whole test suite and this helped spot several issues that I fixed on the fly. :P
Note that this is NOT YET returned in the API and outputs.
Signed-off-by: Philippe Ombredanne <[email protected]>
I am merging now. Your feedback is still mucho welcomed on the refinements,
Signed-off-by: akugarg [email protected]
This PR introduces a new, effective way to detect unknown licenses using n-grams.
Refer to #1675 for more details.
Tasks
Run tests locally to check for errors.