Detect unknown licenses #1675 #2592

Merged
merged 16 commits into from
Jan 8, 2022

akugarg
Collaborator

@akugarg akugarg commented Jul 13, 2021

Signed-off-by: akugarg [email protected]
This PR introduces a new, effective way to detect unknown licenses using n-grams.
Refer to #1675 for more details.
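
As a rough illustration of the n-gram idea (a sketch only, not the code in this PR): index every run of n consecutive words found in known license texts, then mark the query positions covered by any indexed n-gram. The function names, the whitespace tokenizer and the small n used here are all assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (tuples of n consecutive tokens)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_ngram_index(license_texts, n):
    """Collect the set of all word n-grams seen in known license texts."""
    index = set()
    for text in license_texts:
        index.update(ngrams(text.lower().split(), n))
    return index


def matched_positions(query, index, n):
    """Return sorted query token positions covered by an indexed n-gram."""
    tokens = query.lower().split()
    covered = set()
    for start, gram in enumerate(ngrams(tokens, n)):
        if gram in index:
            covered.update(range(start, start + n))
    return sorted(covered)


# Hypothetical data: one known license sentence and one unseen SDK notice.
known = ["permission is hereby granted free of charge to any person obtaining a copy"]
index = build_ngram_index(known, n=3)
query = "this sdk is provided and permission is hereby granted to registered users only"
positions = matched_positions(query, index, n=3)
```

Here `positions` covers the words "permission is hereby granted" in the query: license-like wording is flagged even though the text as a whole matches no known license.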

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 📑 and links the original issue above 🔗
  • Tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in a uniquely-named feature branch and have no merge conflicts 📁

@akugarg akugarg force-pushed the improve_license_detection branch from 190554b to aecab91 Compare July 13, 2021 11:37
Member

@pombredanne pombredanne left a comment

Thanks... See some nitpicks for your review.

src/licensedcode/index.py Outdated Show resolved Hide resolved
src/licensedcode/match_unknown.py Outdated Show resolved Hide resolved
src/licensedcode/match_unknown.py Show resolved Hide resolved
src/licensedcode/match_unknown.py Outdated Show resolved Hide resolved
Member

@pombredanne pombredanne left a comment

Thanks! There are a few tiny nitpicks for your consideration, and then we can merge!

src/licensedcode/index.py Outdated Show resolved Hide resolved
src/licensedcode/index.py Outdated Show resolved Hide resolved
src/licensedcode/index.py Outdated Show resolved Hide resolved
@akugarg akugarg force-pushed the improve_license_detection branch from 962d1f5 to de2e0d0 Compare August 30, 2021 17:28
@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Sep 20, 2021

This should add some tests on licenses that ideally scancode would never add (some proprietary/sdk licenses), to check the accuracy of unknown license detection.

Linking some of them here:

@pombredanne pombredanne added this to the v31.0 milestone Sep 24, 2021
Rather than using a function argument

Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
This is a batch of misc. license detection rules and a new license.
These have been mostly found thanks to the upcoming unknown license
detection.

Signed-off-by: Philippe Ombredanne <[email protected]>
This makes it available in the CLI

Signed-off-by: Philippe Ombredanne <[email protected]>
Signed-off-by: Philippe Ombredanne <[email protected]>
This is cleaner and more composable than old-style interpolation

Signed-off-by: Philippe Ombredanne <[email protected]>
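
A minimal sketch of what a `str.format`-based highlight could look like, assuming the template is a plain string with one `{}` placeholder. The `highlight` helper and its parameters are hypothetical and are not taken from the PR.

```python
def highlight(tokens, matched_positions, template='[{}]'):
    """Wrap each matched token with `template` and join tokens with spaces."""
    return ' '.join(
        template.format(token) if pos in matched_positions else token
        for pos, token in enumerate(tokens)
    )


tokens = 'licensed under the acme sdk terms'.split()
# The same function serves plain-text and HTML output by swapping the template.
plain = highlight(tokens, {3, 4})
html = highlight(tokens, {3, 4}, template='<b>{}</b>')
```

Because the template is just an argument, callers can compose different output styles without touching the highlighting logic.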
Add split_weak_matches() function to pre-filter weak unknown matches.
Make unknown matches eligible for filter_spurious_matches() and lower
minimum density to 0.6=5 for longer matches. Move the call for
filter_spurious_matches() earlier in the refine pipeline.

Add new filter_invalid_contained_unknown_matches() function to discard
unknown matches found inside the matched query region of larger regular
matches.

Extract get_full_qspan_matched_text() function from
get_full_matched_text() to improve reusability. This is designed to be
called when crafting new rules based on a match (which is what is done
with unknown matches).

Use format for matched license text highlight
This is cleaner and more composable than old-style interpolation.

Improve debug tracing of matched texts.

Apply other minor refactoring and doc improvements.

Signed-off-by: Philippe Ombredanne <[email protected]>
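
The containment filter described in this commit message can be pictured with the sketch below. It assumes a simplified, hypothetical `Match` type holding an inclusive query token range; the real code works on ScanCode's own match and span objects, which are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """Hypothetical stand-in for a license match over a query token range."""
    start: int  # first matched query position, inclusive
    end: int    # last matched query position, inclusive
    is_unknown: bool = False


def discard_contained_unknown(matches):
    """Drop unknown matches whose range lies entirely inside a regular match."""
    regular = [m for m in matches if not m.is_unknown]
    kept = []
    for m in matches:
        contained = m.is_unknown and any(
            r.start <= m.start and m.end <= r.end for r in regular
        )
        if not contained:
            kept.append(m)
    return kept


matches = [
    Match(0, 50),                    # regular match
    Match(10, 20, is_unknown=True),  # inside the regular match: discarded
    Match(60, 90, is_unknown=True),  # outside any regular match: kept
]
kept = discard_contained_unknown(matches)
```

The reasoning is that a region already explained by a regular license match does not need a second, weaker "unknown" explanation.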
Create unique rule id based on a checksum of the rule content

Also improve key phrases parsing for dangling {{ {{ braces.

Signed-off-by: Philippe Ombredanne <[email protected]>
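
One way to derive a unique identifier from a checksum of the rule content is sketched here. The hash algorithm, the whitespace normalization and the `unknown-rule` prefix are assumptions for illustration; the PR may use a different scheme.

```python
import hashlib


def rule_identifier(text, prefix='unknown-rule'):
    """Build a stable identifier from a checksum of the rule text."""
    # Assumption: collapse runs of whitespace so formatting-only
    # differences do not produce a different identifier.
    normalized = ' '.join(text.split())
    digest = hashlib.sha1(normalized.encode('utf-8')).hexdigest()
    return f'{prefix}_{digest}'


first = rule_identifier('Licensed under the ACME SDK terms')
same = rule_identifier('Licensed  under the\nACME SDK terms')
other = rule_identifier('Licensed under the ACME SDK conditions')
```

Identical content always yields the same identifier, so a synthetic rule created twice from the same matched text does not get two different ids.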
Create proper synthetic Rule and LicenseMatch on match and return
a match or None.

Include unknown licenses matching as an option to Index.match

Add tests

Use shorter ngrams of length 6 rather than 7 for better sensitivity.
This is balanced by the addition of filters:
- Filter weak unknown matches at match time
- Filter out several weak unknown ngrams at index time

Signed-off-by: Philippe Ombredanne <[email protected]>
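
To see why a shorter n-gram length is more sensitive, consider the toy comparison below. It is an illustrative sketch with made-up data, not the PR's code: a 13-word reference has one word changed in the middle, and we count how many n-grams the two texts still share.

```python
def shared_ngrams(reference, query, n):
    """Count distinct n-grams that occur in both token sequences."""
    def grams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return len(grams(reference) & grams(query))


reference = (
    'redistribution and use in source binary forms '
    'with or without modification are permitted'
).split()

# Same text with a single word changed in the middle (position 6).
query = list(reference)
query[6] = 'form'

hits_with_6 = shared_ngrams(reference, query, 6)
hits_with_7 = shared_ngrams(reference, query, 7)
```

With n=7 every n-gram of this short text overlaps the changed word, so nothing matches; with n=6 the n-grams at both ends survive. The price of that extra sensitivity is more weak or accidental hits, which is what the filters listed in the commit message are meant to offset.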
@pombredanne
Member

@akugarg @AyanSinhaMahapatra your review is welcomed!
At this stage the unknown detection works quite nicely IMHO and I am planning to merge this ASAP

@pombredanne
Member

Note that you can ignore most of the rule additions; they should have been in another PR, but I could not resist adding them... I forced an unknown license detection run on our whole test suite and this helped spot several issues that I fixed on the fly. :P

@pombredanne pombredanne changed the title Improve licenses detection accuracy of unknown licenses Detect unknown licenses #1675 Jan 8, 2022
Note that this is NOT YET returned in the API and outputs

Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne
Member

I am merging now. Your feedback is still very welcome on the refinements.

@pombredanne pombredanne merged commit d1e725d into aboutcode-org:develop Jan 8, 2022
3 participants