Follow references to license in another file: ScanCode reports the "SEE LICENSE IN <filename>" text in an NPM package.json as "Unkown" #1364

Closed
sschuberth opened this issue Feb 14, 2019 · 19 comments · Fixed by #2616

Comments

@sschuberth
Collaborator

For non-SPDX / proprietary licenses, NPM suggests using a value of "SEE LICENSE IN <filename>" for the "license" key in package.json. If the license file is called LICENSE, that text ends up being "SEE LICENSE IN LICENSE". At least ScanCode 2.9.7 reports this string as an "Unknown" license when scanning package.json. IMO, that text should not trigger a license finding in this case.
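To make the convention concrete, here is a minimal sketch of how the "SEE LICENSE IN <filename>" value could be pulled out of a package.json. The helper name `referenced_license_file` and the regular expression are assumptions made for illustration; they are not ScanCode code, and matching case-insensitively is a choice of this sketch.

```python
import json
import re

# Hypothetical helper, not part of ScanCode: the name and the regex are assumptions.
SEE_LICENSE_IN = re.compile(
    r"^\s*SEE LICENSE IN\s+(?P<filename>.+?)\s*$", re.IGNORECASE
)


def referenced_license_file(package_json_text):
    """Return the filename referenced by a "SEE LICENSE IN <filename>" value, or None."""
    data = json.loads(package_json_text)
    license_value = data.get("license")
    if not isinstance(license_value, str):
        # Missing key or a non-string value: nothing to dereference here.
        return None
    match = SEE_LICENSE_IN.match(license_value)
    return match.group("filename") if match else None


example = '{"name": "demo", "license": "SEE LICENSE IN LICENSE"}'
print(referenced_license_file(example))
```

A plain SPDX value such as "MIT" returns None, so only the reference form would need special handling.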

@sschuberth
Collaborator Author

/cc @tsteenbe

@pombredanne
Member

pombredanne commented Feb 14, 2019

@sschuberth

> IMO, that text should not trigger a license finding in this case.

I think it should report a license and yes, unknown is not great.
It would be a welcome addition to effectively dereference these "see license" types of references.

I have some plans to actually dereference any such detection in npm and beyond.

For a start, license detection rules have a new field called "referenced_filenames" which is a list of filename references. See https://github.com/nexB/scancode-toolkit/search?l=YAML&q=%22referenced_filenames%22

Based on that, the logic will then be something more or less this way:
If a license is detected and its rule contains "referenced_filenames", check whether the filename exists uniquely in the same directory or in a parent/ancestor directory. If found, report the license detected in that "see file" in the current file as a new match with an appropriate type (and possibly remove the "see file" match if it pointed to an unknown license).
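The lookup described above could be sketched roughly as follows. This is an illustration only, not the ScanCode implementation: `find_referenced_file` is a hypothetical name, and "the nearest directory wins" is an assumption of this sketch about how to pick one file when several ancestors contain the same filename.

```python
from pathlib import Path
import tempfile


def find_referenced_file(referencing_file, filename, root):
    """
    Return the path of `filename` found in the directory of `referencing_file`
    or in its nearest ancestor directory, up to and including `root`.
    Return None if it is not found.
    """
    root = Path(root).resolve()
    current = Path(referencing_file).resolve().parent
    while True:
        candidate = current / filename
        if candidate.is_file():
            return candidate
        # Stop at the scan root, or if we somehow started outside of it.
        if current == root or root not in current.parents:
            return None
        current = current.parent


# Small demonstration in a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "pkg" / "src").mkdir(parents=True)
    (root / "pkg" / "COPYING").write_text("license: apache 2.0\n")
    ref = root / "pkg" / "src" / "ref"
    ref.write_text("This is free software. See COPYING for details.\n")
    found = find_referenced_file(ref, "COPYING", root)
    print(found.relative_to(root.resolve()))
```

Bounding the walk at the scan root keeps the lookup from wandering into unrelated directories above the scanned codebase.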

For special cases such as npm "license" attribute that use a special convention, the logic would be to parse that directly and inject that logic in the code https://github.com/nexB/scancode-toolkit/blob/fd0a95a04658178b8b4e74351bb84a392b618383/src/packagedcode/licensing.py#L68 that does declared license normalization.

@sschuberth
Collaborator Author

> I have some plans to actually dereference any such detection in npm and beyond.

I was briefly thinking about something like that, too, but felt it was overkill in this particular case as usually not only package.json is scanned, but all files in the package's repository / artifact, which includes the referenced file, and hence the license in there would be picked up anyway.

@pombredanne
Member

pombredanne commented Jul 24, 2019

At this stage most license rules have been tagged with a referenced_filenames if they reference such a file. The next step is going to be to follow the referenced filenames in the general cases and in the special case of npm "SEE LICENSE IN" and point to the referenced license instead.

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Jun 28, 2021

@akugarg Here's an example npm package, see https://github.com/mongodb-js/vscode/blob/master/package.json#L34.

Here "license": "SEE LICENSE IN LICENSE.txt", references this file https://github.com/mongodb-js/vscode/blob/master/LICENSE.txt. You can find more here

We have to create a post-scan plugin with the following steps:

  1. First, find all codebase resources with a license match whose rule has a referenced filename.
  2. Discard referenced filenames that cannot be resolved (some Debian references beginning with usr/share/common-licenses/ will not be found in the directory tree).
  3. Try to find the referenced file, first in the same directory and, if it is not there, in the rest of the codebase; do nothing if it is not found.
  4. If found, update the unknown license match found before with the license matches for the referenced file (and maybe add a flag).
  5. Test this with examples from multiple npm packages, where this is found a lot, covering all cases (same directory, other directory, do nothing). You can just copy the two files mentioned above for a unit test.
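The steps above can be sketched on a simplified data model. Everything here is an assumption made for illustration: the scan is a plain dict mapping paths to lists of match dicts, `follow_license_references` and `find_target` are hypothetical names, and the "from_file" flag is invented. The sketch appends the referenced matches rather than replacing the unknown one, which is only one possible reading of step 4.

```python
import posixpath

# Assumption: Debian-style references that will not exist in the scanned tree.
IGNORED_PREFIXES = ("usr/share/common-licenses/",)


def find_target(files, directory, name):
    """Look in the same directory first, then anywhere by exact, unique filename."""
    same_dir = posixpath.join(directory, name)
    if same_dir in files:
        return same_dir
    candidates = [p for p in files if posixpath.basename(p) == name]
    return candidates[0] if len(candidates) == 1 else None


def follow_license_references(files):
    """
    `files` maps a path to a list of match dicts with the keys "key" and
    "referenced_filenames". Return a new mapping in which the matches of a
    referenced file are appended to the referencing file and flagged.
    """
    result = {}
    for path, matches in files.items():
        updated = list(matches)
        directory = posixpath.dirname(path)
        for match in matches:  # step 1: matches with a referenced filename
            for name in match.get("referenced_filenames", []):
                if name.lstrip("/").startswith(IGNORED_PREFIXES):
                    continue  # step 2: references that cannot be resolved
                target = find_target(files, directory, name)
                if target is None or target == path:
                    continue  # step 3: do nothing if not found
                for referenced in files[target]:
                    # step 4: copy the referenced match and flag its origin
                    updated.append({**referenced, "from_file": target})
        result[path] = updated
    return result


scan = {
    "pkg/package.json": [
        {"key": "unknown-license-reference", "referenced_filenames": ["LICENSE.txt"]}
    ],
    "pkg/LICENSE.txt": [{"key": "apache-2.0", "referenced_filenames": []}],
}
resolved = follow_license_references(scan)
print([m["key"] for m in resolved["pkg/package.json"]])
```

Returning a new mapping instead of mutating the input keeps the per-file results intact, so the same data can be used to test the "same directory", "other directory" and "do nothing" cases.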

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Jul 20, 2021

Btw, instead of a post-scan plugin, this could be a process_codebase step in a scan plugin, similar to here.

The get_scanner function for the license plugin is already there. The process_codebase could be defined inside the license plugin as a post processing step, but not a post-scan plugin explicitly.

Also, we can focus on adding this just for files in the same directory, with proper tests, and then move on to adding more complicated cases in packagedcode.

@akugarg
Collaborator

akugarg commented Jul 22, 2021

From chat@gitter

Akanksha:
For the reference part, the step done here akugarg@1a7ea7d#diff-9b8c9108cb6f529dc48ef0aba7a714c7e2b53feac33c7dc91eac423685a194b7R185 will only be present here, right, since we need to check for referenced files for every match?

Ayan:
Why is the check done here? We can do this in the process_codebase function in plugin_license, right, like we discussed the previous day? The entire code can live there, unless I'm missing something?
We already have access to the licenses list and the LicenseMatch objects in it in process_codebase, so why not do it there entirely?
Yes, but this get_licenses is basically an API, which populates each resource with what is returned from the get_scanner function, so here the licenses list is added as the resource_attribute for each resource. That is then available to all process_codebase functions that are called in the same scan plugin, in scan plugins run after it, or in post-scan plugins. So we need not modify it directly in get_licenses, but later in a separate function, which is the process_codebase function.
I.e. yes you can do what you are doing, it would obviously work and do what it is supposed to do. But it will change the functionality of get_licenses. We don't want that as it is used as an API in many other places widely (inside and outside scancode-toolkit). We want this to be added to the plugin_license post-processing step instead, so this step will only be done when the plugin is called in scancode-toolkit directly, and not via the API.

Akanksha:
Also, one point: how will we have access to the "matches", since we would need to get "match.rule.referenced_filenames"?

@AyanSinhaMahapatra
Member

See how we get resources here and from them resource attributes are accessed here.

@akugarg
Collaborator

akugarg commented Jul 28, 2021

akugarg@8681972
I tried this, but how do I get the root_path for a particular match? We need it to search in a particular directory.

@AyanSinhaMahapatra
Member

See here for functions related to the Resource object, including how to get the parent resource. Have a look at the other functions of this class too.

@pombredanne
Member

This should be still open:

  1. the npm part is not implemented
  2. there are issues with respect to adding license matches of referenced files to the main license match with "SEE LICENSE IN"...
  • for now the matches are just copied
  • there is no indication (attribute, or else) that tells me that a match is derived from following the license of a referenced filename. There is no indication either that the referenced filename matches are now copied/added to the referencing place
  • how we should handle these is not entirely clear to me:
    • should these be matches side by side like now?
    • or sub-matches added as an attribute to the referencing match?
    • should the referenced matches be moved rather than copied in the referencing?
    • how to track which file/path a referenced match is for (this is lost for now)? (and then use that for display of matched text?)
    • should the referencing match be removed now that we have referenced matches for it?
    • should the referenced matches be followed even if the match has a low coverage (and possibly the parts about the file references were not matched)?

@pombredanne pombredanne reopened this Aug 29, 2021
@pombredanne
Member

@akugarg @AyanSinhaMahapatra ^ FYI

@pombredanne pombredanne changed the title ScanCode reports the "SEE LICENSE IN <filename>" text in an NPM package.json as "Unkown" Follow references to license another file: ScanCode reports the "SEE LICENSE IN <filename>" text in an NPM package.json as "Unkown" Aug 29, 2021
@pombredanne pombredanne changed the title Follow references to license another file: ScanCode reports the "SEE LICENSE IN <filename>" text in an NPM package.json as "Unkown" Follow references to license in another file: ScanCode reports the "SEE LICENSE IN <filename>" text in an NPM package.json as "Unkown" Aug 29, 2021
pombredanne added a commit that referenced this issue Aug 29, 2021
Only follow license references match an exact filename
In #2616 we introduced matching path of referenced_filenames
based on matching filename or path suffix. This removes path suffix
matching which is problematic.

Before this we were using .endswith(path) and this led to weird and
incorrect license dereferences

Signed-off-by: Philippe Ombredanne <[email protected]>
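As a hedged illustration of why suffix matching can misfire (this is not the ScanCode code from the commit, and the paths are made up), compare `str.endswith` with an exact comparison of the final path segment:

```python
import posixpath

# Made-up paths for illustration only.
paths = ["pkg/COPYING", "pkg/docs/NOT-COPYING", "pkg/vendor/lib/COPYING"]
referenced = "COPYING"

# Suffix matching: also accepts a path whose last segment merely ends with the name.
by_suffix = [p for p in paths if p.endswith(referenced)]

# Exact filename matching: compares the final path segment as a whole.
by_name = [p for p in paths if posixpath.basename(p) == referenced]

print(by_suffix)
print(by_name)
```

With suffix matching, "pkg/docs/NOT-COPYING" is treated as a hit for "COPYING", which is the kind of incorrect dereference an exact filename comparison avoids.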
pombredanne added a commit that referenced this issue Sep 3, 2021
Improve license referenced_filenames handling #1364
pombredanne added a commit that referenced this issue Sep 14, 2021
Signed-off-by: Philippe Ombredanne <[email protected]>
@pombredanne pombredanne added this to the v31.0 milestone Sep 24, 2021
@pombredanne
Member

The new https://github.com/nexB/scancode-toolkit/blob/d1e725d3603a8f96c25f7e3f7595c68999b92a67/src/licensedcode/detection.py is what's needed to complete this.

@sschuberth
Collaborator Author

Out of curiosity, are you planning to make following this kind of reference work only when ScanCode is run with -p, or should it also work without -p?

@pombredanne
Member

> Out of curiosity, are you planning to make following this kind of reference work only when ScanCode is run with -p, or should it also work without -p?

This is a feature of license detection in general, so this is not only for --package scans but for any license detection that includes such a reference.
For instance, the text "This is free software. See COPYING for details.", as seen in https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/unknown-license-reference_91.RULE, should be resolved based on the proper COPYING license found and detected, using the metadata in https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/data/rules/unknown-license-reference_91.yml#L4

In the context of a package scan the accuracy should be much better since we work from structured data.

@pombredanne
Member

@sameer1046 FYI

@pombredanne
Member

@sschuberth actually this is mostly there (at least in the develop branch):
Say you have a directory with two files:

$ cat license-ref/COPYING
license: apache 2.0

$ cat license-ref/ref
This is free software. See COPYING for details.

$ scancode -l --license-text --license-text-diagnostics --yaml - license-ref/

yields for license-ref/ref:

        license_expressions:
            - unknown-license-reference
            - apache-2.0

To finish this story, the use of the new "Detection" object, which can merge multiple detections without losing details, will logically "merge" the two license expressions above into a single apache-2.0 license detection, removing the noise from the `unknown-license-reference` AND still keeping all the detection details that led to this (which is IMHO important for full traceability and clarity).

headers:
    -   tool_name: scancode-toolkit
        tool_version: 30.1.0
        options:
            input:
                - license-ref/
            --license: yes
            --license-text: yes
            --license-text-diagnostics: yes
            --unknown-licenses: yes
            --yaml: '-'
        notice: |
            Generated with ScanCode and provided on an "AS IS" BASIS, WITHOUT WARRANTIES
            OR CONDITIONS OF ANY KIND, either express or implied. No content created from
            ScanCode should be considered or used as legal advice. Consult an Attorney
            for any legal advice.
            ScanCode is a free software code scanning tool from nexB Inc. and others.
            Visit https://github.com/nexB/scancode-toolkit/ for support and download.
        start_timestamp: '2022-01-13T080858.894595'
        end_timestamp: '2022-01-13T080900.669069'
        output_format_version: 2.0.0
        duration: '1.7744977474212646'
        message:
        errors: []
        extra_data:
            spdx_license_list_version: '3.15'
            OUTDATED: 'WARNING: Outdated ScanCode Toolkit version! You are using an outdated
                version of ScanCode Toolkit: 30.1.0 released on: 2021-09-24. A new version is
                available with important improvements including bug and security fixes, updated
                license, copyright and package detection, and improved scanning accuracy. Please
                download and install the latest version of ScanCode. Visit https://github.com/nexB/scancode-toolkit/releases
                for details.'
            files_count: 2
files:
    -   path: license-ref
        type: directory
        licenses: []
        license_expressions: []
        percentage_of_license_text: '0'
        scan_errors: []
    -   path: license-ref/COPYING
        type: file
        licenses:
            -   key: apache-2.0
                score: '100.0'
                name: Apache License 2.0
                short_name: Apache 2.0
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Apache Software Foundation
                homepage_url: http://www.apache.org/licenses/
                text_url: http://www.apache.org/licenses/LICENSE-2.0
                reference_url: https://scancode-licensedb.aboutcode.org/apache-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.yml
                spdx_license_key: Apache-2.0
                spdx_url: https://spdx.org/licenses/Apache-2.0
                start_line: 1
                end_line: 1
                matched_rule:
                    identifier: apache-2.0_65.RULE
                    license_expression: apache-2.0
                    licenses:
                        - apache-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: yes
                    is_license_intro: no
                    has_unknown: no
                    matcher: 1-hash
                    rule_length: 4
                    matched_length: 4
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: 'license: apache 2.0'
        license_expressions:
            - apache-2.0
        percentage_of_license_text: '100.0'
        scan_errors: []
    -   path: license-ref/ref
        type: file
        licenses:
            -   key: unknown-license-reference
                score: '100.0'
                name: Unknown License file reference
                short_name: Unknown License reference
                category: Unstated License
                is_exception: no
                is_unknown: yes
                owner: Unspecified
                homepage_url:
                text_url:
                reference_url: https://scancode-licensedb.aboutcode.org/unknown-license-reference
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.yml
                spdx_license_key: LicenseRef-scancode-unknown-license-reference
                spdx_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-license-reference.LICENSE
                start_line: 1
                end_line: 1
                matched_rule:
                    identifier: unknown-license-reference_91.RULE
                    license_expression: unknown-license-reference
                    licenses:
                        - unknown-license-reference
                    referenced_filenames:
                        - COPYING
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: yes
                    is_license_tag: no
                    is_license_intro: no
                    has_unknown: yes
                    matcher: 1-hash
                    rule_length: 8
                    matched_length: 8
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: This is free software. See COPYING for details.
            -   key: apache-2.0
                score: '100.0'
                name: Apache License 2.0
                short_name: Apache 2.0
                category: Permissive
                is_exception: no
                is_unknown: no
                owner: Apache Software Foundation
                homepage_url: http://www.apache.org/licenses/
                text_url: http://www.apache.org/licenses/LICENSE-2.0
                reference_url: https://scancode-licensedb.aboutcode.org/apache-2.0
                scancode_text_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.LICENSE
                scancode_data_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/apache-2.0.yml
                spdx_license_key: Apache-2.0
                spdx_url: https://spdx.org/licenses/Apache-2.0
                start_line: 1
                end_line: 1
                matched_rule:
                    identifier: apache-2.0_65.RULE
                    license_expression: apache-2.0
                    licenses:
                        - apache-2.0
                    referenced_filenames: []
                    is_license_text: no
                    is_license_notice: no
                    is_license_reference: no
                    is_license_tag: yes
                    is_license_intro: no
                    has_unknown: no
                    matcher: 1-hash
                    rule_length: 4
                    matched_length: 4
                    match_coverage: '100.0'
                    rule_relevance: 100
                matched_text: 'license: apache 2.0'
        license_expressions:
            - unknown-license-reference
            - apache-2.0
        percentage_of_license_text: '100.0'
        scan_errors: []
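The "merge" described before the output could look roughly like the toy sketch below. It does not reproduce the actual Detection object in detection.py: `summarize` is a hypothetical function, the match dicts carry only three of the fields shown in the output above, and joining the remaining keys with " AND " is an assumption of this sketch.

```python
def summarize(matches):
    """
    Build one license expression for a file while keeping every match.

    An unknown match that references another file is left out of the
    expression when at least one known license was also reported for the
    file; all matches stay available under "matches" for traceability.
    """
    known = [m for m in matches if not m["is_unknown"]]
    kept = [
        m for m in matches
        if not (m["is_unknown"] and m["referenced_filenames"] and known)
    ]
    # Deduplicate keys while preserving their order.
    keys = list(dict.fromkeys(m["key"] for m in kept))
    return {"license_expression": " AND ".join(keys), "matches": matches}


# The two matches reported for license-ref/ref in the output above.
ref_matches = [
    {
        "key": "unknown-license-reference",
        "is_unknown": True,
        "referenced_filenames": ["COPYING"],
    },
    {"key": "apache-2.0", "is_unknown": False, "referenced_filenames": []},
]
detection = summarize(ref_matches)
print(detection["license_expression"])
print(len(detection["matches"]))
```

Keeping the full match list next to the summarized expression is what preserves the details that led to the result, at the cost of a slightly larger output.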

@pombredanne
Member

We have support for this now, but the approach is being refined in the next release, so I am moving this there instead.

@pombredanne pombredanne modified the milestones: v31.0, v32.0 Jun 14, 2022
@AyanSinhaMahapatra
Member

This is now supported comprehensively, and merged. Closing this!
