Wrong spdx detection for file generator #2502

Open
tardyp opened this issue Apr 22, 2021 · 7 comments
@tardyp
Contributor

tardyp commented Apr 22, 2021

Some of our files contain source code generators.

They look like this:

import datetime


def printHeader(dest):
    year = datetime.datetime.now().year
    print("# Copyright (C) %s, ACME, All Right Reserved." % year, file=dest)
    print("#", file=dest)
    print("# This file is subject to the terms and conditions of", file=dest)
    print("# ACME Software License Agreement", file=dest)
    print("# SPDX-License-Identifier: LicenseRef-ACME-Proprietary", file=dest)
    print("#", file=dest)
    print("# DO NOT EDIT MANUALLY", file=dest)
    print("# This file was autogenerated by mighty_generate.py", file=dest)

The fifth print call (the SPDX-License-Identifier line) is picked up by the SPDX matcher, but the detected SPDX license is reported as unknown.

We do add LicenseRef-ACME-Proprietary to our custom SPDX licenses, but it is not recognised, probably because the rest of the line is taken as part of the SPDX ID.

@tardyp tardyp added the bug label Apr 22, 2021
@pombredanne
Member

@tardyp thanks.

  1. SPDX ids are detected if they start in the first few words of a line per https://github.com/nexB/scancode-toolkit/blob/9ff0880379aa68cc8d1c4e31cc9c206c6b8e9932/src/licensedcode/query.py#L370
    So print("# SPDX-License-Identifier: LicenseRef-ACME-Proprietary", file=dest) would be collected and detected

  2. we assume that everything after SPDX-License-Identifier: is an SPDX license expression up to the end of the line
    https://github.com/nexB/scancode-toolkit/blob/9ff0880379aa68cc8d1c4e31cc9c206c6b8e9932/src/licensedcode/query.py#L390

  3. when doing the detection, a few cleanups are applied to deal with some well-known, weird but common issues, and then the expression is parsed. For now we do not consider an unknown LicenseRef as a valid license. Furthermore, there is some trailing "junk" in file=dest
    See https://github.com/nexB/scancode-toolkit/blob/9ff0880379aa68cc8d1c4e31cc9c206c6b8e9932/src/licensedcode/match_spdx_lid.py

So LicenseRef-ACME-Proprietary", file=dest) will be reported as something unknown.
But LicenseRef-ACME-Proprietary alone would also be reported as something unknown.
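
To make steps 1 and 2 concrete, here is a minimal sketch of that kind of extraction. It is not the actual query.py code: the function name extract_spdx_expression and the MAX_LEADING_WORDS limit are made up for illustration, and the real cleanups are not reproduced.

```python
import re

# Hypothetical limit: how many words may precede the tag on a line.
MAX_LEADING_WORDS = 3

SPDX_TAG = re.compile(r'SPDX-License-Identifier:', re.IGNORECASE)


def extract_spdx_expression(line, max_leading_words=MAX_LEADING_WORDS):
    """
    Return the text after "SPDX-License-Identifier:" up to the end of the
    line, or None if the tag is absent or starts too far into the line.
    """
    match = SPDX_TAG.search(line)
    if not match:
        return None
    # Count the words before the tag, splitting on anything that is not
    # a letter or a digit.
    prefix = line[:match.start()]
    leading_words = [w for w in re.split(r'[^A-Za-z0-9]+', prefix) if w]
    if len(leading_words) > max_leading_words:
        return None
    # Everything after the tag is assumed to be the license expression.
    return line[match.end():].strip()


line = 'print("# SPDX-License-Identifier: LicenseRef-ACME-Proprietary", file=dest)'
print(extract_spdx_expression(line))
# prints: LicenseRef-ACME-Proprietary", file=dest)
```

Only the word "print" precedes the tag, so the line is collected, and the closing quote plus file=dest) end up inside the candidate expression.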

Some solution elements:

  1. fix the false positive detection for the few patterns of this code generator with a false positive rule? ... ? not sure.
  2. do not collect SPDX-License-Identifiers when there is the word "print" just before, or something else that makes the line look like code, but that could be dangerous
  3. we need to find a way to report LicenseRef-something correctly when this is not a known license in scancode license dataset
  4. provide a way to "fix" certain matches in a local, specific way with some config file. Here for instance

@tardyp
Contributor Author

tardyp commented Apr 23, 2021

Hi @pombredanne, thanks for the detailed comments. That makes a lot of sense.

Some comments:

But LicenseRef-ACME-Proprietary alone would also be reported as something unknown.

LicenseRef-ACME-Proprietary is not reported as something unknown in my instance, as I have custom rules that define it (a fix for #2471 that is not yet ready for upstream). I think this is the best way to solve it for me. This ensures that all SPDX licenses are correctly identified by the IP Management group.
We could report it as expected, but we would still need some kind of unknown_spdx flag so that it can be reviewed.

fix the false positive detection for the few patterns of this code generator with a false positive rule?

I think this fits my problem, and I will experiment with it in the short term.

do not collect SPDX-License-Identifiers when there is the word "print"

I had a similar idea, but print is not universal. What is more universal, to me, is quotes.
I was thinking about skipping the identification if:

  • there are two of the same quotes (",',` ) in the line
  • SPDX-License-Identifier is between those quotes
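
A minimal sketch of these two conditions, assuming the raw line is available as a string. The name is_quoted_spdx_tag is made up for illustration and is not part of ScanCode:

```python
QUOTE_CHARS = '"\'`'
TAG = 'SPDX-License-Identifier'


def is_quoted_spdx_tag(line):
    """
    Return True if the tag sits between two identical quote characters,
    which suggests that it is part of a string literal.
    """
    tag_start = line.find(TAG)
    if tag_start == -1:
        return False
    before = line[:tag_start]
    after = line[tag_start + len(TAG):]
    # The same quote character must appear on both sides of the tag.
    return any(q in before and q in after for q in QUOTE_CHARS)


line = 'print("# SPDX-License-Identifier: LicenseRef-ACME-Proprietary", file=dest)'
print(is_quoted_spdx_tag(line))  # prints: True
print(is_quoted_spdx_tag('# SPDX-License-Identifier: MIT'))  # prints: False
```

This only checks that a matching quote exists on each side. It does not track whether an earlier string was already closed before the tag, so a real implementation would likely need something stricter.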

@pombredanne
Member

Yes, quotes would be better... we are mostly ignoring quotes.

Except for SPDX License Identifiers, where we have access to the raw text. In https://github.com/nexB/scancode-toolkit/blob/bb044200ae86770f9bb01560c0033037ee18b947/src/licensedcode/query.py#L441 we could likely add a test on the content.

If I use this one-liner as a test file

print("# SPDX-License-Identifier: LicenseRef-ACME-Proprietary", file=dest)

Then in https://github.com/nexB/scancode-toolkit/blob/bb044200ae86770f9bb01560c0033037ee18b947/src/licensedcode/query.py#L441

we have these values to use as needed:

spdx_expression = 'LicenseRef-ACME-Proprietary", file=dest)'
line = 'print("# SPDX-License-Identifier: LicenseRef-ACME-Proprietary", file=dest)'

The mere presence of a single quote may be enough to flag this as a false positive IMHO... BUT there could be cases (such as formatted markdown?) where this may not be a false positive?
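
As a rough sketch of that looser test, and of the caveat: the helper name has_quote is made up, and the Markdown value assumes the expression is extracted the same way as above (everything after the tag up to the end of the line).

```python
QUOTE_CHARS = '"\'`'


def has_quote(spdx_expression):
    """Return True if the candidate expression contains any quote character."""
    return any(q in spdx_expression for q in QUOTE_CHARS)


# Value from the generator one-liner.
code_expression = 'LicenseRef-ACME-Proprietary", file=dest)'

# Assumed value for a Markdown line such as: `SPDX-License-Identifier: MIT`
markdown_expression = 'MIT`'

print(has_quote(code_expression))      # prints: True
print(has_quote(markdown_expression))  # prints: True
print(has_quote('MIT OR Apache-2.0'))  # prints: False
```

Both the generator line and the Markdown line are flagged, so this test alone would also drop a legitimate identifier that is formatted as inline code.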

@soimkim
Contributor

soimkim commented Jun 19, 2021

I also have a similar issue.
I want to print the items marked with an SPDX license identifier, but the ScanCode JSON output is as follows.

  • Information written to the file:
    # SPDX-License-Identifier: LicenseRef-Sample-Proprietary

  • ScanCode json result :
    "licenses": [
      {
        "key": "unknown-spdx",
        "score": 83.33,
        "name": "Unknown SPDX license detected but not recognized",
        "short_name": "unknown SPDX",
        "category": "Unstated License",
        "is_exception": false,
        "owner": "Unspecified",
        "homepage_url": null,
        "text_url": "",
        "reference_url": "https://scancode-licensedb.aboutcode.org/unknown-spdx",
        "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-spdx.LICENSE",
        "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-spdx.yml",
        "spdx_license_key": "LicenseRef-scancode-unknown-spdx",
        "spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-spdx.LICENSE",
        "start_line": 3,
        "end_line": 3,
        "matched_rule": {
          "identifier": "spdx-license-identifier: unknown-spdx",
          "license_expression": "unknown-spdx",
          "licenses": [
            "unknown-spdx"
          ],
          ...

Although it is not a license registered in SPDX, is there a way to print the license ID ("Sample-Proprietary" in the example above) if it is written in SPDX license identifier notation?

If "key" is extracted as "unknown-spdx", how about outputting the license ID as written in the "identifier" field of "matched_rule"?

@pombredanne
Member

This is not quite what you asked for, but if you use the --license-text option you will at least also see the original expression:

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "21.6.7",
      "options": {
        "input": [
          "spdx"
        ],
        "--json-pp": "-",
        "--license": true,
        "--license-text": true,
        "--license-text-diagnostics": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2021-06-19T155103.773221",
      "end_timestamp": "2021-06-19T155107.230224",
      "duration": 3.457033157348633,
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "spdx",
      "type": "file",
      "licenses": [
        {
          "key": "unknown-spdx",
          "score": 100.0,
          "name": "Unknown SPDX license detected but not recognized",
          "short_name": "unknown SPDX",
          "category": "Unstated License",
          "is_exception": false,
          "owner": "Unspecified",
          "homepage_url": null,
          "text_url": "",
          "reference_url": "https://scancode-licensedb.aboutcode.org/unknown-spdx",
          "scancode_text_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-spdx.LICENSE",
          "scancode_data_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-spdx.yml",
          "spdx_license_key": "LicenseRef-scancode-unknown-spdx",
          "spdx_url": "https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/licenses/unknown-spdx.LICENSE",
          "start_line": 1,
          "end_line": 1,
          "matched_rule": {
            "identifier": "spdx-license-identifier: unknown-spdx",
            "license_expression": "unknown-spdx",
            "licenses": [
              "unknown-spdx"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": false,
            "is_license_tag": true,
            "is_license_intro": false,
            "matcher": "1-spdx-id",
            "rule_length": 6,
            "matched_length": 6,
            "match_coverage": 100.0,
            "rule_relevance": 100
          },
          "matched_text": "SPDX-License-Identifier: LicenseRef-Sample-Proprietary"
        }
      ],
      "license_expressions": [
        "unknown-spdx"
      ],
      "percentage_of_license_text": 100.0,
      "scan_errors": []
    }
  ]
}

@soimkim
Contributor

soimkim commented Jun 21, 2021

Dear @pombredanne ,
Thanks for the quick reply.
In my case, I just load the value of matched_text.
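
For anyone wanting to do the same, a minimal sketch of reading matched_text from the JSON output could look like this. It assumes the scan was run with --license-text and that the JSON has the layout shown above; the function name unknown_spdx_ids is made up for illustration.

```python
import json

TAG = 'SPDX-License-Identifier:'


def unknown_spdx_ids(scan):
    """
    Yield (path, expression) for each "unknown-spdx" license match whose
    matched_text contains an SPDX-License-Identifier tag.
    """
    for scanned_file in scan.get('files', []):
        for lic in scanned_file.get('licenses', []):
            if lic.get('key') != 'unknown-spdx':
                continue
            matched_text = lic.get('matched_text') or ''
            # Keep only what follows the tag.
            _, sep, expression = matched_text.partition(TAG)
            if sep:
                yield scanned_file.get('path'), expression.strip()


# Trimmed-down version of the scan output shown above.
scan = json.loads('''
{"files": [{"path": "spdx", "licenses": [{"key": "unknown-spdx",
  "matched_text": "SPDX-License-Identifier: LicenseRef-Sample-Proprietary"}]}]}
''')
print(list(unknown_spdx_ids(scan)))
# prints: [('spdx', 'LicenseRef-Sample-Proprietary')]
```

For a real scan, the dict would come from json.load() on the file written by ScanCode instead of the inline string.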

@pombredanne
Member

@soimkim I entered a new issue for your report as I think it may be best tracked separately. Please see #2650
