Code wrongly detected as gpl-1.0 #2371

Open
xu1119 opened this issue Jan 21, 2021 · 3 comments

Comments

xu1119 commented Jan 21, 2021

When scanning this file with the latest scancode, I get the following license detection:
File: https://github.com/zyq8709/DexHunter/blob/master/dalvik/vm/compiler/codegen/x86/AnalysisO1.cpp

{
          "key": "gpl-1.0",
          "score": 100.0,
          "name": "GNU General Public License 1.0",
          "short_name": "GPL 1.0",
          "category": "Copyleft",
          "is_exception": false,
          "owner": "Free Software Foundation (FSF)",
          "homepage_url": "http://www.gnu.org/licenses/gpl-1.0.html",
          "text_url": "http://www.gnu.org/licenses/gpl-1.0.txt",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:gpl-1.0",
          "spdx_license_key": "GPL-1.0-only",
          "spdx_url": "https://spdx.org/licenses/GPL-1.0-only",
          "start_line": 1606,
          "end_line": 1606,
          "matched_rule": {
            "identifier": "gpl-1.0_15.RULE",
            "license_expression": "gpl-1.0",
            "licenses": [
              "gpl-1.0"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": false,
            "is_license_tag": true,
            "matcher": "2-aho",
            "rule_length": 2,
            "matched_length": 2,
            "match_coverage": 100.0,
            "rule_relevance": 100.0
          },
          "matched_text": "            currentBB->xferPoints[currentBB->num_xfer_points].vr_gpl = -1;"
        },

Description

Source code wrongly detected as gpl-1.0
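
A plausible explanation for the match, sketched below under stated assumptions: the rule text for gpl-1.0_15.RULE is `gpl 1` (as noted later in this thread), and if the tokenizer treats underscores and punctuation as separators, then `vr_gpl = -1` ends in the tokens `gpl` and `1`. The `tokenize` and `contains_sequence` helpers are hypothetical simplifications for illustration, not ScanCode's actual tokenizer or matcher.

```python
import re

# Simplified stand-in for a license-text tokenizer (an assumption, not
# ScanCode's real implementation): lowercase the text and keep runs of
# letters and digits, so underscores and punctuation act as separators.
TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Return the lowercase alphanumeric tokens of ``text``."""
    return TOKEN.findall(text.lower())


def contains_sequence(tokens, needle):
    """Return True if ``needle`` occurs as a contiguous run in ``tokens``."""
    n = len(needle)
    return any(tokens[i:i + n] == needle for i in range(len(tokens) - n + 1))


# The matched source line from the scan result, and the rule text "gpl 1".
line = "currentBB->xferPoints[currentBB->num_xfer_points].vr_gpl = -1;"
tokens = tokenize(line)
rule_tokens = tokenize("gpl 1")

print(tokens[-2:])
print(contains_sequence(tokens, rule_tokens))
```

Under these assumptions the two-token rule is fully covered by the variable name and the literal that follows it, which is consistent with the reported `"matched_length": 2` and `"match_coverage": 100.0`.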

How To Reproduce

scancode -li --license-text --json-pp - AnalysisO1.cpp

System configuration

  • What OS are you running on? (Windows/MacOS/Linux)
    Ubuntu 18.04
  • What version of scancode-toolkit was used to generate the scan file?
    ScanCode 3.2.3
  • What installation method was used to install/run scancode? (pip/source download/other)
    pip
  • Python version
    Python 3.6.12 :: Anaconda, Inc.
@xu1119 xu1119 added the bug label Jan 21, 2021
@pombredanne
Member

@xu1119 Thanks!
@AyanSinhaMahapatra would your new plugin be able to spot this?

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Jan 21, 2021

@pombredanne No (and yes). But it definitely should have, so this was a good find.

  1. In most of the false positives I saw before, the common factor was that rule_length was 1: the text matched a very simple rule containing just the name of the license, such as gpl. This one, however, matched gpl-1.0_15.RULE, whose text is gpl 1, so its rule_length is 2.
    So far, the preliminary step to separate probable false positives has been to select matches with "is_license_tag" == true and "rule_length" == 1, and then run them through a classifier to decide more accurately. This match has "rule_length" == 2, so it is never selected.
    We definitely need a more explicit step: go through all the scancode license_tag rules, see which ones could produce a false positive, and then either raise the "rule_length" threshold so that these cases are analyzed too, or maintain a set of rules known to generate potential false positives. I am adding a ticket now to do this.

  2. The sentence classifier step, i.e. the false_positive vs. license_tag NLP classifier, does detect this correctly, so that part works. The preliminary step only takes matches with "rule_length" == 1 because the assumption was that false positives come only from these rules, so we would not have to pass every license_tag match through the classifier. But there are clearly exceptions to this assumption, like this case, and we should be able to detect them.
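
To make the gap concrete, here is a minimal sketch of the preliminary filter described above, applied to the match reported in this issue. The function names and the `max_rule_length=3` threshold are hypothetical, chosen only for illustration; they are not taken from scancode-analyzer. Only the fields used by the filter are reproduced from the scan output.

```python
def is_candidate_strict(match):
    """Preliminary filter as described: license tag rules of length 1 only."""
    rule = match["matched_rule"]
    return rule["is_license_tag"] and rule["rule_length"] == 1


def is_candidate_relaxed(match, max_rule_length=3):
    """Hypothetical relaxed variant: any short license tag rule.

    The threshold is an illustrative assumption, not a tuned value.
    """
    rule = match["matched_rule"]
    return rule["is_license_tag"] and rule["rule_length"] <= max_rule_length


# Subset of the match reported in this issue.
reported_match = {
    "key": "gpl-1.0",
    "start_line": 1606,
    "matched_rule": {
        "identifier": "gpl-1.0_15.RULE",
        "is_license_tag": True,
        "rule_length": 2,
    },
}

# The strict filter skips the match, so it never reaches the classifier;
# the relaxed filter forwards it.
print(is_candidate_strict(reported_match))
print(is_candidate_relaxed(reported_match))
```

A fixed length threshold is only one option. The alternative mentioned above, keeping an explicit set of rule identifiers known to produce false positives, would replace the length comparison with a set membership test.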

Thanks @xu1119

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Jan 21, 2021

Also @pombredanne, there is a ticket open for an extra heuristic you suggested, at aboutcode-org/scancode-analyzer#29. Implementing it (without the single-word restriction, and making things more explicit as discussed above) would also detect this, since the match is at "start_line": 1606.
