Custom License Rules folder #2471

tardyp · 2021-03-31T15:13:59Z

Short Description

Our internal code has copyright headers that we would like to properly categorize.
We don't think it makes sense to upstream those rules, and we want to avoid forking scancode.

Thus we would like to add an option to scancode that takes the path of a folder containing custom .yml + RULE files.

Possible Labels

license scan

Select Category

How This Feature will help you/your organization

This would help us to use scancode to categorize proprietary code we get from subcontractors

Possible Solution/Implementation Details

User would say

scancode -clip --json-pp --custom_licenses=/path/to/licenses --custom_rules=/path/to/rules - path/to/code
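As a rough illustration of what loading such a folder could involve, here is a minimal sketch that pairs each `.RULE` text file with its sibling `.yml` data file and reports files that have no partner. The helper name `find_rule_pairs` and the flat folder layout are assumptions made for this example; this is not scancode's actual loading code.

```python
from pathlib import Path


def find_rule_pairs(rules_dir):
    """Return (pairs, orphans) for a folder of custom rules.

    Assumption: each rule is a `<name>.RULE` text file with a sibling
    `<name>.yml` data file, as described in this issue. This helper is
    hypothetical and is not part of scancode.
    """
    rules_dir = Path(rules_dir)
    pairs, orphans = [], []
    # Pair every rule text file with its data file, if present.
    for text_file in sorted(rules_dir.glob("*.RULE")):
        data_file = text_file.with_suffix(".yml")
        if data_file.is_file():
            pairs.append((data_file, text_file))
        else:
            orphans.append(text_file)
    # A .yml file without a matching .RULE file is reported too.
    for data_file in sorted(rules_dir.glob("*.yml")):
        if not data_file.with_suffix(".RULE").is_file():
            orphans.append(data_file)
    return pairs, orphans
```

Reporting orphans up front would let a scan fail early with a clear message instead of silently ignoring half of a rule.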

Can you help with this Feature

We are willing to provide a PR for this feature

mjherzog · 2021-03-31T15:38:12Z

This feature should be useful for many ScanCode users.

pombredanne · 2021-03-31T18:22:07Z

This makes sense. @richardfontana requested this feature in #480, and I reckon I have been slow to act as I was fearing fragmenting the database of licenses. In hindsight, that (unexpressed) concern was unlikely to be a valid one.

Now if these are just a few proprietary licenses and headers, it could be well worth adding them to scancode anyway.

And to implement this feature here are some thoughts:

A) the base approach to get these extra rules into scancode:

1. a directory that contains extra rules and that you could point to with some command line argument
2. a "plugin" where we package extra licenses and rules in a Python package that can be installed as some private extra locally.

I am leaning towards 2, as otherwise this may be complicated to deploy.

B) how these rules and licenses would be consumed:

1. they could be merged into the scancode main index
2. they could be included in their own secondary index (with either A.1 or A.2) and the detection would run using this (or these) extra indexes either before or after the main index, and the matched results merged

I am not sure which is best.
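To make the two consumption options concrete, here is a toy sketch. It assumes a rule set is a plain dict mapping a rule identifier to its text and that "matching" is a substring test, which is a deliberately naive stand-in for a real license index. Both function names are hypothetical.

```python
def merge_into_main(main_rules, extra_rules):
    """Option B.1: one combined rule set, to be built into a single index."""
    clashes = sorted(set(main_rules) & set(extra_rules))
    if clashes:
        # Refuse to silently override a rule that already exists upstream.
        raise ValueError(f"custom rules reuse existing identifiers: {clashes}")
    return {**main_rules, **extra_rules}


def match_with_both(text, main_rules, extra_rules, extra_first=True):
    """Option B.2: query each rule set on its own, then merge the matches."""
    ordered = [extra_rules, main_rules] if extra_first else [main_rules, extra_rules]
    matches = []
    for rules in ordered:
        # Toy matching: a rule matches when its text occurs in the input.
        matches.extend(rid for rid, rule_text in rules.items() if rule_text in text)
    return matches
```

In this toy model, B.1 has to deal with identifier collisions at merge time, while B.2 keeps the custom rules separate and leaves the ordering of the two lookups as an explicit choice.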

@tardyp We could have a quick live session to iron out a path!

tardyp · 2021-04-01T09:06:01Z

I didn't see #480, as I only focused my search on the keyword RULES.

I like what I see there, especially the idea from @DennisClark to automatically create this custom folder based on Unknown License findings.

In my first scans with scancode, we ended up with a big pile of unknown licenses, which is normal as we want to use scancode to make sure our proprietary software is not mixed up with open-source, and that our devs use packaging techniques to compose software.

I spent some time yesterday experimenting with the source code of scancode, and indeed discovered the huge license library and the need to cache the index.

I am not sure there is really a use case where the number of custom licenses would be so big that they need to be cached as well. The needed cache module refactoring seems quite scary to me.

What I like about a secondary index is that we could skip primary matching altogether if the secondary index match score is high enough.

This could open the path to a quick scan mode that we could put in the pre-commit CI.

pombredanne · 2021-04-01T13:47:52Z

> In my first scans with scancode, we ended up with a big pile of unknown licenses, which is normal as we want to use scancode to make sure our proprietary software is not mixed up with open-source, and that our devs use packaging techniques to compose software.

FWIW, any incorrect detection is treated as a bug (so tickets are mucho welcome!) AND @AyanSinhaMahapatra's https://github.com/nexB/scancode-analyzer/ is a new, emerging tool to spot and potentially fix these issues using multiple approaches including some ML.

> I am not sure there is really a use case where the number of custom licenses would be so big that they need to be cached as well. The needed cache module refactoring seems quite scary to me.

No worries there, it's not that complicated

> What I like about a secondary index is that we could skip primary matching altogether if the secondary index match score is high enough.
> This could open the path to a quick scan mode that we could put in the pre-commit CI.

Question: if you were to use a secondary index in your case, would you see an exclusive use of that index for a given scan run and not the main one? Or would you see the use of both at the same time?

tardyp · 2021-04-01T14:59:10Z

> incorrect detection is treated as a bug

I am not saying it is incorrect detection: those are mostly files under our proprietary license, and I don't expect scancode to magically detect it.
We have our own spdx identifiers, and scancode detects those as unknown-spdx; maybe that could be enhanced.

> Question: if you were to use a secondary index in your case, would you see an exclusive use of that index for a given scan run and not the main one? Or would you see the use of both at the same time?

I would use both.

For each file, if the secondary index detects with a 100% score that this is our copyright, don't bother running the rest of the rules.
If a file does not match one of our proprietary licenses, we try to detect based on the primary database (and we can afford the 250ms per file this takes).
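The two-stage flow described here could be sketched as follows. The `Match` tuple, the `detect` function and the matcher callables are hypothetical stand-ins, not scancode APIs. The sketch also assumes that partial custom matches are kept and merged with the primary results, which is only one possible choice.

```python
from typing import Callable, List, NamedTuple


class Match(NamedTuple):
    license_key: str
    score: float  # match score, from 0 to 100


# A matcher takes the text of a file and returns its matches.
Matcher = Callable[[str], List[Match]]


def detect(text: str, secondary: Matcher, primary: Matcher,
           threshold: float = 100.0) -> List[Match]:
    """Query the small custom index first and the main index only if needed."""
    custom = secondary(text)
    if any(m.score >= threshold for m in custom):
        # Confident proprietary match: skip the slower primary matching.
        return custom
    # Otherwise fall back to the main index, keeping any partial custom matches.
    return custom + primary(text)
```

With the default threshold of 100, only exact proprietary matches short-circuit the scan, so every other file still goes through the main index.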

codeakki · 2021-04-04T15:54:52Z

Hey, may I know how scancode-toolkit creates the dataset for the agent?

tardyp · 2021-04-04T17:59:05Z

Hi @codeakki ,
the dataset is stored inside the source code:
https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data

pombredanne · 2021-04-23T12:57:25Z

I am closing this in favor of the older #480

tardyp added the new feature label Mar 31, 2021

tardyp mentioned this issue Apr 23, 2021

Wrong spdx detection for file generator #2502

Open

pombredanne mentioned this issue Apr 23, 2021

Add support for "extra", e.g. private or local licenses #480

Closed

pombredanne closed this as completed Apr 23, 2021
