Add SKOS lookup function in Fix #415

Closed
TobiasNx opened this issue Nov 3, 2021 · 42 comments · Fixed by metafacture/metafacture-fix#229

@TobiasNx
Contributor

TobiasNx commented Nov 3, 2021

The Destatis-Fächerklassifikation vocabulary now has English prefLabels. To add them with Metamorph/Fix, we currently need a separate mapping file for each language to get the prefLabels we want, like https://gitlab.com/oersi/oersi-etl/-/blob/master/data/maps/subject-labels.tsv
For an English version we would need an additional list, which would also have to be maintained.

But since we have SkoHub Vocabs/SKOS ttl files, it would be nice to use them for the lookup so that we do not need to create and update additional lists.

The lookup should target the ttl file, e.g.: https://github.com/dini-ag-kim/hochschulfaechersystematik/blob/master/hochschulfaechersystematik.ttl
(Other skos serialization could follow)

Something like the following would be nice (mock code):

@base <https://w3id.org/kim/hochschulfaechersystematik/> .
@prefix dct: <http://purl.org/dc/terms/> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix schema: <http://schema.org/> .
@prefix vann: <http://purl.org/vocab/vann/> .

...

<n4> a skos:Concept ;
  skos:prefLabel "Mathematik, Naturwissenschaften"@de, "Mathematics, Natural Sciences"@en ;
  skos:narrower   <n36>, <n37>, <n39>, <n40>, <n41>, <n42>, <n43>, <n44> ;
  skos:notation "4" ;
  skos:topConceptOf <scheme> .

...

Idea for Fix function:

skos_lookup("element-path", file="[path/url]",
[match="attribute that should match", matchLanguage="language of the value to be replaced"],
target="attribute to replace with", targetLanguage="language of the replacing value")

file= can be a URL or a local file
match= defaults to id
match= and matchLanguage= are optional
target= and targetLanguage= are always needed

Use case 1:

Find matching subject and return object of targeted predicate.

in: https://w3id.org/kim/hochschulfaechersystematik/n4

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", target="prefLabel", targetLanguage="de")

out: Mathematik, Naturwissenschaften

Use case 2:

Find matching object value in selected predicate and return its subject.

in: Mathematics, Natural Sciences

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", match="prefLabel", matchLanguage="en", target="id")

out: https://w3id.org/kim/hochschulfaechersystematik/n4

Use case 3:

Find matching object value in selected predicate and return object of targeted and connected predicate.
This could be also interesting if we have SKOS files with hiddenLabels or altLabels.

in: Mathematics, Natural Sciences

skos_lookup("path", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", match="prefLabel", matchLanguage="en", target="prefLabel", targetLanguage="de")

out: Mathematik, Naturwissenschaften
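
The three use cases can be sketched in Python over a small in-memory list of statements. This is only an illustration of the intended semantics, not an implementation for Metafacture: the `TRIPLES` layout and the Python function `skos_lookup` are hypothetical, and the data is copied from the n4 snippet above.

```python
from typing import Optional

BASE = "https://w3id.org/kim/hochschulfaechersystematik/"

# (subject, predicate, value, language) tuples taken from the <n4> snippet.
TRIPLES = [
    (BASE + "n4", "prefLabel", "Mathematik, Naturwissenschaften", "de"),
    (BASE + "n4", "prefLabel", "Mathematics, Natural Sciences", "en"),
    (BASE + "n4", "notation", "4", None),
]


def skos_lookup(value, match="id", match_language=None,
                target="prefLabel", target_language=None) -> Optional[str]:
    """Hypothetical helper: return the looked-up value, or None if nothing matches."""
    # Step 1: find the subject, either directly by id or via a matching literal.
    if match == "id":
        subjects = {s for s, _, _, _ in TRIPLES if s == value}
    else:
        subjects = {s for s, p, o, lang in TRIPLES
                    if p == match and o == value and lang == match_language}
    if not subjects:
        return None
    subject = sorted(subjects)[0]  # deterministic choice if several subjects match
    # Step 2: return the subject itself or the literal of the target predicate.
    if target == "id":
        return subject
    for s, p, o, lang in TRIPLES:
        if s == subject and p == target and lang == target_language:
            return o
    return None


# Use case 1: id -> prefLabel@de
print(skos_lookup(BASE + "n4", target="prefLabel", target_language="de"))
# Use case 2: prefLabel@en -> id
print(skos_lookup("Mathematics, Natural Sciences", match="prefLabel",
                  match_language="en", target="id"))
# Use case 3: prefLabel@en -> prefLabel@de
print(skos_lookup("Mathematics, Natural Sciences", match="prefLabel",
                  match_language="en", target="prefLabel", target_language="de"))
```

In this sketch the matching step and the returning step are independent, which is why use case 3 is just the combination of use cases 1 and 2.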

Code review: @fsteeg
Functional review: @TobiasNx @acka47

@acka47 acka47 changed the title from "Add lookup function with ScoHub Vocabs/SKOS (Morph/Fix)" to "Add lookup function with SkoHub Vocabs/SKOS (Morph/Fix)" Apr 1, 2022
@acka47 acka47 self-assigned this Jun 2, 2022
@acka47
Contributor

acka47 commented Jun 2, 2022

In today's meeting we decided to:

  1. work on a concrete use case (from RPB, @acka47 will provide it)
  2. focus on an implementation for Metafix
  3. support different RDF serializations
  4. take into account that we would have to address the considerations below at a later point without breaking the implementation

Further considerations:

  • What about concept schemes that are not available as single files, where you have to traverse the graph from ConceptScheme over hasTopConcept and narrower to get all the data for lookup function? (E.g. SkoHub Vocabs doesn't require you to have the ConceptScheme in one file, you could also have one file per Concept.)
  • For use cases I know from @sroertgen one would need to specify different fields (e.g. prefLabel, altLabel, hiddenLabel) to be taken into account for matching.
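
The first consideration, collecting all concepts of a scheme that is spread over several files, could look roughly like the following breadth-first walk. It is a sketch under assumptions: `DOCUMENTS` and `load` are hypothetical stand-ins for fetching one file per resource, and the `hasTopConcept` entry is assumed as the inverse of the `topConceptOf` statement in the n4 snippet.

```python
from collections import deque

# Hypothetical stand-in for "one file per resource". Keys are the relative
# IRIs from the n4 snippet; n36 and n37 are left empty because only their
# ids matter for the traversal.
DOCUMENTS = {
    "scheme": {"hasTopConcept": ["n4"]},
    "n4": {"narrower": ["n36", "n37"]},
    "n36": {},
    "n37": {},
}


def load(uri):
    """Hypothetical loader: return the properties stored for one resource."""
    return DOCUMENTS.get(uri, {})


def collect_concepts(scheme_uri, load):
    """Walk from the scheme via hasTopConcept and narrower, breadth first."""
    ordered = []
    visited = set()
    queue = deque(load(scheme_uri).get("hasTopConcept", []))
    while queue:
        uri = queue.popleft()
        if uri in visited:  # guards against cycles and repeated narrower links
            continue
        visited.add(uri)
        ordered.append(uri)
        queue.extend(load(uri).get("narrower", []))
    return ordered


print(collect_concepts("scheme", load))
```

The resulting list of concept ids could then feed whatever builds the lookup table, so the lookup itself would not need to know how the scheme is split into files.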

@acka47
Contributor

acka47 commented Jun 2, 2022

As required, here is my use case.

In RPB data we only have notations for RPB subjects, e.g. #30 _sn584060_[/]#30a_sn584070_, see here.

I can create the correct concept URI with Fix, resulting in:

{
   "subject":[
      {
         "id":"http://purl.org/lobid/rpb#n584060",
         "label":"Platzhalter Schlagwortlabel",
         "type":[
            "Concept"
         ],
         "source":{
            "id":"http://purl.org/lobid/rpb",
            "label":"Systematik der Rheinland-Pfälzischen Bibliographie"
         }
      },
      {
         "id":"http://purl.org/lobid/rpb#n584070",
         "label":"Platzhalter Schlagwortlabel",
         "type":[
            "Concept"
         ],
         "source":{
            "id":"http://purl.org/lobid/rpb",
            "label":"Systematik der Rheinland-Pfälzischen Bibliographie"
         }
      }
   ]
}

As you can see, I added a generic "Platzhalter Schlagwortlabel" as the label for now, since I cannot (yet) look up labels in a SKOS file. In the future, I'd be happy to do something like this in the Fix:

add_field("label", lookup:"prefLabel@de", in:"http://purl.org/lobid/rpb", basedOn:"id", match:"${s}")

Where I basically specify what content should be added to the new "label" field by indicating:

  • which string to add, here the skos:prefLabel of the matched resource with language code "de"
  • the ConceptScheme to do the lookup in, here http://purl.org/lobid/rpb
  • the source field from my data to do the lookup with, here id
  • the field(s) to look for a match, here it is the subject URI in the RDF: ${s}
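
In Python terms, the intended effect on the record could be sketched as follows. The labels in `PREF_LABELS_DE` are made-up placeholders (the real RPB labels are not quoted in this thread), and `add_labels` is a hypothetical helper, not an existing Fix function.

```python
# Hypothetical lookup table: concept URI -> German prefLabel.
# The label texts are invented for illustration only.
PREF_LABELS_DE = {
    "http://purl.org/lobid/rpb#n584060": "Example label for n584060",
    "http://purl.org/lobid/rpb#n584070": "Example label for n584070",
}

record = {
    "subject": [
        {"id": "http://purl.org/lobid/rpb#n584060",
         "label": "Platzhalter Schlagwortlabel"},
        {"id": "http://purl.org/lobid/rpb#n584070",
         "label": "Platzhalter Schlagwortlabel"},
        {"id": "http://purl.org/lobid/rpb#unknown",
         "label": "Platzhalter Schlagwortlabel"},
    ]
}


def add_labels(record, labels):
    """Set each subject's label based on its id; keep the old label if unknown."""
    for subject in record.get("subject", []):
        subject["label"] = labels.get(subject["id"], subject["label"])
    return record


add_labels(record, PREF_LABELS_DE)
for subject in record["subject"]:
    print(subject["id"], "->", subject["label"])
```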

@acka47 acka47 assigned fsteeg and dr0i and unassigned acka47 Jun 2, 2022
@fsteeg
Member

fsteeg commented Jun 2, 2022

add_field("label", lookup:"prefLabel@de", in:"http://purl.org/lobid/rpb", basedOn:"id", match:"${s}")

In general, I think we should implement this like the existing lookup, so something like:

lookup("subject.label", "rpb.ttl", someOption: ..., anotherOption2: ...)

@fsteeg fsteeg removed their assignment Jun 2, 2022
@blackwinter
Member

In general, I think we should implement this like the existing lookup, so something like:

lookup("subject.label", "rpb.ttl", someOption: ..., anotherOption2: ...)

Are you sure that lookup() should be overloaded? Catmandu has lookup_in_store(), so I'd suggest either modeling it as a store and implementing that function or just naming it accordingly (lookup_in_rdf()). Otherwise, rdf_lookup() might be an acceptable name.

@fsteeg
Member

fsteeg commented Jun 2, 2022

My view would be that essentially, we want to support one additional file format, TTL, in addition to CSV and TSV.

Since we'd probably implement this based on an RDF model anyway, we might as well support other SKOS RDF serializations (though I'm not even sure I like that idea, I'd prefer to stick to actual use cases, and we use TTL files). But a generic RDF lookup would be quite a different thing. For that, something like lookup_in_store (and then using something like a triple store) might make more sense.

@blackwinter
Member

My view would be that essentially, we want to support one additional file format, TTL, in addition to CSV and TSV.

In principle, yes, but lookup() is specifically meant for dictionaries. And an RDF file, whatever its serialization, is conceptually quite different from a simple delimited file with key-value pairs.

But I'm unsure myself. I just think we might regret it if we overwhelmed lookup() with too many features.

[Come to think of it, maybe we shouldn't even have added local maps to it. lookup_in_map() or lookup_in_store(..., Memory) might have been more appropriate.]

@dr0i
Member

dr0i commented Jun 2, 2022

I agree with @blackwinter: while it's possible in principle to make a Map out of RDF files, it may get complicated. And since there are the "Further considerations", it may be better to go with an RDF store from the beginning.
As I am not a fan of external databases (they bring complexity) and our scenarios use only little data, I would start with an in-memory RDF store/model.

@dr0i dr0i removed the Metamorph label Jun 3, 2022
@fsteeg
Member

fsteeg commented Jun 3, 2022

I don't think it helps to talk about RDF here. Spreadsheets are also much more powerful than simple dictionary lookups, yet we don't have generic spreadsheet support, we only use TSV or CSV files as simple dictionaries. Same is our plan for SKOS as I understand it: we want to use it as a simple dictionary.

@dr0i
Member

dr0i commented Jun 3, 2022

Hm, but if you look at the scenarios @TobiasNx provided, these are not simple dictionaries, are they? I mean, yes, you can somehow break everything down to key-value structures, but they may not fit all purposes, e.g. "give me A, but A shall not have B and must be of Concept C". See also Semantic Reasoner. I mean, it's about SKOS lookup, so naturally RDF?
Maybe you can tell what's your problem with RDF? One obvious drawback is the need of heavy dependencies (going with apache jena), which on the other hand provides parsing of all kinds of RDF serializations, merging and querying in an easy and standardized way. So maybe we could provide some kind of modules in metafacture-fix, like in metafacture-core?

@fsteeg
Member

fsteeg commented Jun 3, 2022

Maybe you can tell what's your problem with RDF? One obvious drawback is the need of heavy dependencies (going with apache jena)

No problem with RDF, and I even imagined implementing this based on an RDF model, using Jena. My point is how this will be used. I think it should provide a simple way to look up values in a SKOS TTL file instead of a TSV or CSV. It should not require dealing with RDF concepts. Something like lookup(field, 'rpb.ttl') should provide a prefLabel for a concept ID. Lookup options as described above could be configured, but I think it should be that simple to use for the basic use case.

Another option, from my point of view, would be to add support for reading RDF data in Metafacture. We could then write a small 'preprocessing' workflow that transforms the RDF data into a lookup TSV and use that, instead of adding lookup support for SKOS.
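
As a rough illustration of that preprocessing idea, here is a Python sketch that turns label statements into a two-column lookup TSV. It does not use Metafacture or parse Turtle: the already-parsed `TRIPLES` list and the helper `to_lookup_tsv` are assumptions made for the example, with data taken from the n4 snippet in the initial post.

```python
import csv
import io

BASE = "https://w3id.org/kim/hochschulfaechersystematik/"

# Assumed result of parsing the ttl file: (subject, predicate, value, language).
TRIPLES = [
    (BASE + "n4", "prefLabel", "Mathematik, Naturwissenschaften", "de"),
    (BASE + "n4", "prefLabel", "Mathematics, Natural Sciences", "en"),
    (BASE + "n4", "notation", "4", None),
]


def to_lookup_tsv(triples, predicate, language):
    """Return 'subject<TAB>value' lines for one predicate and one language."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for subject, pred, value, lang in triples:
        if pred == predicate and lang == language:
            writer.writerow([subject, value])
    return buffer.getvalue()


print(to_lookup_tsv(TRIPLES, "prefLabel", "de"), end="")
print(to_lookup_tsv(TRIPLES, "prefLabel", "en"), end="")
```

Note that this approach yields one TSV per predicate and language, so the generated files would have to be refreshed whenever the vocabulary changes.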

@TobiasNx
Contributor Author

TobiasNx commented Jun 3, 2022

I want to point out one advantage of a genuine SKOS lookup: we can use one ttl file for multiple lookups instead of generating multiple TSV files, whether automatically or manually.

In OERSI we have e.g.:

lookup("learningResourceType[].*.prefLabel.de", "data/maps/hcrt-de-labels.tsv","sep_char":"\t", delete:"true")
lookup("learningResourceType[].*.prefLabel.en", "data/maps/hcrt-en-labels.tsv","sep_char":"\t", delete:"true")

with some kind of SKOS-lookup this could be:


skos_lookup("learningResourceType[].*.prefLabel.de", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", target="prefLabel", targetLanguage="de" )
skos_lookup("learningResourceType[].*.prefLabel.en", file="https://raw.githubusercontent.com/dini-ag-kim/hochschulfaechersystematik/master/hochschulfaechersystematik.ttl", target="prefLabel", targetLanguage="en" )

@acka47
Contributor

acka47 commented Jun 3, 2022

I want to point out one advantage of a genuine SKOS lookup: we can use one ttl file for multiple lookups instead of generating multiple TSV files, whether automatically or manually.

I agree with this statement as long as you say "one ConceptScheme" instead of "one ttl-file". As noted before, even with SkoHub Vocabs one Concept scheme can be spread over many files, which totally makes sense when you have a big vocab.

@TobiasNx
Contributor Author

TobiasNx commented Jun 3, 2022

I have updated the initial post so that the functions are Fix now:
#415 (comment)

I also gave an idea of the Fix function I had in mind to solve this:

skos_lookup("element-path", file="[path/url]",
[match="attribute that should match", matchLanguage="language of the value to be replaced"],
target="attribute to replace with", targetLanguage="language of the replacing value")

file= can be a URL or a local file
match= defaults to id
match= and matchLanguage= are optional
target= and targetLanguage= are always needed

@fsteeg
Member

fsteeg commented Jun 7, 2022

target= and targetLanguage= are always needed

Wouldn't it make sense to use prefLabel as a default, and make target optional?

And are language tags required in SKOS? Even if they were, I think having a default like 'If there is only one language, use that if no target language is given' would be nice.
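
One possible reading of that default rule, sketched in Python: if a target language is given, use it; otherwise only return a label when the choice is unambiguous. `pick_label` and its input shape are hypothetical and only illustrate the rule.

```python
def pick_label(labels, target_language=None):
    """Pick one label from a list of (text, language_or_None) pairs.

    The list is assumed to hold the values of a single predicate
    (e.g. prefLabel) for a single concept.
    """
    if target_language is not None:
        matches = [text for text, lang in labels if lang == target_language]
        return matches[0] if matches else None
    # No language requested: only answer if all labels share one language
    # (or all are untagged), otherwise the choice would be arbitrary.
    languages = {lang for _, lang in labels}
    if len(languages) == 1:
        return labels[0][0]
    return None


bilingual = [("Mathematik, Naturwissenschaften", "de"),
             ("Mathematics, Natural Sciences", "en")]
monolingual = [("Mathematik, Naturwissenschaften", "de")]

print(pick_label(bilingual, "en"))   # explicit language
print(pick_label(monolingual))       # single language, used as default
print(pick_label(bilingual))         # ambiguous without a language
```

Returning nothing in the ambiguous case is a design choice of this sketch; an implementation could equally raise an error or fall back to a configured language.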

@acka47 acka47 changed the title Add lookup function with SkoHub Vocabs/SKOS (Morph/Fix) Add SKOS lookup function in Fix Jun 7, 2022
@sroertgen

And are language tags required in SKOS? Even if they were, I think having a default like 'If there is only one language, use that if no target language is given' would be nice.

Language tags in SKOS are optional:

As specified in Section 5 of the SKOS Reference, skos:prefLabel, skos:altLabel and skos:hiddenLabel provide simple labels. They are all sub-properties of rdfs:label, and are used to link a skos:Concept to an RDF plain literal, which is a character string (e.g. "love") combined with an optional language tag (e.g. "en-US") [RDF-CONCEPTS].

source: https://www.w3.org/TR/2009/NOTE-skos-primer-20090818/#seclabel

@acka47
Contributor

acka47 commented Jun 10, 2022

@sroertgen as you are here: I know you extensively use SKOS files for normalizing data in an ETL process. Are your use cases addressed in this issue or do you see something we should keep in mind?

@sroertgen

Yes, that is pretty much what we did in WLO. We used prefLabel, altLabel and hiddenLabel for data normalization and then assigned the id of the respective matching concept.
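
A minimal Python sketch of that kind of normalization, assuming exact string matching: every prefLabel, altLabel and hiddenLabel is indexed to its concept id, and incoming values are replaced by the id when they match. The concept URI and the label texts are invented for the example, and `build_index` and `normalize` are hypothetical names.

```python
LABEL_PROPERTIES = ("prefLabel", "altLabel", "hiddenLabel")

# Invented example data: (subject, predicate, value).
TRIPLES = [
    ("https://example.org/vocab/c1", "prefLabel", "Video"),
    ("https://example.org/vocab/c1", "altLabel", "Film"),
    ("https://example.org/vocab/c1", "hiddenLabel", "Vidoe"),
    ("https://example.org/vocab/c1", "notation", "1"),
]


def build_index(triples, properties=LABEL_PROPERTIES):
    """Map every label of the given properties to its concept id."""
    index = {}
    for subject, predicate, value in triples:
        if predicate in properties:
            # On label collisions the first concept wins (an arbitrary choice).
            index.setdefault(value, subject)
    return index


def normalize(value, index):
    """Return the concept id for a known label, otherwise the value unchanged."""
    return index.get(value, value)


index = build_index(TRIPLES)
for raw in ["Video", "Film", "Vidoe", "Podcast"]:
    print(raw, "->", normalize(raw, index))
```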

dr0i added a commit to metafacture/metafacture-fix that referenced this issue Jun 21, 2022
Works like fix function 'lookup', also using a Map. The Map is build dynamically
querying an RDF model.
@dr0i dr0i assigned TobiasNx and unassigned dr0i Oct 27, 2022
dr0i added a commit to metafacture/metafacture-fix that referenced this issue Nov 4, 2022
Implementation against further tests from
metafacture/metafacture-core#415 (comment).

- adapt some falsely Fix
- reuse test file "hcrt.ttl"
- one test tagged as "todo" because it needs introduction of new parameter
- reformat hcrt.ttl
dr0i added a commit to metafacture/metafacture-fix that referenced this issue Nov 4, 2022
- enable integration test
- add test

See metafacture/metafacture-core#415.
@dr0i
Member

dr0i commented Nov 4, 2022

@TobiasNx I've added an optional parameter select, which takes "subject" or "object" as value. See your lookupRdfDefinedPropertyToSubject/test.fix for how to use it.

@TobiasNx
Contributor Author

TobiasNx commented Nov 10, 2022

@dr0i you still cannot differentiate between different objects, right? prefLabel.de or prefLabel.en

@TobiasNx TobiasNx assigned dr0i and unassigned TobiasNx Nov 10, 2022
@dr0i
Member

dr0i commented Nov 10, 2022

What do you mean? Can you add a new scenario ?

@dr0i dr0i assigned TobiasNx and unassigned dr0i Nov 10, 2022
@TobiasNx
Contributor Author

TobiasNx commented Nov 10, 2022

No, I did not look properly, and perhaps I do not understand the option.

The use case was the old lookupRdfDefinedPropertyToSubject/:

lookup_rdf("a", "./hcrt.ttl", match: "http://www.w3.org/2004/02/skos/core#prefLabel", match_language: "de")

Incoming element a: compare its value with prefLabel.de in the SKOS file. If it matches, output the corresponding subject/concept. Otherwise, don't change the value.

Other scenario, lookupRdfDefinedPropertyToProperty/:

lookup_rdf("a", "./hcrt.ttl", match: "http://www.w3.org/2004/02/skos/core#prefLabel", match_language: "de", target: "http://www.w3.org/2004/02/skos/core#prefLabel", target_language: "en")

Incoming element a: compare its value with prefLabel.de in the SKOS file. If it matches, output the corresponding prefLabel.en. Otherwise, don't change the value.

@TobiasNx TobiasNx assigned dr0i and unassigned TobiasNx Nov 10, 2022
@dr0i
Member

dr0i commented Nov 10, 2022

In the diff of the last commit (metafacture/metafacture-fix@765c224) I gave an example:

lookup_rdf('a', '../../../../../maps/hcrt.ttl', target: 'http://www.w3.org/2004/02/skos/core#prefLabel', target_language: 'de', select: 'subject')

@dr0i dr0i assigned TobiasNx and unassigned dr0i Nov 10, 2022
@dr0i
Member

dr0i commented Nov 18, 2022

Enabled lookupRdfDefinedPropertyToProperty.

Starting documentation here:
Parameters have the following meaning (abbreviations: S (Subject), P (Property), O (Object)):

mandatory: target: this is the targeted P. It is searched, and from there one of the following is resolved:

O) the O either if

(Getting O works in conjunction with an optional select_language. If you don't use this option, note that any O can be retrieved: if there are language versions in the data, it's not guaranteed which version is chosen.)

S) the S either if

P) another language version (O) of P either if

The optional parameters follow (a likely redundant explanation, as it's already noted in the mandatory section):
optional: select_language: either
a) the language version of P (when available) if input data is an existing S and no select or
b) an S having a P with the value of the non-URI data input or if in conjunction with select: "subject"
c) an O having a P with the value of the non-URI data input and in conjunction with select: "object"

optional: select: forces whether S or O is returned
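
To make the role of select more tangible, here is a Python sketch of one possible reading of the description above: a map is built from the statements of the target P, keyed by S when an O should be returned and keyed by O when an S should be returned. This is an interpretation for illustration, not the Metafix code; `build_map`, `apply_lookup` and the `TRIPLES` layout are hypothetical, and the data comes from the n4 snippet in the initial post.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "https://w3id.org/kim/hochschulfaechersystematik/"

# (S, P, O, language) statements from the n4 snippet.
TRIPLES = [
    (BASE + "n4", SKOS + "prefLabel", "Mathematik, Naturwissenschaften", "de"),
    (BASE + "n4", SKOS + "prefLabel", "Mathematics, Natural Sciences", "en"),
]


def build_map(triples, target, select, select_language=None):
    """Build a lookup map from all statements with the target P."""
    mapping = {}
    for s, p, o, lang in triples:
        if p != target:
            continue
        if select_language is not None and lang != select_language:
            continue
        if select == "object":
            mapping.setdefault(s, o)   # input is an S, output is an O
        elif select == "subject":
            mapping.setdefault(o, s)   # input is an O, output is an S
        else:
            raise ValueError("select must be 'subject' or 'object'")
    return mapping


def apply_lookup(value, mapping):
    """Replace the value if it is in the map, otherwise leave it unchanged."""
    return mapping.get(value, value)


to_label = build_map(TRIPLES, SKOS + "prefLabel", "object", "de")
to_subject = build_map(TRIPLES, SKOS + "prefLabel", "subject", "de")
print(apply_lookup(BASE + "n4", to_label))
print(apply_lookup("Mathematik, Naturwissenschaften", to_subject))
```

Without a select_language, `setdefault` keeps whichever language version comes first in the data, which mirrors the caveat that the returned version is not guaranteed.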

dr0i added a commit to metafacture/metafacture-fix that referenced this issue Dec 13, 2022
Works like fix function 'lookup', also using a Map. The Map is build dynamically
querying an RDF model.
dr0i pushed a commit to metafacture/metafacture-fix that referenced this issue Dec 13, 2022
dr0i added a commit to metafacture/metafacture-fix that referenced this issue Dec 13, 2022
Implementation against further tests from
metafacture/metafacture-core#415 (comment).

- adapt some falsely Fix
- reuse test file "hcrt.ttl"
- one test tagged as "todo" because it needs introduction of new parameter
- reformat hcrt.ttl
dr0i added a commit to metafacture/metafacture-fix that referenced this issue Dec 13, 2022
- enable integration test
- add test

See metafacture/metafacture-core#415.
@TobiasNx TobiasNx assigned dr0i and unassigned TobiasNx Jan 16, 2023
6 participants