How to best model the data? #5
dpriskorn announced in Announcements
I created this data model today for the Riksdagen open data to sentences project and I would like some feedback from the community.
Basically, the idea is to analyze all 160k documents and store every unique rawtoken and sentence in a database.
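For concreteness, here is a minimal sketch of what two such tables could look like; the actual tables are the ones in the UML diagram linked below, and the names and columns here are only an illustration, not the project's real schema.

```python
# Minimal, hypothetical sketch of the core tables -- not the actual datamodel
# from the repository, just an illustration of the deduplication idea.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS sentence (
    id   INTEGER PRIMARY KEY,
    text TEXT NOT NULL UNIQUE        -- each unique sentence is stored once
);

CREATE TABLE IF NOT EXISTS rawtoken (
    id         INTEGER PRIMARY KEY,
    text       TEXT NOT NULL UNIQUE, -- the token exactly as it appears in the text
    normalized TEXT NOT NULL         -- normalized (e.g. lowercased) form
);

-- many-to-many link: which rawtokens occur in which sentences
CREATE TABLE IF NOT EXISTS rawtoken_sentence (
    rawtoken_id INTEGER NOT NULL REFERENCES rawtoken(id),
    sentence_id INTEGER NOT NULL REFERENCES sentence(id),
    PRIMARY KEY (rawtoken_id, sentence_id)
);
"""

conn = sqlite3.connect("riksdagen_sentences.db")  # placeholder filename
conn.executescript(schema)
conn.commit()
```

The join table is what keeps tokens and sentences deduplicated while still recording where each token occurs.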
This is going to be a huge database, and I'm not sure ToolsDB can handle it (WMF recommends Trove for databases larger than 125 GB).
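To get a feel for whether the 125 GB threshold is actually a problem, a back-of-envelope estimate helps; all the per-document and per-row numbers below are assumptions picked for illustration, not measurements.

```python
# Back-of-envelope size estimate. Every number below is an assumption;
# replace with measured averages from a sample of the corpus.
documents = 160_000           # from the project description
tokens_per_document = 5_000   # assumed average number of tokens per document
sentences_per_document = 300  # assumed average number of sentences per document
bytes_per_token_row = 100     # assumed row size incl. text, normalized form, indexes
bytes_per_sentence_row = 400  # assumed row size incl. sentence text and indexes

# Worst case with no deduplication at all; storing only *unique* tokens and
# sentences should shrink this considerably.
token_rows = documents * tokens_per_document
sentence_rows = documents * sentences_per_document

total_gb = (token_rows * bytes_per_token_row
            + sentence_rows * bytes_per_sentence_row) / 1e9
print(f"worst-case estimate: ~{total_gb:.0f} GB")  # ~99 GB with these numbers
```

Measuring the real averages on a few hundred documents first would show how close the deduplicated database actually gets to that limit.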
I want to store normalized tokens and later link the raw tokens to Wikidata Lexeme Form IDs.
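As a sketch of how that linking step could work, form IDs for a given representation can be looked up through the Wikidata Query Service; the snippet below assumes Swedish lexemes (Q9027) and exact matching on the form representation, which is just one possible approach, not necessarily how the project will do it.

```python
# Hedged sketch: look up Wikidata lexeme form IDs whose representation matches
# a token. Assumes Swedish lexemes (Q9027) and exact string matching.
import requests

WDQS = "https://query.wikidata.org/sparql"

def lexeme_form_ids(token: str) -> list[str]:
    query = """
    SELECT ?form WHERE {
      ?lexeme dct:language wd:Q9027 ;
              ontolex:lexicalForm ?form .
      ?form ontolex:representation ?rep .
      FILTER(STR(?rep) = "%s")
    }
    """ % token
    response = requests.get(
        WDQS,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "riksdagen_sentences-sketch/0.1"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    # Form URIs look like http://www.wikidata.org/entity/L1234-F1 -> keep the ID
    return [b["form"]["value"].rsplit("/", 1)[-1] for b in bindings]

print(lexeme_form_ids("och"))  # prints a list of form IDs such as ['L...-F2']
```

The resulting IDs could then live in a separate link table (or a nullable column) on the rawtoken side, so tokens that have no matching lexeme form yet simply stay unlinked.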
I'm curious to hear what you think.
The different tables are explained in the UML here:
https://github.com/dpriskorn/riksdagen_sentences/blob/save_to_database/diagrams/datamodel.puml