CI: check of Scribe-Data Wikidata queries #48

Closed

andrewtavis opened this issue Sep 18, 2023 · 12 comments

Labels: feature, good first issue, help wanted

@andrewtavis (Member) commented Sep 18, 2023

Description

This issue would add a test to Scribe-Data/tests that makes sure the data queries are running correctly. The general idea is that we'd load in the SPARQL queries and then put a LIMIT 1 or similar at the end of them by appending to the query strings. The very low limit is because we don't want to burden the Wikidata endpoints too much, but maybe we can increase it if need be. We'd then need to check that everything works within the CI process.
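
A minimal sketch of that idea, assuming the queries are stored as .sparql files and executed via the SPARQLWrapper package (both are assumptions for illustration, not confirmed project details):

```python
# Sketch: load a stored query, append a low LIMIT, and run it against
# the Wikidata endpoint. Assumes the stored query has no LIMIT of its own.
from pathlib import Path

from SPARQLWrapper import JSON, SPARQLWrapper

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def run_with_limit(query_file: Path, limit: int = 1) -> dict:
    """Execute the query in query_file with an appended LIMIT clause."""
    query = query_file.read_text()

    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT)
    sparql.setQuery(f"{query}\nLIMIT {limit}")  # keep endpoint load minimal
    sparql.setReturnFormat(JSON)

    return sparql.query().convert()  # standard SPARQL JSON results dict
```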

Contribution

Happy to work on this with someone or get to it myself eventually 😊

@andrewtavis andrewtavis added feature New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Sep 18, 2023
@andrewtavis andrewtavis changed the title CI: check of Scribe-Data queries CI: check of Scribe-Data Wikidata queries Sep 18, 2023
@m-charlton (Contributor)

Am I right in assuming that this task would do the following:

  • Load a SPARQL query from a text file
  • Substitute a placeholder (say, QID) with the actual QID
  • Set the 'LIMIT' to 1
  • Execute the query and wait for a response

If the response is OK, then the query was correctly formatted and the test has passed. If the response is not OK and there was network connectivity, then the query was incorrectly formatted and the test has failed.
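
A minimal pytest sketch of that pass/fail logic, reusing the hypothetical run_with_limit helper from the issue description above (the src query directory is also an assumption):

```python
# Sketch: one parametrized test per stored query; an HTTP error from the
# endpoint propagates as an exception and registers as a test failure.
from pathlib import Path

import pytest

QUERY_FILES = sorted(Path("src").rglob("*.sparql"))


@pytest.mark.parametrize("query_file", QUERY_FILES, ids=str)
def test_query_is_well_formed(query_file):
    results = run_with_limit(query_file, limit=1)

    # A well-formed query always yields the standard results structure.
    assert "bindings" in results["results"]
```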

Would this test be applied to all SPARQL queries or just a subset?

@andrewtavis (Member, Author)

Hey @m-charlton 👋 I think we can just do LIMIT 1 and not set an actual QID. Let me check with @wkyoshida on how many of them we should check. What are you thinking? We obviously need all of them to be working, but then again they won't be changing very much.

@m-charlton (Contributor)

Still getting my thoughts together on this, so a lot of what I'm about to say may be unrealistic or not a good idea. Here goes:

Integration tests

I don't see these tests being run as often as unit tests. There's a pytest integration plugin which could be useful in this respect.

Run as a development tool

Options would include --all: run all SPARQL queries, and --changed: only run those queries that have changed. This tool could be run manually and/or as a git hook. I don't know if this is possible.

Either way it would be a good idea to test queries that have changed, and the gitpython package might help us with this - see the sketch below.
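
One rough sketch of how gitpython could surface those changed queries (the .sparql extension and repository layout are assumptions):

```python
# Sketch: collect .sparql files that are new or modified in the working tree.
from git import Repo  # pip install gitpython


def changed_query_files(repo_path: str = ".") -> list[str]:
    repo = Repo(repo_path, search_parent_directories=True)

    changed = {d.a_path for d in repo.index.diff(None)}     # unstaged edits
    changed |= {d.a_path for d in repo.index.diff("HEAD")}  # staged edits
    changed |= set(repo.untracked_files)                    # brand-new files

    return sorted(p for p in changed if p and p.endswith(".sparql"))
```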

@wkyoshida (Member)

That sounds great, @m-charlton 🙌

It's true that these tests perhaps do lean more on the side of integration testing rather than unit testing - something that @andrewtavis and I have discussed before actually. Here are some additional thoughts on the topic:

  • I actually did feel fine still running some of these integration tests as part of the CI run - mostly because we simply don't have enough changes coming in at a rate that would be troublesome 😆 (maybe this changes in the future)
  • But - I am a fan of your idea of running only --changed as part of the CI. That does make sense to me as what we'd likely want to achieve with our CI
  • @andrewtavis, to your question, I am thinking though of perhaps having a test sample larger than 1, just to be sure things don't only work for the singular result of LIMIT 1. Nothing crazy though; a LIMIT 3 or LIMIT 5 is probably fine
  • I am wondering though if running --all, as @m-charlton pointed out, could be a good idea at certain frequencies. Not sure if SPARQL/Wikidata could be subject to change - maybe yes, maybe not (not sure if @andrewtavis would have a better idea) - but if it could change, it could be useful to occasionally run --all just to make sure all queries still work fine

@andrewtavis (Member, Author)

> But - I am a fan of your idea of running only --changed as part of the CI. That does make sense to me as what we'd likely want to achieve with our CI

I second this 🙋‍♂️ And LIMIT 3/5 also works for me :) Let's just do 5 🙃

> Not sure if SPARQL/Wikidata could be subject to change

We have a fairly robust Stable Interface Policy, and I really don't see any issues with this in the near future. We have valid SPARQL queries going to the appropriate endpoints through a relatively maintained package, so really our only worry would be that the endpoints change. There are hopes that part of Wikidata will migrate to Wikibase instances eventually, but I've never heard Lexemes mentioned in this way, as the team and community are really focusing on growing it right now. Maybe eventually if it gets huge - minimum years from now - and we'd have lots of forewarning.

Having the option to run --all from time to time would be good. For Scribe-Server the update will be - what, bi-weekly is what we were saying? Maybe we can set up a job in Server to run on the off week just so we'd know if there's a problem and would have a week to figure it out?

Btw, @m-charlton, I see that you're in the Matrix space! Great to have you :) We'll be talking next week in the General channel about a bi-weekly developers call we'll be setting up (we like things that happen every two weeks 😊). Would be great to have you join the channel and the eventual call!

@m-charlton (Contributor)

I think the next step is for me to come up with a prototype: checkquery.py. Here are my first thoughts on the interface:

Arguments

  • (-h|--help): prints usage information
  • (-e|--endpoint): specify the SPARQL endpoint. Default = "https://query.wikidata.org/sparql"
  • (-p|--ping): 'ping' the endpoint to check connectivity.
  • (-a|--all): check all SPARQL queries.
  • (-c|--changed): check only those SPARQL queries that have changed.
  • (--path): can be either a file containing a query, or a directory. If a directory, only queries in and below it will be checked. Default = "Scribe-Data/src"
  • (-l|--limit): the maximum number of results each successful query should return. Default = 5.
  • (-v|--verbose): print out extra information. E.g. The results of the query.

These are just some ideas. Feedback and questions are appreciated, as are ideas as to what features would be needed for a minimum viable product. I've not yet figured out how to implement the --changed feature, so it may be delivered after the initial release.

Successes are printed to STDOUT, failures to STDERR. The script will terminate with an exit code of 0, indicating success; any other exit code indicates failure.
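
A minimal argparse skeleton matching the flags above might look like this (the dispatch logic is stubbed out; purely illustrative, not the actual prototype):

```python
# Sketch: wire the proposed flags into argparse; defaults mirror the list above.
import argparse
import sys


def main() -> int:
    parser = argparse.ArgumentParser(
        prog="checkquery",
        description="run SPARQL queries from the 'Scribe-Data' project",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("-a", "--all", action="store_true", help="check all SPARQL queries")
    mode.add_argument("-c", "--changed", action="store_true", help="check only changed queries")
    mode.add_argument("-p", "--ping", action="store_true", help="check endpoint connectivity")
    parser.add_argument("-e", "--endpoint", default="https://query.wikidata.org/sparql")
    parser.add_argument("--path", default="Scribe-Data/src", help="query file or directory")
    parser.add_argument("-l", "--limit", type=int, default=5, help="max results per query")
    parser.add_argument("-v", "--verbose", action="store_true", help="print extra information")
    args = parser.parse_args()

    # Dispatch to the actual check logic here; 0 signals success.
    return 0


if __name__ == "__main__":
    sys.exit(main())
```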

Should be around next week; looking forward to meeting you all.

@andrewtavis (Member, Author)

We can do some further research to figure out --changed later and maybe make an issue for it specifically, @m-charlton :) Thanks for all the ideation here! Really looking forward to getting this up and running!

@m-charlton (Contributor) commented Sep 30, 2023

I've written a prototype CLI tool checkquery:

checkquery --help

usage: checkquery [-h] (-c | -a | -f FILE | -p) [-e ENDPOINT] [-l LIMIT] [-v] [--timeout TIMEOUT]

run SPARQL queries from the 'Scribe-Data' project

optional arguments:
  -h, --help            show this help message and exit
  -c, --changed         run only changed/new SPARQL queries (default: False)
  -a, --all             run all SPARQL queries (default: False)
  -f FILE, --file FILE  path to a file containing a valid SPARQL query (default: None)
  -p, --ping            check responsiveness of configured SPARQL endpoint (default: False)
  -e ENDPOINT, --endpoint ENDPOINT
                        URL of the SPARQL endpoint (default: https://query.wikidata.org/sparql)
  -l LIMIT, --limit LIMIT
                        the maximum number of results a query should return (default: 5)
  -v, --verbose         increase output verbosity (default: False)
  --timeout TIMEOUT     maximum number of seconds to wait for a response from the endpoint when 'pinging' (default: 10)

A major constraint is that it must be executed in a project under git control - in this case Scribe-Data/ or any of its sub-dirs. This is because the tool uses git to figure out what's changed.

Still a lot to do, especially with robustness testing and success/failure reporting.

@m-charlton (Contributor)

Please find enclosed a beta-quality release of checkquery. Directions:

  • unzip in the Scribe-Data directory.
  • chmod u+x checkquery (may not be necessary).
  • run ./checkquery --help for usage message.

I've done some testing, and the most likely error you will see is HTTP Error 429: Too Many Requests, which is self-explanatory. I'm thinking of some solutions involving multiple attempts and increasing the wait between attempts.

A lot to do around:

  • reporting
  • usage message needs tweaking
  • test cases need to be written

Feedback appreciated.

checkquery.zip

@andrewtavis (Member, Author)

Heyo @m-charlton! I'm slowly getting back into the groove of things post being sick. Very sorry for the wait on all this. As we discussed, we have a dev sync tomorrow at 14:00 UTC that would be great to have you join. Details for the call are in the General channel on Matrix.

Will get to the above in the coming days as well as the PR for #50. Thanks again so much! 🙏

@m-charlton (Contributor)

@andrewtavis I should be around tomorrow; if so, I'll see you then. I've been working on checkquery and made the following improvements:

Reporting

Now uses tqdm as other tools in the project do. Error reporting has been improved. Still some tweaking to do.

Robustness

Any HTTP error (e.g. 429 'Too Many Requests') will not fail immediately. Instead the tool will wait a number of seconds and execute the query again. It will attempt to execute the query a maximum of three times before giving up and failing that query.
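
One possible shape for that retry loop, assuming HTTP failures surface as urllib.error.HTTPError (illustrative only, not the actual checkquery code):

```python
# Sketch: retry a query up to three times, backing off longer after each
# HTTP error (e.g. 429 Too Many Requests) before giving up.
import time
import urllib.error


def execute_with_retries(run_query, max_attempts: int = 3, base_wait: float = 5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except urllib.error.HTTPError:
            if attempt == max_attempts:
                raise  # final attempt failed: report this query as failing
            time.sleep(base_wait * attempt)  # wait a little longer each round
```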

Testing

More extensive ad-hoc testing undertaken. Still need to think about unit tests.

Please ignore all previous versions. Feedback & ideas appreciated.

checkquery.zip

m-charlton added three commits to m-charlton/Scribe-Data that referenced this issue Nov 3, 2023
andrewtavis added a commit that referenced this issue Nov 4, 2023
feat(checkquery): tool for checking SPARQL query files (resolves #48)
@github-project-automation github-project-automation bot moved this from Todo to Done in Scribe Board Nov 4, 2023
@andrewtavis (Member, Author)

Closed via #53 :) Let's discuss what's next during the sync, @m-charlton! Would be interested to know what you'd like to work on or if you have some ideas for how to improve things!

Really thanks so much! This is going to make the transfer to Scribe-Server so much better from the start 🚀
