CI: check of Scribe-Data Wikidata queries #48

Closed

andrewtavis opened this issue Sep 18, 2023 · 12 comments

Labels: feature, good first issue, help wanted

@andrewtavis (Member) commented Sep 18, 2023

Description

This issue would add a test to Scribe-Data/tests that makes sure the data queries are running correctly. The general idea is that we'd load in the SPARQL queries and then put a LIMIT 1 or similar at the end of them by appending to the query strings. The very low limit is because we don't want to burden the Wikidata endpoints too much, but maybe we can increase it if need be. We'd then need to check that everything works within the CI process.
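
A minimal sketch of that idea, assuming the queries are stored as .sparql files and executed via the SPARQLWrapper package (both are assumptions for illustration, not confirmed project details):

```python
# Sketch: load a stored query, append a low LIMIT, and run it against
# the Wikidata endpoint. Assumes the stored query has no LIMIT of its own.
from pathlib import Path

from SPARQLWrapper import JSON, SPARQLWrapper

WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def run_with_limit(query_file: Path, limit: int = 1) -> dict:
    """Execute the query in query_file with an appended LIMIT clause."""
    query = query_file.read_text()

    sparql = SPARQLWrapper(WIKIDATA_ENDPOINT)
    sparql.setQuery(f"{query}\nLIMIT {limit}")  # keep endpoint load minimal
    sparql.setReturnFormat(JSON)

    return sparql.query().convert()  # standard SPARQL JSON results dict
```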

Contribution

Happy to work on this with someone or get to it myself eventually 😊

@andrewtavis andrewtavis added feature New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Sep 18, 2023
@andrewtavis andrewtavis changed the title CI: check of Scribe-Data queries CI: check of Scribe-Data Wikidata queries Sep 18, 2023
@m-charlton (Contributor)

Am I right in assuming that this task would do the following:

  • Load a SPARQL query from a text file
  • Substitute a placeholder (say, QID) with the actual QID
  • Set the 'LIMIT' to 1
  • Execute the query and wait for a response

If the response is OK, then the query was correctly formatted and the test has passed. If the response is not OK and there was network connectivity, then the query was incorrectly formatted and the test has failed.
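
A minimal pytest sketch of that pass/fail logic, reusing the hypothetical run_with_limit helper from the issue description above (the src query directory is also an assumption):

```python
# Sketch: one parametrized test per stored query; an HTTP error from the
# endpoint propagates as an exception and registers as a test failure.
from pathlib import Path

import pytest

QUERY_FILES = sorted(Path("src").rglob("*.sparql"))


@pytest.mark.parametrize("query_file", QUERY_FILES, ids=str)
def test_query_is_well_formed(query_file):
    results = run_with_limit(query_file, limit=1)

    # A well-formed query always yields the standard results structure.
    assert "bindings" in results["results"]
```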

Would this test be applied to all SPARQL queries or just a subset?

@andrewtavis (Member, Author)

Hey @m-charlton 👋 I think we can just do LIMIT 1 and not set an actual QID. Let me check with @wkyoshida on how many of them we should check. What are you thinking? We obviously need all of them to be working, but then again they won't be changing very much.

@m-charlton (Contributor)

Still getting my thoughts together on this, so a lot of what I'm about to say may be unrealistic or not a good idea. Here goes:

Integration tests

I don't see these tests being run as often as unit tests. There's a pytest integration plugin which could be useful in this respect.

Run as a development tool

Options would include --all: run all SPARQL queries, and --changed: only run those queries that have changed. This tool could be run manually and/or as a git hook. I don't know if this is possible.

Either way it would be a good idea to test queries that have changed, and the gitpython package might help us with this - see the sketch below.
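
One rough sketch of how gitpython could surface those changed queries (the .sparql extension and repository layout are assumptions):

```python
# Sketch: collect .sparql files that are new or modified in the working tree.
from git import Repo  # pip install gitpython


def changed_query_files(repo_path: str = ".") -> list[str]:
    repo = Repo(repo_path, search_parent_directories=True)

    changed = {d.a_path for d in repo.index.diff(None)}     # unstaged edits
    changed |= {d.a_path for d in repo.index.diff("HEAD")}  # staged edits
    changed |= set(repo.untracked_files)                    # brand-new files

    return sorted(p for p in changed if p and p.endswith(".sparql"))
```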

@wkyoshida (Member)

That sounds great, @m-charlton 🙌

It's true that these tests perhaps do lean more on the side of integration testing rather than unit testing - something that @andrewtavis and I have discussed before actually. Here are some additional thoughts on the topic:

  • I actually did feel fine still running some of these integration tests as part of the CI run - mostly because we simply don't have enough changes coming in at a rate that would be troublesome 😆 (maybe this changes in the future)
  • But - I am a fan of your idea of running only --changed as part of the CI. That does make sense to me as what we'd likely want to achieve with our CI
  • @andrewtavis, to your question, I am thinking though of perhaps having a test sample larger than 1, just to be sure things don't only work for the singular result of LIMIT 1. Nothing crazy though; a LIMIT 3 or LIMIT 5 is probably fine
  • I am wondering though if running --all, as @m-charlton pointed out, could be a good idea at certain frequencies. Not sure if SPARQL/Wikidata could be subject to change - maybe yes, maybe not (not sure if @andrewtavis would have a better idea) - but if it could change, it could be useful to occasionally run --all just to make sure all queries still work fine

@andrewtavis (Member, Author)

> But - I am a fan of your idea of running only --changed as part of the CI. That does make sense to me as what we'd likely want to achieve with our CI

I second this 🙋‍♂️ And LIMIT 3/5 also works for me :) Let's just do 5 🙃

> Not sure if SPARQL/Wikidata could be subject to change

We have a fairly robust Stable Interface Policy, and I really don't see any issues with this in the near future. We have valid SPARQL queries going to the appropriate endpoints through a relatively maintained package, so really our only worry would be that the endpoints change. There are hopes that part of Wikidata will migrate to Wikibase instances eventually, but I've never heard Lexemes mentioned in this way, as the team and community are really focusing on growing it right now. Maybe eventually if it gets huge - minimum years from now - and we'd have lots of forewarning.

Having the option to run --all from time to time would be good. For Scribe-Server the update will be - what, bi-weekly is what we were saying? Maybe we can set up a job in Server to run on the off week just so we'd know if there's a problem and would have a week to figure it out?

Btw, @m-charlton, I see that you're in the Matrix space! Great to have you :) We'll be talking next week in the General channel about a bi-weekly developers call we'll be setting up (we like things that happen every two weeks 😊). Would be great to have you join the channel and the eventual call!

@m-charlton (Contributor)

I think the next step is for me to come up with a prototype: checkquery.py. Here are my first thoughts on the interface:

Arguments

  • (-h|--help): prints usage information
  • (-e|--endpoint): specify the SPARQL endpoint. Default = "https://query.wikidata.org/sparql"
  • (-p|--ping): 'ping' the endpoint to check connectivity.
  • (-a|--all): check all SPARQL queries.
  • (-c|--changed): check only those SPARQL queries that have changed.
  • (--path): can be either a file containing a query, or a directory. If a directory, only queries in and below it will be checked. Default = "Scribe-Data/src"
  • (-l|--limit): the maximum number of results each successful query should return. Default = 5.
  • (-v|--verbose): print out extra information. E.g. The results of the query.

These are just some ideas. Feedback and questions are appreciated, as are ideas as to what features would be needed for a minimum viable product. I've not yet figured out how to implement the --changed feature, so it may be delivered after the initial release.

Successes are printed to STDOUT, failures to STDERR. The script will terminate with an exit code of 0, indicating success; any other exit code indicates failure.
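
A minimal argparse skeleton matching the flags above might look like this (the dispatch logic is stubbed out; purely illustrative, not the actual prototype):

```python
# Sketch: wire the proposed flags into argparse; defaults mirror the list above.
import argparse
import sys


def main() -> int:
    parser = argparse.ArgumentParser(
        prog="checkquery",
        description="run SPARQL queries from the 'Scribe-Data' project",
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("-a", "--all", action="store_true", help="check all SPARQL queries")
    mode.add_argument("-c", "--changed", action="store_true", help="check only changed queries")
    mode.add_argument("-p", "--ping", action="store_true", help="check endpoint connectivity")
    parser.add_argument("-e", "--endpoint", default="https://query.wikidata.org/sparql")
    parser.add_argument("--path", default="Scribe-Data/src", help="query file or directory")
    parser.add_argument("-l", "--limit", type=int, default=5, help="max results per query")
    parser.add_argument("-v", "--verbose", action="store_true", help="print extra information")
    args = parser.parse_args()

    # Dispatch to the actual check logic here; 0 signals success.
    return 0


if __name__ == "__main__":
    sys.exit(main())
```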

Should be around next week; looking forward to meeting you all.

@andrewtavis (Member, Author)

We can do some further research to figure out --changed later and maybe make an issue for it specifically, @m-charlton :) Thanks for all the ideation here! Really looking forward to getting this up and running!

@m-charlton (Contributor) commented Sep 30, 2023

I've written a prototype CLI tool checkquery:

checkquery --help

usage: checkquery [-h] (-c | -a | -f FILE | -p) [-e ENDPOINT] [-l LIMIT] [-v] [--timeout TIMEOUT]

run SPARQL queries from the 'Scribe-Data' project

optional arguments:
  -h, --help            show this help message and exit
  -c, --changed         run only changed/new SPARQL queries (default: False)
  -a, --all             run all SPARQL queries (default: False)
  -f FILE, --file FILE  path to a file containing a valid SPARQL query (default: None)
  -p, --ping            check responsiveness of configured SPARQL endpoint (default: False)
  -e ENDPOINT, --endpoint ENDPOINT
                        URL of the SPARQL endpoint (default: https://query.wikidata.org/sparql)
  -l LIMIT, --limit LIMIT
                        the maximum number of results a query should return (default: 5)
  -v, --verbose         increase output verbosity (default: False)
  --timeout TIMEOUT     maximum number of seconds to wait for a response from the endpoint when 'pinging' (default: 10)

A major constraint is that it must be executed in a project under git control - in this case Scribe-Data/ or any of its sub-dirs. This is because the tool uses git to figure out what's changed.

Still a lot to do, especially with robustness testing and success/failure reporting.

@m-charlton (Contributor)

Please find enclosed a beta-quality release of checkquery. Directions:

  • unzip in the Scribe-Data directory.
  • chmod u+x checkquery (may not be necessary).
  • run ./checkquery --help for usage message.

I've done some testing, and the most likely error you will see is HTTP Error 429: Too Many Requests, which is self-explanatory. I'm thinking of some solutions involving multiple attempts and increasing the wait between attempts.

A lot to do around:

  • reporting
  • usage message needs tweaking
  • test cases need to be written

Feedback appreciated.

checkquery.zip

@andrewtavis (Member, Author)

Heyo @m-charlton! I'm slowly getting back into the groove of things post being sick. Very sorry for the wait on all this. As we discussed, we have a dev sync tomorrow at 14:00 UTC that would be great to have you join. Details for the call are in the General channel on Matrix.

Will get to the above in the coming days as well as the PR for #50. Thanks again so much! 🙏

@m-charlton (Contributor)

@andrewtavis I should be around tomorrow; if so, I'll see you then. I've been working on checkquery and made the following improvements:

Reporting

Now uses tqdm as other tools in the project do. Error reporting has been improved. Still some tweaking to do.

Robustness

Any HTTP error (e.g. 429 'Too Many Requests') will not fail immediately. Instead the tool will wait a number of seconds and execute the query again. It will attempt to execute the query a maximum of three times before giving up and failing that query.
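
One possible shape for that retry loop, assuming HTTP failures surface as urllib.error.HTTPError (illustrative only, not the actual checkquery code):

```python
# Sketch: retry a query up to three times, backing off longer after each
# HTTP error (e.g. 429 Too Many Requests) before giving up.
import time
import urllib.error


def execute_with_retries(run_query, max_attempts: int = 3, base_wait: float = 5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except urllib.error.HTTPError:
            if attempt == max_attempts:
                raise  # final attempt failed: report this query as failing
            time.sleep(base_wait * attempt)  # wait a little longer each round
```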

Testing

More extensive ad-hoc testing undertaken. Still need to think about unit tests.

Please ignore all previous versions. Feedback & ideas appreciated.

checkquery.zip

m-charlton added three commits to m-charlton/Scribe-Data that referenced this issue Nov 3, 2023
andrewtavis added a commit that referenced this issue Nov 4, 2023
feat(checkquery): tool for checking SPARQL query files (resolves #48)
@github-project-automation github-project-automation bot moved this from Todo to Done in Scribe Board Nov 4, 2023
@andrewtavis (Member, Author)

Closed via #53 :) Let's discuss what's next during the sync, @m-charlton! Would be interested to know what you'd like to work on or if you have some ideas for how to improve things!

Really thanks so much! This is going to make the transfer to Scribe-Server so much better from the start 🚀
