Tests are extremely flaky #4433
After looking through some CI failures that occurred for commits to master (i.e. after the CI tests had already passed for that PR and still failed), I've identified the following tests that failed (and therefore are flaky):
Will now start looking through each of these in a PR...
Kill/run
First, note that the
This one perplexes me. So, in the worker logs for the sharedFS run that failed here, we see the following:
Note first that
From these logs, we see this:
Now, a bundle's terminal state is set here in the rest server. For the terminal state to be READY, both failure_message and exit_code would have to be None. Therefore, that bundle did not have a failure_message when its status was sent up to the rest server, even though it must have had one at some point while it was transitioning to finalizing, since that was logged. (Note: before doing this, the failure_message and exit_code are added to the bundle metadata in the transition_bundle_finalizing function here.) So, what is happening here? Currently, I'm not sure. I have logged worker_run.as_dict in the PR for the fix to the tests to see if we can pick up worker_run dict for the next
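To make the reasoning above concrete, here is a minimal sketch of the rule as described in this comment: the terminal state is READY only when neither failure_message nor exit_code was reported. The function name `terminal_state`, the state strings, and the sample failure text are hypothetical and are not taken from the rest server code; the sketch also follows the comment's wording literally and says nothing about how an exit code of 0 is treated.

```python
from typing import Optional

# Hypothetical state names, used only for this illustration.
READY = "ready"
FAILED = "failed"


def terminal_state(failure_message: Optional[str], exit_code: Optional[int]) -> str:
    """Return READY only when neither field was reported (assumed rule)."""
    if failure_message is None and exit_code is None:
        return READY
    return FAILED


# What the worker logged while finalizing (made-up message text).
logged_by_worker = {"failure_message": "example failure", "exit_code": None}
# What the server would have to have received for the bundle to end up READY.
received_by_server = {"failure_message": None, "exit_code": None}

state_if_intact = terminal_state(**logged_by_worker)
state_if_dropped = terminal_state(**received_by_server)
```

Under this reading, a READY bundle implies that the failure_message was lost somewhere between the worker's finalizing transition and the check-in to the rest server, which is what logging worker_run.as_dict is meant to confirm.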
Make
The test that's failing in the example given is this one here (uploading local files to Azure Blob storage). The error is
My hypothesis is this: for the 'make' command in the bundle CLI, the make bundle is created first, meaning that the rest server adds it to the database as a MAKE bundle, and the bundle location for it is added slightly later. My thinking right now is that there's a race condition wherein the
For the fix, maybe Jiani and Ashwin can help out since they worked on this portion of the CLI.
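The ordering described above (bundle created first, location attached later) can be sketched with an in-memory stand-in. Everything here is hypothetical: `FakeBundleStore`, its methods, the uuid, and the location string are illustration only and do not mirror the real rest server or CLI. The exact race in the hypothesis is cut off in this thread, so the sketch only shows the window in which a reader would find a bundle with no location.

```python
from typing import Dict, Optional


class FakeBundleStore:
    """Hypothetical in-memory stand-in for the bundle database."""

    def __init__(self) -> None:
        self.states: Dict[str, str] = {}
        self.locations: Dict[str, str] = {}

    def create_bundle(self, uuid: str) -> None:
        # Step 1: the bundle row exists, but no location is attached yet.
        self.states[uuid] = "created"

    def add_location(self, uuid: str, location: str) -> None:
        # Step 2: the location is recorded in a separate, later call.
        self.locations[uuid] = location

    def get_location(self, uuid: str) -> Optional[str]:
        return self.locations.get(uuid)


store = FakeBundleStore()
store.create_bundle("0xabc")

# Anything that looks up the location between the two steps sees nothing.
seen_in_window = store.get_location("0xabc")

store.add_location("0xabc", "azure-blob")
seen_after = store.get_location("0xabc")
```

If something does pick up the bundle inside that window, one possible direction is to create the bundle and its location in a single step, but whether that fits the actual CLI flow is for the people who wrote it to judge.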
Resources
The two linked CI failures here and here are actually different failures within resources, which is interesting. For the first failure, it looks like the bundle with the command
Not sure what happened with the second one; the failure message is
Maybe for the first one we should try not connecting to Google / a live site?
Maybe it's not allocating 10 MB to the Kubernetes pod...
I think it's intentionally connecting to a live site because it's testing network access? The lines of code for that test are here:
Yeah, that could be...
Yeah but is it possible that Google servers are doing something weird? Maybe try connecting to another server like https://en.wikipedia.org/?
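If a single live site turns out to be the source of flakiness, one option is to have the network check accept any of several hosts. The sketch below is only an illustration of that idea in plain Python: `has_network_access` and `default_fetch` are hypothetical helpers, not the existing test in test_cli.py, and the fetch function is injectable so the logic can be exercised without touching the network.

```python
import http.client
import urllib.request
from typing import Callable, Iterable


def default_fetch(url: str, timeout: float = 10.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def has_network_access(
    urls: Iterable[str], fetch: Callable[[str], bool] = default_fetch
) -> bool:
    """Network access is considered working if any one of the hosts responds."""
    return any(fetch(url) for url in urls)
```

A call such as `has_network_access(["https://www.google.com", "https://en.wikipedia.org/"])` would then pass as long as at least one of the two sites is reachable, so a hiccup on one host alone does not fail the test.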
Related to #4433 -- let's see if this fixes flakiness.
Testing in #4457
* Test - change google to wikipedia for network request

Related to #4433 -- let's see if this fixes flakiness.

* Update test_cli.py

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
The CI tests are very flaky. Often, tests will fail and then pass when they are re-run.