-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: avoid unnecessary API call in QueryJob.result() when job is already finished #1900
Conversation
CC @chalmerlowe |
@@ -1316,37 +1343,6 @@ def test_result_w_empty_schema(self): | |||
self.assertEqual(result.location, "asia-northeast1") | |||
self.assertEqual(result.query_id, "xyz-abc") | |||
|
|||
def test_result_invokes_begins(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As of #967 released in google-cloud-bigquery 3.0.0, the _begin method is no longer used for query jobs.
@@ -1430,40 +1430,6 @@ def _reload_query_results( | |||
timeout=transport_timeout, | |||
) | |||
|
|||
def _done_or_raise(self, retry=DEFAULT_RETRY, timeout=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was overridden because we wanted result()
from the superclass to call jobs.getQueryResults
, not just jobs.get
(i.e. job.reload()
in Python). Now that we aren't using the superclass for result()
, this method is no longer necessary.
try: | ||
self.reload(retry=retry, timeout=transport_timeout) | ||
except exceptions.GoogleAPIError as exc: | ||
self.set_exception(exc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought: We probably should have been calling set_exception
based on the job status. Need to look into this further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. We are. 😅
self.set_exception(exception) |
Which we call from _set_properties
self._set_future_result() |
Which we call from reload
self._set_properties(api_response) |
# wait for the query to finish. Unlike most methods, | ||
# jobs.getQueryResults hangs as long as it can to ensure we | ||
# know when the query has finished as soon as possible. | ||
self._reload_query_results(retry=retry, timeout=timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh oh, if jobs.getQueryResults
fails because the job failed it can throw an exception but restart_query_job
will still be False
.
But we don't want restart_query_job = True
because sometimes this can raise an ambiguous exception such as quota exceeded, where we don't know if it's the job quota and it's a failed job or at a higher level (Google Frontend - GFE) where the job might actually still be running and/or succeeded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't the worst way to fail, but it'd be nice to do the jobs.get
call above in case of an exception to get a chance at retrying this job if the job failed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am halfway through my review.
Releasing these comments for now.
Will come back to this to finish out my review as soon as possible.
google/cloud/bigquery/job/query.py
Outdated
|
||
do_get_result() | ||
# timeout can be `None` or an object from our superclass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which superclass
are we discussing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
google.api_core.future.polling.PollingFuture._DEFAULT_VALUE
introduced in googleapis/python-api-core#462.
I've updated the comments with some more info as well as some things to consider in case we want to have a default value for timeout in future.
google/cloud/bigquery/retry.py
Outdated
@@ -39,7 +39,7 @@ | |||
# Allow for a few retries after the API request times out. This relevant for | |||
# rateLimitExceeded errors, which can be raised either by the Google load | |||
# balancer or the BigQuery job server. | |||
_DEFAULT_JOB_DEADLINE = 3.0 * _DEFAULT_RETRY_DEADLINE | |||
_DEFAULT_JOB_DEADLINE = 4.0 * _DEFAULT_RETRY_DEADLINE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of using 4.0 here?
Can we get a comment indicating why 4.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to 2.0 * (2.0 * _DEFAULT_RETRY_DEADLINE)
and added some explanation both here and in QueryJob.result()
.
Note: This still only gets us 1 query retry in the face of the problematic ambiguous error codes from jobs.getQueryResults()
but that's better than the nothing that we were actually getting before in some cases. I don't feel comfortable bumping this much further, though maybe 3.0 * 2.0 * _DEFAULT_RETRY_DEADLINE would be slightly less arbitrary at 1 hour?
@@ -970,7 +877,12 @@ def test_result(self): | |||
"rows": [{"f": [{"v": "abc"}]}], | |||
} | |||
conn = make_connection( | |||
query_resource, query_resource_done, job_resource_done, query_page_resource | |||
job_resource, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I am tracking the relationship between the make connection inputs versus the assert_has_calls checks.
Can you explain how these tests are supposed to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make_connection
is a convention in google-cloud-bigquery
unit tests that actually predates our use of the "mock" package. It mocks out the responses to REST API calls, previously with a fake implementation of our "Connection" class from the _http
module and now with a true mock object. For every quest that our test makes, there should be a corresponding response. As with Mock.side_effect, any exceptions in this list will be raised, instead.
I'm guessing your question also relates to "Why this particular set of requests / responses?". I've added some comments explaining why we're expecting this sequence of API calls. I've also updated this test to more explicitly check for a possible cause of customer issue b/332850329.
@@ -1289,7 +1217,18 @@ def test_result_w_retry(self): | |||
) | |||
|
|||
connection.api_request.assert_has_calls( | |||
[query_results_call, query_results_call, reload_call] | |||
[ | |||
reload_call, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thing here. Can I get some clarity on what we are doing and looking for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some explanation here as well as above in the make_connection()
call.
Co-authored-by: Chalmer Lowe <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
tests/unit/job/test_query.py
Outdated
|
||
job.result() | ||
with freezegun.freeze_time("1970-01-01 00:00:00", tick=False): | ||
job.result(timeout=1.125) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there are reason we are using such a specific number?
1.125
.
Can I get a comment here to let future me know why we picked this number?
method="GET", | ||
path=f"/projects/{self.PROJECT}/queries/{self.JOB_ID}", | ||
query_params={ | ||
"maxResults": 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is maxResults
of 0
synonymous with asking for all results? OR is it really asking for zero results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually want 0 rows. If we omit this or ask for non-zero number of rows, the jobs.getQueryResults
API can hang when the query has wide rows (many columns).
Co-authored-by: Chalmer Lowe <[email protected]>
…ast-fix-query-retry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the comments, etc.
Future me thanks you as well.
LGTM, APPROVED.
BEGIN_COMMIT_OVERRIDE
perf: avoid unnecessary API call in
QueryJob.result()
when job is already finished (#1900)fix: retry query jobs that fail even with ambiguous
jobs.getQueryResults
REST errors (#1903, #1900)END_COMMIT_OVERRIDE
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
query_and_wait
for lower-latency small queries python-bigquery-magics#15 since this loop for waiting for the query to finish will be easier to add a progress bar (if we decide that's needed)