-
Notifications
You must be signed in to change notification settings - Fork 122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add use_bqstorage_api option to read_gbq #270
ENH: Add use_bqstorage_api option to read_gbq #270
Conversation
The BigQuery Storage API provides a way to read query results quickly (and using multiple threads). It only works with large query results (~125 MB), but as of 1.11.1, the google-cloud-bigquery library can fallback to the BigQuery API to download results when a request to the BigQuery Storage API fails. As this API can increase costs (and may not be enabled on the user's project), this option is disabled by default.
ed4aaf2
to
5b34b28
Compare
FYI: I ran the test I added with the BQ Storage API on a 4 CPU-node instance.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚀
<https://github.com/googleapis/google-cloud-python/pull/7633>`__ | ||
(fixed in version 1.11.0), you must write your query results to a | ||
destination table. To do this with ``read_gbq``, supply a | ||
``configuration`` dictionary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could only allow importing this for versions >1.11.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts on this:
- I'd be more comfortable bumping the required version once there's been a few more versions of
google-cloud-bigquery
released past 1.11. I like people to have a few options in case there's a problematic regression in 1.11.1. - The failure on small results will be quite obvious (
InternalError
stacktrace) if they do try to use with an older version. If someone really wants to try to use it with an older version, they'll find out pretty quickly why they should upgrade, or they really know what they are doing and that they are always going to get results that are large enough for the BigQuery Storage API to read from. - Someday we might want to make
use_bqstorage_api=True
the default (maybe after the BQ Storage API goes GA?). I think it'd definitely be worth bumping the minimum version at that time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that sounds v reasonable. Agree re making the default True when it's GA (or at least soliciting opinions)
tbc I was suggesting only for using bqstorage do we make the min version 1.11 (probably implemented above when attempting the import). Other versions would still run everything else
@@ -727,6 +740,21 @@ def _localize_df(schema_fields, df): | |||
return df | |||
|
|||
|
|||
def _make_bqstorage_client(use_bqstorage_api, credentials): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this take a project too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The BigQuery Storage API client doesn't actually take a project. The project is set when creating a BQ Storage API read session by to_dataframe
based on the BQ client's project.
Aside: The BQ Storage API client is partially auto-generated, and the client generator doesn't support a default project ID. Since I'm considering the BQ Storage API a lower-level client that's used by the normal BQ client from to_dataframe
, I felt it wasn't worth adding in that kind of handwritten helper to the autogenned layer.
docs/source/changelog.rst
Outdated
@@ -39,8 +39,16 @@ Enhancements | |||
(contributed by @johnpaton) | |||
- Read ``project_id`` in :func:`to_gbq` from provided ``credentials`` if | |||
available (contributed by @daureg) | |||
<<<<<<< Updated upstream |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI merge conflict
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
Thank you @tswast ! 🚀 |
The BigQuery Storage API provides a way to read query results quickly
(and using multiple threads). It only works with large query results
(~125 MB), but as of 1.11.1, the google-cloud-bigquery library can
fallback to the BigQuery API to download results when a request to the
BigQuery Storage API fails.
As this API can increase costs (and may not be enabled on the user's
project), this option is disabled by default.