Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add use_bqstorage_api option to read_gbq #270

Merged
merged 3 commits into from
Apr 5, 2019

Conversation

tswast
Copy link
Collaborator

@tswast tswast commented Apr 4, 2019

The BigQuery Storage API provides a way to read query results quickly
(and using multiple threads). It only works with large query results
(~125 MB), but as of 1.11.1, the google-cloud-bigquery library can
fallback to the BigQuery API to download results when a request to the
BigQuery Storage API fails.

As this API can increase costs (and may not be enabled on the user's
project), this option is disabled by default.

tswast added 2 commits April 4, 2019 17:43
The BigQuery Storage API provides a way to read query results quickly
(and using multiple threads). It only works with large query results
(~125 MB), but as of 1.11.1, the google-cloud-bigquery library can
fallback to the BigQuery API to download results when a request to the
BigQuery Storage API fails.

As this API can increase costs (and may not be enabled on the user's
project), this option is disabled by default.
@tswast tswast force-pushed the issue133-performance-bqstorage branch from ed4aaf2 to 5b34b28 Compare April 4, 2019 22:45
@tswast
Copy link
Collaborator Author

tswast commented Apr 4, 2019

FYI: I ran the test I added with the BQ Storage API on a 4 CPU-node instance.

  • use_bqstorage_api=True : 76.62 seconds (63.48 seconds after query results are cached)
  • use_bqstorage_api=False : 265.41 seconds

Copy link
Contributor

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

<https://github.com/googleapis/google-cloud-python/pull/7633>`__
(fixed in version 1.11.0), you must write your query results to a
destination table. To do this with ``read_gbq``, supply a
``configuration`` dictionary.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could only allow importing this for versions >1.11.0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts on this:

  • I'd be more comfortable bumping the required version once there's been a few more versions of google-cloud-bigquery released past 1.11. I like people to have a few options in case there's a problematic regression in 1.11.1.
  • The failure on small results will be quite obvious (InternalError stacktrace) if they do try to use with an older version. If someone really wants to try to use it with an older version, they'll find out pretty quickly why they should upgrade, or they really know what they are doing and that they are always going to get results that are large enough for the BigQuery Storage API to read from.
  • Someday we might want to make use_bqstorage_api=True the default (maybe after the BQ Storage API goes GA?). I think it'd definitely be worth bumping the minimum version at that time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that sounds v reasonable. Agree re making the default True when it's GA (or at least soliciting opinions)

tbc I was suggesting only for using bqstorage do we make the min version 1.11 (probably implemented above when attempting the import). Other versions would still run everything else

@@ -727,6 +740,21 @@ def _localize_df(schema_fields, df):
return df


def _make_bqstorage_client(use_bqstorage_api, credentials):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this take a project too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BigQuery Storage API client doesn't actually take a project. The project is set when creating a BQ Storage API read session by to_dataframe based on the BQ client's project.

Aside: The BQ Storage API client is partially auto-generated, and the client generator doesn't support a default project ID. Since I'm considering the BQ Storage API a lower-level client that's used by the normal BQ client from to_dataframe, I felt it wasn't worth adding in that kind of handwritten helper to the autogenned layer.

@@ -39,8 +39,16 @@ Enhancements
(contributed by @johnpaton)
- Read ``project_id`` in :func:`to_gbq` from provided ``credentials`` if
available (contributed by @daureg)
<<<<<<< Updated upstream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI merge conflict

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, thanks!

@tswast tswast marked this pull request as ready for review April 5, 2019 13:03
@tswast tswast merged commit 5e80908 into googleapis:master Apr 5, 2019
@tswast tswast deleted the issue133-performance-bqstorage branch April 5, 2019 13:13
@max-sixty
Copy link
Contributor

Thank you @tswast ! 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants