Add progress for to_gbq function #166
Conversation
pandas_gbq/gbq.py
Outdated
for remaining_rows in load.load_chunks(
        self.client, dataframe, dataset_id, table_id,
        chunksize=chunksize, schema=schema):
chunks = load.load_chunks(self.client, dataframe, dataset_id, table_id,
E501 line too long (83 > 79 characters)
pandas_gbq/gbq.py
Outdated
@@ -861,7 +864,7 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

 def to_gbq(dataframe, destination_table, project_id, chunksize=None,
            verbose=None, reauth=False, if_exists='fail', private_key=None,
-           auth_local_webserver=False, table_schema=None):
+           auth_local_webserver=False, table_schema=None, progress_bar=False):
Could you add this to the doc-string?
Yes, sure. I'll add it; I wanted to get feedback before making it ready for merging.
This looks great! I didn't know about the library but it looks good.
Could you please add a test with progress_bar=True?
Could you add a short note to the whatsnew? Feel free to add yourself as the contributor.
You could have a default of None, and then use tqdm if it's available and not if it's not. But no strong view from me.
@tswast is there anywhere we list optional dependencies?
Thanks @aktech
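The default-of-None suggestion above could look something like the sketch below. This is only an illustration of the proposed semantics, not the actual patch; the helper name `resolve_progress_bar` is assumed.

```python
# Sketch of "default None means use tqdm if installed" (assumed helper name,
# not part of pandas-gbq).
try:
    import tqdm  # optional dependency
except ImportError:
    tqdm = None


def resolve_progress_bar(progress_bar=None):
    """Decide whether to show a progress bar.

    None  -> show one only if tqdm is importable
    True  -> caller explicitly requested a bar
    False -> caller explicitly disabled it
    """
    if progress_bar is None:
        return tqdm is not None
    return bool(progress_bar)
```

An explicit True/False always wins; only the unset case falls back to detecting tqdm.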
Thanks for the quick review @maxim-lian I will make the changes soon.
@@ -2,3 +2,4 @@ pandas
 google-auth
 google-auth-oauthlib
 google-cloud-bigquery
+tqdm
requirements.txt is only used by developers / CI testing. You need to add this to setup.py for it to actually get listed as a dependency.
As @maxim-lian says, we could make this an optional dependency. If you go that route, use the "extras" section of setup.py (example).
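The "extras" suggestion would mean declaring tqdm in `extras_require` rather than `install_requires`. A minimal sketch of the shape such a setup.py fragment could take (illustrative only; these dicts and the exact dependency lists are assumptions, not the project's real file):

```python
# Illustrative setup.py fragment: tqdm as an optional "extra".
# Users would opt in with: pip install pandas-gbq[tqdm]
INSTALL_REQUIRES = [
    "pandas",
    "google-auth",
    "google-auth-oauthlib",
    "google-cloud-bigquery",
]

EXTRAS_REQUIRE = {
    "tqdm": ["tqdm"],  # optional progress-bar support
}

# These would then be passed to setuptools.setup(), e.g.:
#     setup(..., install_requires=INSTALL_REQUIRES,
#           extras_require=EXTRAS_REQUIRE)
```

With this split, a plain `pip install` never pulls in tqdm, so the library must tolerate its absence at import time.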
pandas_gbq/gbq.py
Outdated
    schema=schema)
if progress_bar:
    from tqdm import tqdm
    chunks = tqdm(chunks)
In many cases there will be just one chunk, even for quite large dataframes. To be a useful progress indicator, this change should manually poll to see if the load job is complete, as is done for query jobs:
Well, tqdm can't do anything in this case. It's more useful when there are many chunks. The percentage done is based on the proportion of chunks completed with respect to the total number of chunks.
I'd caution against this change in that case. Load jobs are a limited resource (1,000 per day), so I'd rather encourage people to use fewer load jobs than more. Chunks should only be used when there is extreme memory pressure and a dataframe does not fit into memory if we try to serialize to CSV before upload.
I just tried tqdm locally and it's very good, much nicer than the current logging. I vote that we at least enable it by default if it's installed.
Indeed.
pandas_gbq/gbq.py
Outdated
        logger.info("\rLoad is {0}% Complete".format(
            ((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
    self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
    try:
        import tqdm
F401 'tqdm' imported but unused
pandas_gbq/gbq.py
Outdated
        logger.info("\rLoad is {0}% Complete".format(
            ((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
    self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
    try:
        import tqdm  # noqa
Since this dependency is optional, there's no need to put it in a function. The only reason the google-cloud-bigquery imports are in a function is that they are required, and it helps to produce a better error message. Ideally there should be no errors if tqdm is not installed.
Example of optional import:
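(The original comment linked to an example here.) The module-level pattern being suggested could be sketched as follows; the `maybe_wrap_with_progress` helper is an assumed name for illustration, not the actual pandas-gbq code:

```python
# Sketch of a module-level optional import: bind tqdm to None when absent,
# so no error is ever raised just because tqdm is not installed.
try:
    import tqdm
except ImportError:
    tqdm = None


def maybe_wrap_with_progress(chunks, progress_bar=False):
    """Wrap an iterable in tqdm only when requested AND available."""
    if progress_bar and tqdm is not None:
        return tqdm.tqdm(chunks)
    return chunks
```

Callers then iterate the returned object either way; the progress bar is purely additive.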
(I didn't see this before my comment.)
pandas_gbq/gbq.py
Outdated
        logger.info("\rLoad is {0}% Complete".format(
            ((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
    self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
I don't mean to slow you down with tweaks. That said, there's no need for this to be a method; you can have this inline, like https://github.com/pandas-dev/pandas/blob/73085773998e12e85f1044771069f63f4e8d65ad/pandas/tests/generic/test_frame.py#L23
@tswast do you know whether these are failures from master rather than @aktech's work? https://travis-ci.org/pydata/pandas-gbq/jobs/370512748
@maxim-lian Those failures look like a bad version of
pandas_gbq/gbq.py
Outdated
if progress_bar and self._check_if_tqdm_exists():
    from tqdm import tqdm
    chunks = tqdm(chunks)
for remaining_rows in chunks:
    logger.info("\rLoad is {0}% Complete".format(
@aktech we can change this in a follow-up PR if you prefer, but I don't think we need to log in a loop when we have tqdm.
From my local testing, printing anything while tqdm is running makes the tqdm output much worse.
Yeah, indeed. I forgot to remove that.
I actually don't think this change addresses #162, for the reasons I state in #166 (comment). Since there is already a logging message for percentage based on chunks, I believe #162 is about the default (and encouraged) case when there is just one chunk. To show progress in the case of a single chunk, one must manually poll to see if the load job is complete, as is done for query jobs.

Why do I say a single chunk should be encouraged? Load jobs are a limited resource (1,000 per day), so I'd rather encourage people to use fewer load jobs than more. Chunks should only be used when there is extreme memory pressure and a dataframe does not fit into memory if we try to serialize to CSV before upload. Possibly we should even deprecate the chunks option for these reasons.
Right, I think you could still use this for one chunk, though it would be a counter rather than a progress bar (unless anyone can find a way of getting actual progress?). Something like:

def wait_for_job(job, timeout_in_seconds=600):
    # https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/snippets.py
    start = datetime.datetime.now()
    timeout = start + datetime.timedelta(0, timeout_in_seconds)
    with tqdm(
            bar_format='Waiting for {desc} Elapsed: {elapsed}',
            total=10000) as progress:
        while True:
            job.reload()  # Refreshes the state via a GET request.
            progress.set_description(str(job))
            if job.state == 'DONE':
                if job.error_result:
                    raise RuntimeError(job.errors)
                progress.bar_format = 'Completed {desc}. Elapsed: {elapsed}'
                return
            if datetime.datetime.now() > timeout:
                raise IOError
            time.sleep(1)
Force-pushed from 2dfd111 to 2d69758 (compare)
LGTM. Thanks for the update.
One last thing: could you add to the changelog?
@tswast The progress of an individual chunk has not been added yet. Do you want to merge it without it?
    schema=schema)
if progress_bar and tqdm:
    chunks = tqdm.tqdm(chunks)
for remaining_rows in chunks:
    logger.info("\rLoad is {0}% Complete".format(
I thought we were going to either tqdm or log, rather than both?
But future PR is OK if you want to get this in
Yes, we can do it in another PR.
@aktech Yes, we can merge without it, since individual-chunk progress is different enough from this change to warrant a second PR.
Looks like GH thought I was still requesting changes. I just clicked the button. @aktech we'll merge now, and then we can add any improvements in the future?
@maxim-lian @tswast Sure.
Not sure if we are fine with using a library (tqdm) or whether we would like to create our own progress bar.

Attempt for this: #162
TODO:
- progress_bar docstring
- test with progress_bar=True (it's the default now, test not needed?)