
Add progress for to_gbq function #166

Merged (3 commits), Apr 27, 2018

Conversation
Conversation

@aktech (Contributor) commented Apr 19, 2018:

Not sure if we are fine with using a library (tqdm) or we would like to create our own progress bar.
Attempt at #162.

TODO:

  • Add a docstring for progress_bar
  • Add an update to whatsnew
  • Use tqdm if it's available and not if it's not.
  • Add a test with progress_bar=True (it's the default now, so maybe a test isn't needed?)

            for remaining_rows in load.load_chunks(
                    self.client, dataframe, dataset_id, table_id,
                    chunksize=chunksize, schema=schema):
            chunks = load.load_chunks(self.client, dataframe, dataset_id, table_id,

E501 line too long (83 > 79 characters)
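One way to keep that call under the 79-character limit is to wrap it across lines; a sketch using only the arguments visible in the diff above, not necessarily the exact fix that was pushed:

    chunks = load.load_chunks(
        self.client, dataframe, dataset_id, table_id,
        chunksize=chunksize, schema=schema)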

@@ -861,7 +864,7 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

def to_gbq(dataframe, destination_table, project_id, chunksize=None,
           verbose=None, reauth=False, if_exists='fail', private_key=None,
           auth_local_webserver=False, table_schema=None):
           auth_local_webserver=False, table_schema=None, progress_bar=False):
Contributor:

Could you add this to the doc-string?

Contributor Author:

Yes, sure. I'll add it; I wanted to get feedback before making it ready for merging.

@max-sixty (Contributor) left a review comment:

This looks great! I didn't know about the library but it looks good.

Could you please add a test with progress_bar=True?

Could you add a short note to whatsnew? Feel free to add yourself as the contributor.

You could have a default of None, and then use tqdm if it's available and not if it's not (see the sketch after this comment). But no strong view from me.

@tswast is there anywhere we list optional dependencies?

Thanks @aktech
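A minimal sketch of that default-of-None idea (the helper name is hypothetical; this is not the code that was merged):

    def _use_progress_bar(progress_bar, tqdm_module):
        # tqdm_module is the imported tqdm module, or None when tqdm is not
        # installed; progress_bar=None means "show a bar only if available".
        if progress_bar is None:
            return tqdm_module is not None
        return bool(progress_bar)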

@aktech (Contributor Author) commented Apr 19, 2018:

Thanks for the quick review, @maxim-lian. I will make the changes soon.

@@ -2,3 +2,4 @@ pandas
google-auth
google-auth-oauthlib
google-cloud-bigquery
tqdm
Collaborator:

requirements.txt is only used by developers / CI testing. You need to add this to setup.py for it to actually get listed as a dependency.

As @maxim-lian says, we could make this an optional dependency. If you go that route, use the "extras" section of setup.py (example).
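For reference, a sketch of declaring tqdm as an optional dependency via setuptools' extras_require (most setup() arguments are elided and the extra's name is illustrative, not taken from the repository's actual setup.py):

    from setuptools import setup

    setup(
        name='pandas-gbq',
        # ... version, packages, description, etc. elided ...
        install_requires=[
            'pandas',
            'google-auth',
            'google-auth-oauthlib',
            'google-cloud-bigquery',
        ],
        extras_require={
            # installed with: pip install pandas-gbq[tqdm]
            'tqdm': ['tqdm'],
        },
    )

A base install then stays lean, and users who want the progress bar opt in explicitly.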

                    schema=schema)
                if progress_bar:
                    from tqdm import tqdm
                    chunks = tqdm(chunks)

Collaborator:

In many cases there will be just one chunk, even for quite large dataframes. To be a useful progress indicator, this change should manually poll to see if the load job is complete as is done for query jobs:

https://github.com/pydata/pandas-gbq/blob/8f49ec3f134cc0a85f541116c42647c515fdc7e6/pandas_gbq/gbq.py#L504-L522

@aktech (Contributor Author) commented Apr 24, 2018:

Well, tqdm can't do anything in this case. It's more useful for a larger number of chunks. The percentage done is based on the proportion of chunks completed with respect to the total number of chunks.

Collaborator:

I'd caution against this change in that case. Load jobs are a limited resource (1,000 per day), so I'd rather encourage people to use fewer load jobs than more. Chunks should only be used when there is extreme memory pressure and a dataframe does not fit into memory if we try to serialize to CSV before upload.

@max-sixty (Contributor):
I just tried tqdm locally and it's v good - much nicer than the current logging. I vote that we at least enable-by-default-if-installed.

@aktech (Contributor Author) commented Apr 22, 2018:

> I just tried tqdm locally and it's v good - much nicer than the current logging. I vote that we at least enable-by-default-if-installed.

Indeed.

logger.info("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
try:
import tqdm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'tqdm' imported but unused

logger.info("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
try:
import tqdm # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this dependency is optional, there's no need to put it in a function. The only reason the google-cloud-bigquery imports are in a function is that they are required and it helps to make a better error message. Ideally there should be no errors if tqdm is not installed.

Example of optional import:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/ecb501c1d8b099af1d804ad1cfd2cb96575faf3e/bigquery/google/cloud/bigquery/table.py#L24-L27
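Adapted for tqdm, that module-level pattern would look roughly like this (a sketch, not the exact lines that were merged):

    try:
        import tqdm  # optional progress-bar dependency
    except ImportError:  # pragma: NO COVER
        tqdm = None

Downstream code can then just test the module object, e.g. if progress_bar and tqdm:, which is what a later revision of this diff does.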

Contributor:

(didn't see this before my comment)

logger.info("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to slow you down with tweaks. That said, there's no need for this to be a method - you can have this inline like https://github.com/pandas-dev/pandas/blob/73085773998e12e85f1044771069f63f4e8d65ad/pandas/tests/generic/test_frame.py#L23

@max-sixty (Contributor):

@tswast do you know whether these are failures from master rather than @aktech's work? https://travis-ci.org/pydata/pandas-gbq/jobs/370512748

@tswast (Collaborator) commented Apr 24, 2018:

@maxim-lian Those failures look like a bad version of setuptools is failing to install the google-api-core namespaced package. Probably unrelated to this change.

                if progress_bar and self._check_if_tqdm_exists():
                    from tqdm import tqdm
                    chunks = tqdm(chunks)
                for remaining_rows in chunks:
                    logger.info("\rLoad is {0}% Complete".format(

Contributor:

@aktech we can change this in a follow-up PR if you prefer, but I don't think we need to log in the loop now that we have tqdm.

From my local testing, printing anything while tqdm is running makes tqdm much worse.

Contributor Author:

Yeah, indeed. I forgot to remove that.
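As an aside (not something done in this PR): when a message really has to be emitted while a bar is active, tqdm's tqdm.write prints it above the bar instead of corrupting it. A minimal sketch:

    import time

    from tqdm import tqdm

    for chunk_number in tqdm(range(3), desc='chunks'):
        # tqdm.write prints the message without breaking the bar display
        tqdm.write('loaded chunk {0}'.format(chunk_number))
        time.sleep(0.1)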

@tswast (Collaborator) commented Apr 24, 2018:

I actually don't think this change addresses #162 for the reasons I state in #166 (comment)

Since there is already a logging message for percentage based on chunks, I believe #162 is for the default (and encouraged) case when there is just one chunk. To show progress in the case of a single chunk, one must manually poll to see if the load job is complete as is done for query jobs:

https://github.com/pydata/pandas-gbq/blob/8f49ec3f134cc0a85f541116c42647c515fdc7e6/pandas_gbq/gbq.py#L504-L522

Why do I say a single chunk should be encouraged? Load jobs are a limited resource (1,000 per day), so I'd rather encourage people to use fewer load jobs than more. Chunks should only be used when there is extreme memory pressure and a dataframe does not fit into memory if we try to serialize to CSV before upload. Possibly we should even deprecate the chunks option for these reasons.

@max-sixty (Contributor):

Right, I think you could still use this for one chunk - though it would be a counter rather than a progress bar (unless anyone can find a way of getting actual progress?)

Something like:

import datetime
import time

from tqdm import tqdm


def wait_for_job(job, timeout_in_seconds=600):
    # https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/snippets.py
    start = datetime.datetime.now()
    timeout = start + datetime.timedelta(0, timeout_in_seconds)
    with tqdm(
            bar_format='Waiting for {desc} Elapsed: {elapsed}',
            total=10000) as progress:
        while True:
            job.reload()  # Refreshes the state via a GET request.
            progress.set_description(str(job))
            if job.state == 'DONE':
                if job.error_result:
                    raise RuntimeError(job.errors)
                progress.bar_format = 'Completed {desc}. Elapsed: {elapsed}'
                return
            if datetime.datetime.now() > timeout:
                raise IOError
            time.sleep(1)

@aktech force-pushed the progress_bar branch 2 times, most recently from 2dfd111 to 2d69758 on April 27, 2018 15:10.
@tswast (Collaborator) left a review comment:

LGTM. Thanks for the update.

One last thing: could you add to the changelog?

@aktech (Contributor Author) commented Apr 27, 2018:

@tswast The progress of an individual chunk has not been added yet. Do you want to merge it without it?

                    schema=schema)
                if progress_bar and tqdm:
                    chunks = tqdm.tqdm(chunks)
                for remaining_rows in chunks:
                    logger.info("\rLoad is {0}% Complete".format(

@max-sixty (Contributor) commented Apr 27, 2018:

I thought we were going to either use tqdm or log, rather than both?
But a future PR is OK if you want to get this in.

Contributor Author:

Yes, we can do it in another PR.

@tswast (Collaborator) commented Apr 27, 2018:

@aktech Yes, we can merge without it, since progress for an individual chunk is different enough from this change to warrant a second PR.

@max-sixty (Contributor):
Looks like GH thought I was still requesting changes. I just clicked the button.

@aktech we'll merge now, and then we can add any improvements in the future?

@aktech (Contributor Author) commented Apr 27, 2018:

@maxim-lian @tswast sure.

@tswast merged commit d038ace into googleapis:master on Apr 27, 2018.