
Add progress for to_gbq function #166

Merged (3 commits), Apr 27, 2018

Conversation
Conversation

@aktech (Contributor) commented Apr 19, 2018:

Not sure if we are fine with using a library (tqdm) or we would like to create our own progress bar.
Attempt at #162.

TODO:

  • Add a docstring for progress_bar
  • Add an update to whatsnew
  • Use tqdm if it's available and not if it's not.
  • Add a test with progress_bar=True (it's the default now, so maybe a test isn't needed?)

            for remaining_rows in load.load_chunks(
                    self.client, dataframe, dataset_id, table_id,
                    chunksize=chunksize, schema=schema):
            chunks = load.load_chunks(self.client, dataframe, dataset_id, table_id,

E501 line too long (83 > 79 characters)
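One way to keep that call under the 79-character limit is to wrap it across lines; a sketch using only the arguments visible in the diff above, not necessarily the exact fix that was pushed:

    chunks = load.load_chunks(
        self.client, dataframe, dataset_id, table_id,
        chunksize=chunksize, schema=schema)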

@@ -861,7 +864,7 @@ def read_gbq(query, project_id=None, index_col=None, col_order=None,

def to_gbq(dataframe, destination_table, project_id, chunksize=None,
           verbose=None, reauth=False, if_exists='fail', private_key=None,
           auth_local_webserver=False, table_schema=None):
           auth_local_webserver=False, table_schema=None, progress_bar=False):
Contributor:

Could you add this to the doc-string?

Contributor Author:

Yes, sure. I'll add it; I wanted to get feedback before making it ready for merging.

@max-sixty (Contributor) left a review comment:

This looks great! I didn't know about the library but it looks good.

Could you please add a test with progress_bar=True?

Could you add a short note to whatsnew? Feel free to add yourself as the contributor.

You could have a default of None, and then use tqdm if it's available and not if it's not (see the sketch after this comment). But no strong view from me.

@tswast is there anywhere we list optional dependencies?

Thanks @aktech
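A minimal sketch of that default-of-None idea (the helper name is hypothetical; this is not the code that was merged):

    def _use_progress_bar(progress_bar, tqdm_module):
        # tqdm_module is the imported tqdm module, or None when tqdm is not
        # installed; progress_bar=None means "show a bar only if available".
        if progress_bar is None:
            return tqdm_module is not None
        return bool(progress_bar)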

@aktech (Contributor Author) commented Apr 19, 2018:

Thanks for the quick review, @maxim-lian. I will make the changes soon.

@@ -2,3 +2,4 @@ pandas
google-auth
google-auth-oauthlib
google-cloud-bigquery
tqdm
Collaborator:

requirements.txt is only used by developers / CI testing. You need to add this to setup.py for it to actually get listed as a dependency.

As @maxim-lian says, we could make this an optional dependency. If you go that route, use the "extras" section of setup.py (example).
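For reference, a sketch of declaring tqdm as an optional dependency via setuptools' extras_require (most setup() arguments are elided and the extra's name is illustrative, not taken from the repository's actual setup.py):

    from setuptools import setup

    setup(
        name='pandas-gbq',
        # ... version, packages, description, etc. elided ...
        install_requires=[
            'pandas',
            'google-auth',
            'google-auth-oauthlib',
            'google-cloud-bigquery',
        ],
        extras_require={
            # installed with: pip install pandas-gbq[tqdm]
            'tqdm': ['tqdm'],
        },
    )

A base install then stays lean, and users who want the progress bar opt in explicitly.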

                    schema=schema)
                if progress_bar:
                    from tqdm import tqdm
                    chunks = tqdm(chunks)

Collaborator:

In many cases there will be just one chunk, even for quite large dataframes. To be a useful progress indicator, this change should manually poll to see if the load job is complete as is done for query jobs:

https://github.com/pydata/pandas-gbq/blob/8f49ec3f134cc0a85f541116c42647c515fdc7e6/pandas_gbq/gbq.py#L504-L522

@aktech (Contributor Author) commented Apr 24, 2018:

Well, tqdm can't do anything in this case. It's more useful for a larger number of chunks. The percentage done is based on the proportion of chunks completed with respect to the total number of chunks.

Collaborator:

I'd caution against this change in that case. Load jobs are a limited resource (1,000 per day), so I'd rather encourage people to use fewer load jobs than more. Chunks should only be used when there is extreme memory pressure and a dataframe does not fit into memory if we try to serialize to CSV before upload.

@max-sixty (Contributor):
I just tried tqdm locally and it's v good - much nicer than the current logging. I vote that we at least enable-by-default-if-installed.

@aktech (Contributor Author) commented Apr 22, 2018:

> I just tried tqdm locally and it's v good - much nicer than the current logging. I vote that we at least enable-by-default-if-installed.

Indeed.

logger.info("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
try:
import tqdm

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

F401 'tqdm' imported but unused

logger.info("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
try:
import tqdm # noqa
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this dependency is optional, there's no need to put it in a function. The only reason the google-cloud-bigquery imports are in a function is that they are required and it helps to make a better error message. Ideally there should be no errors if tqdm is not installed.

Example of optional import:

https://github.com/GoogleCloudPlatform/google-cloud-python/blob/ecb501c1d8b099af1d804ad1cfd2cb96575faf3e/bigquery/google/cloud/bigquery/table.py#L24-L27
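Adapted for tqdm, that module-level pattern would look roughly like this (a sketch, not the exact lines that were merged):

    try:
        import tqdm  # optional progress-bar dependency
    except ImportError:  # pragma: NO COVER
        tqdm = None

Downstream code can then just test the module object, e.g. if progress_bar and tqdm:, which is what a later revision of this diff does.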

Contributor:

(didn't see this before my comment)

logger.info("\rLoad is {0}% Complete".format(
((total_rows - remaining_rows) * 100) / total_rows))
except self.http_error as ex:
self.process_http_error(ex)

logger.info("\n")

def _check_if_tqdm_exists(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mean to slow you down with tweaks. That said, there's no need for this to be a method - you can have this inline like https://github.com/pandas-dev/pandas/blob/73085773998e12e85f1044771069f63f4e8d65ad/pandas/tests/generic/test_frame.py#L23

@max-sixty (Contributor):

@tswast do you know whether these are failures from master rather than @aktech's work? https://travis-ci.org/pydata/pandas-gbq/jobs/370512748

@tswast (Collaborator) commented Apr 24, 2018:

@maxim-lian Those failures look like a bad version of setuptools is failing to install the google-api-core namespaced package. Probably unrelated to this change.

                if progress_bar and self._check_if_tqdm_exists():
                    from tqdm import tqdm
                    chunks = tqdm(chunks)
                for remaining_rows in chunks:
                    logger.info("\rLoad is {0}% Complete".format(

Contributor:

@aktech we can change this in a follow-up PR if you prefer, but I don't think we need to log in the loop now that we have tqdm.

From my local testing, printing anything while tqdm is running makes tqdm much worse.

Contributor Author:

Yeah, indeed. I forgot to remove that.
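As an aside (not something done in this PR): when a message really has to be emitted while a bar is active, tqdm's tqdm.write prints it above the bar instead of corrupting it. A minimal sketch:

    import time

    from tqdm import tqdm

    for chunk_number in tqdm(range(3), desc='chunks'):
        # tqdm.write prints the message without breaking the bar display
        tqdm.write('loaded chunk {0}'.format(chunk_number))
        time.sleep(0.1)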

@tswast (Collaborator) commented Apr 24, 2018:

I actually don't think this change addresses #162 for the reasons I state in #166 (comment)

Since there is already a logging message for percentage based on chunks, I believe #162 is for the default (and encouraged) case when there is just one chunk. To show progress in the case of a single chunk, one must manually poll to see if the load job is complete as is done for query jobs:

https://github.com/pydata/pandas-gbq/blob/8f49ec3f134cc0a85f541116c42647c515fdc7e6/pandas_gbq/gbq.py#L504-L522

Why do I say a single chunk should be encouraged? Load jobs are a limited resource (1,000 per day), so I'd rather encourage people to use fewer load jobs than more. Chunks should only be used when there is extreme memory pressure and a dataframe does not fit into memory if we try to serialize to CSV before upload. Possibly we should even deprecate the chunks option for these reasons.

@max-sixty (Contributor):

Right, I think you could still use this for one chunk - though it would be a counter rather than a progress bar (unless anyone can find a way of getting actual progress?)

Something like:

import datetime
import time

from tqdm import tqdm


def wait_for_job(job, timeout_in_seconds=600):
    # https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/snippets.py
    start = datetime.datetime.now()
    timeout = start + datetime.timedelta(0, timeout_in_seconds)
    with tqdm(
            bar_format='Waiting for {desc} Elapsed: {elapsed}',
            total=10000) as progress:
        while True:
            job.reload()  # Refreshes the state via a GET request.
            progress.set_description(str(job))
            if job.state == 'DONE':
                if job.error_result:
                    raise RuntimeError(job.errors)
                progress.bar_format = 'Completed {desc}. Elapsed: {elapsed}'
                return
            if datetime.datetime.now() > timeout:
                raise IOError
            time.sleep(1)

@aktech force-pushed the progress_bar branch 2 times, most recently from 2dfd111 to 2d69758 on April 27, 2018 15:10.
@tswast (Collaborator) left a review comment:

LGTM. Thanks for the update.

One last thing: could you add to the changelog?

@aktech (Contributor Author) commented Apr 27, 2018:

@tswast The progress of an individual chunk has not been added yet. Do you want to merge it without it?

                    schema=schema)
                if progress_bar and tqdm:
                    chunks = tqdm.tqdm(chunks)
                for remaining_rows in chunks:
                    logger.info("\rLoad is {0}% Complete".format(

@max-sixty (Contributor) commented Apr 27, 2018:

I thought we were going to either use tqdm or log, rather than both?
But a future PR is OK if you want to get this in.

Contributor Author:

Yes, we can do it in another PR.

@tswast (Collaborator) commented Apr 27, 2018:

@aktech Yes, we can merge without it, since progress for an individual chunk is different enough from this change to warrant a second PR.

@max-sixty (Contributor):
Looks like GH thought I was still requesting changes. I just clicked the button.

@aktech we'll merge now, and then we can add any improvements in the future?

@aktech (Contributor Author) commented Apr 27, 2018:

@maxim-lian @tswast sure.

@tswast merged commit d038ace into googleapis:master on Apr 27, 2018.