Ensure table_schema arg is not modified inplace #278

bsolomon1124 · 2019-05-28T19:54:30Z

Fixes #277.

If any dictionary entry in the table_schema arg did
not contain a "mode" key, a mode="NULLABLE" kv pair
would be created; because schema.update_schema()
returns an object with references to its mutable input,
this allows the argument to ultimately be modified
by the function rather than the caller.

This pattern was used in both gbq.py and load.py,
so it was refactored into a helper function in
schema.py, which now returns a modified copy.
Deepcopy is used because the input is a list of
dictionaries, so a shallow copy would be insufficient.

max-sixty

Looks great, thanks @bsolomon1124

Re the test failing: are we sure the schema isn't change upstream of load_chunks? I can look more if it's a quandary.

tests/unit/test_gbq.py

pandas_gbq/load.py

bsolomon1124 · 2019-05-28T22:54:32Z

So, I might be thinking too hard about this one @tswast / @max-sixty , but this seems to only occur if the GBQ table does not exist. Giving myself a headache after a long day at work..

Case 1: Called when dataset.schematest does not exist and will be created; in this first case, the argument gets modified:

In [17]: original_schema = [ 
    ...:     {"name": "field1", "type": "STRING", "mode": "REQUIRED"}, 
    ...:     {"name": "field2", "type": "INTEGER"}, 
    ...:     {"name": "field3", "type": "DATE"}, 
    ...: ] 
    ...: gbq.to_gbq(df, "dataset.schematest", project_id="xxx", table_schema=original_schema, if_exists="append")                                  

In [18]: original_schema                                                                                                                                     
Out[18]: 
[{'name': 'field1', 'type': 'STRING', 'mode': 'REQUIRED'},
 {'name': 'field2', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'field3', 'type': 'DATE', 'mode': 'NULLABLE'}]

Now, called again directly after the above, when table exists; seemingly no problem here:

In [19]: df = DataFrame( 
    ...:     { 
    ...:         "field1": ["a", "b"], 
    ...:         "field2": [1, 2], 
    ...:         "field3": [datetime.date(2019, 1, 1), datetime.date(2019, 5, 1)], 
    ...:     } 
    ...: ) 
    ...: original_schema = [ 
    ...:     {"name": "field1", "type": "STRING", "mode": "REQUIRED"}, 
    ...:     {"name": "field2", "type": "INTEGER"}, 
    ...:     {"name": "field3", "type": "DATE"}, 
    ...: ] 
    ...: gbq.to_gbq(df, "dataset.schematest", project_id="xxx", table_schema=original_schema, if_exists="append")                                  

In [20]: original_schema                                                                                                                                     
Out[20]: 
[{'name': 'field1', 'type': 'STRING', 'mode': 'REQUIRED'},
 {'name': 'field2', 'type': 'INTEGER'},
 {'name': 'field3', 'type': 'DATE'}]

...which would seem to imply something funky is happening here: https://github.com/bsolomon1124/pandas-gbq/blob/488f6b7e984757c79bbc4f59b1121fcd6e3afbbf/pandas_gbq/gbq.py#L1124, in the call to table.create().

max-sixty · 2019-05-28T22:58:14Z

I think no function in this library should modify its arguments.

So we should do the deepcopy in the function that does modify them - can we track that down and apply it to that function? I think that's the way of applying @bsolomon1124 's good principle of fixing the problem at its root.

bsolomon1124 · 2019-05-29T00:03:41Z

Aha. It looks like the same pattern is repeated in https://github.com/bsolomon1124/pandas-gbq/blob/488f6b7e984757c79bbc4f59b1121fcd6e3afbbf/pandas_gbq/gbq.py#L1272. (Seeing some similar patterns in gbq.py but not sure if they are doing the same thing; see grep -nr "for field in schema" .

Fixes #277. If any dictionary entry in the `table_schema` arg did not contain a "mode" key, a mode="NULLABLE" kv pair would be created; because `schema.update_schema()` returns an object with references to its mutable input, this allows the argument to ultimately be modified by the function rather than the caller. This pattern was used in both gbq.py and load.py, so it was refactored into a helper function in schema.py, which now returns a modified *copy*. Deepcopy is used because the input is a list of dictionaries, so a shallow copy would be insufficient.

bsolomon1124 · 2019-05-29T00:34:27Z

Ok, I think we've got it @max-sixty @tswast . See the note above regarding gbq.py and the updated commit msg, which I've squashed into one.

I do wonder if there are other more subtle cases like this hidden in gbq.py.

Also, I think that update_schema() may be returning data that references output_fields:

In other words, when return {"fields": output_fields} gets returned, output_fields is composed of elements of new_fields, which is in turn schema_new["fields"]. If any caller were to modify the result of update_schema(a, b), they might still be inadvertently modifying b. (We've 'bandaged' that by deep-copying the result of update_schema.)

max-sixty · 2019-05-29T01:17:51Z

That's perfect! Ideal design and implementation.

I'll merge tomorrow unless anyone has any thoughts.

Re the output_fields - I agree, I think it would be better if that returned a new object without any objects in schema_old. I think it's less likely to hit than the current issue, but still worth adding a copy there. If you want to add a TODO / issue, please do; but don't feel we have to wait to fix to merge this.

tswast

LGTM. Thanks!

bsolomon1124 mentioned this pull request May 28, 2019

to_gbq should not modify table_schema inplace #277

Closed

max-sixty reviewed May 28, 2019

View reviewed changes

tests/unit/test_gbq.py Show resolved Hide resolved

tswast reviewed May 28, 2019

View reviewed changes

pandas_gbq/load.py Outdated Show resolved Hide resolved

max-sixty merged commit 9460ad6 into googleapis:master May 29, 2019

tswast reviewed May 29, 2019

View reviewed changes

bsolomon1124 deleted the do-not-modify-schema-inplace branch May 29, 2019 18:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure table_schema arg is not modified inplace #278

Ensure table_schema arg is not modified inplace #278

bsolomon1124 commented May 28, 2019 •

edited

Loading

max-sixty left a comment

bsolomon1124 commented May 28, 2019 •

edited

Loading

max-sixty commented May 28, 2019

bsolomon1124 commented May 29, 2019

bsolomon1124 commented May 29, 2019 •

edited

Loading

max-sixty commented May 29, 2019

tswast left a comment

Ensure table_schema arg is not modified inplace #278

Ensure table_schema arg is not modified inplace #278

Conversation

bsolomon1124 commented May 28, 2019 • edited Loading

max-sixty left a comment

Choose a reason for hiding this comment

bsolomon1124 commented May 28, 2019 • edited Loading

max-sixty commented May 28, 2019

bsolomon1124 commented May 29, 2019

bsolomon1124 commented May 29, 2019 • edited Loading

max-sixty commented May 29, 2019

tswast left a comment

Choose a reason for hiding this comment

bsolomon1124 commented May 28, 2019 •

edited

Loading

bsolomon1124 commented May 28, 2019 •

edited

Loading

bsolomon1124 commented May 29, 2019 •

edited

Loading