Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure table_schema arg is not modified inplace #278

Merged
merged 1 commit into from
May 29, 2019
Merged

Ensure table_schema arg is not modified inplace #278

merged 1 commit into from
May 29, 2019

Conversation

bsolomon1124
Copy link
Contributor

@bsolomon1124 bsolomon1124 commented May 28, 2019

Fixes #277.

If any dictionary entry in the table_schema arg did
not contain a "mode" key, a mode="NULLABLE" kv pair
would be created; because schema.update_schema()
returns an object with references to its mutable input,
this allows the argument to ultimately be modified
by the function rather than the caller.

This pattern was used in both gbq.py and load.py,
so it was refactored into a helper function in
schema.py, which now returns a modified copy.
Deepcopy is used because the input is a list of
dictionaries, so a shallow copy would be insufficient.

Copy link
Contributor

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @bsolomon1124

Re the test failing: are we sure the schema isn't change upstream of load_chunks? I can look more if it's a quandary.

tests/unit/test_gbq.py Show resolved Hide resolved
pandas_gbq/load.py Outdated Show resolved Hide resolved
@bsolomon1124
Copy link
Contributor Author

bsolomon1124 commented May 28, 2019

So, I might be thinking too hard about this one @tswast / @max-sixty , but this seems to only occur if the GBQ table does not exist. Giving myself a headache after a long day at work..

Case 1: Called when dataset.schematest does not exist and will be created; in this first case, the argument gets modified:

In [17]: original_schema = [ 
    ...:     {"name": "field1", "type": "STRING", "mode": "REQUIRED"}, 
    ...:     {"name": "field2", "type": "INTEGER"}, 
    ...:     {"name": "field3", "type": "DATE"}, 
    ...: ] 
    ...: gbq.to_gbq(df, "dataset.schematest", project_id="xxx", table_schema=original_schema, if_exists="append")                                  

In [18]: original_schema                                                                                                                                     
Out[18]: 
[{'name': 'field1', 'type': 'STRING', 'mode': 'REQUIRED'},
 {'name': 'field2', 'type': 'INTEGER', 'mode': 'NULLABLE'},
 {'name': 'field3', 'type': 'DATE', 'mode': 'NULLABLE'}]

Now, called again directly after the above, when table exists; seemingly no problem here:

In [19]: df = DataFrame( 
    ...:     { 
    ...:         "field1": ["a", "b"], 
    ...:         "field2": [1, 2], 
    ...:         "field3": [datetime.date(2019, 1, 1), datetime.date(2019, 5, 1)], 
    ...:     } 
    ...: ) 
    ...: original_schema = [ 
    ...:     {"name": "field1", "type": "STRING", "mode": "REQUIRED"}, 
    ...:     {"name": "field2", "type": "INTEGER"}, 
    ...:     {"name": "field3", "type": "DATE"}, 
    ...: ] 
    ...: gbq.to_gbq(df, "dataset.schematest", project_id="xxx", table_schema=original_schema, if_exists="append")                                  

In [20]: original_schema                                                                                                                                     
Out[20]: 
[{'name': 'field1', 'type': 'STRING', 'mode': 'REQUIRED'},
 {'name': 'field2', 'type': 'INTEGER'},
 {'name': 'field3', 'type': 'DATE'}]

...which would seem to imply something funky is happening here: https://github.com/bsolomon1124/pandas-gbq/blob/488f6b7e984757c79bbc4f59b1121fcd6e3afbbf/pandas_gbq/gbq.py#L1124, in the call to table.create().

@max-sixty
Copy link
Contributor

I think no function in this library should modify its arguments.

So we should do the deepcopy in the function that does modify them - can we track that down and apply it to that function? I think that's the way of applying @bsolomon1124 's good principle of fixing the problem at its root.

@bsolomon1124
Copy link
Contributor Author

Aha. It looks like the same pattern is repeated in https://github.com/bsolomon1124/pandas-gbq/blob/488f6b7e984757c79bbc4f59b1121fcd6e3afbbf/pandas_gbq/gbq.py#L1272. (Seeing some similar patterns in gbq.py but not sure if they are doing the same thing; see grep -nr "for field in schema" .

Fixes #277.

If any dictionary entry in the `table_schema` arg did
not contain a "mode" key, a mode="NULLABLE" kv pair
would be created; because `schema.update_schema()`
returns an object with references to its mutable input,
this allows the argument to ultimately be modified
by the function rather than the caller.

This pattern was used in both gbq.py and load.py,
so it was refactored into a helper function in
schema.py, which now returns a modified *copy*.
Deepcopy is used because the input is a list of
dictionaries, so a shallow copy would be insufficient.
@bsolomon1124
Copy link
Contributor Author

bsolomon1124 commented May 29, 2019

Ok, I think we've got it @max-sixty @tswast . See the note above regarding gbq.py and the updated commit msg, which I've squashed into one.

I do wonder if there are other more subtle cases like this hidden in gbq.py.

Also, I think that update_schema() may be returning data that references output_fields:

In other words, when return {"fields": output_fields} gets returned, output_fields is composed of elements of new_fields, which is in turn schema_new["fields"]. If any caller were to modify the result of update_schema(a, b), they might still be inadvertently modifying b. (We've 'bandaged' that by deep-copying the result of update_schema.)

@max-sixty
Copy link
Contributor

That's perfect! Ideal design and implementation.

I'll merge tomorrow unless anyone has any thoughts.

Re the output_fields - I agree, I think it would be better if that returned a new object without any objects in schema_old. I think it's less likely to hit than the current issue, but still worth adding a copy there. If you want to add a TODO / issue, please do; but don't feel we have to wait to fix to merge this.

@max-sixty max-sixty merged commit 9460ad6 into googleapis:master May 29, 2019
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@bsolomon1124 bsolomon1124 deleted the do-not-modify-schema-inplace branch May 29, 2019 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_gbq should not modify table_schema inplace
3 participants