attempting to change existing datatype on BQ table through `table_schema` and `if_exists="replace"` fails #670
Comments
I believe this is due to an extra schema-handling call: it appears that pandas-gbq now reads the existing schema from the destination table and overwrites the fields/datatypes of the supplied `table_schema`.
Further digging revealed that a previous issue (#315) noted that when using `if_exists="append"`, pandas should retrieve the existing schema from BQ and update the user-supplied `table_schema` to ensure that all column modes (REQUIRED/NULLABLE) match those of the existing table. The commit that fixed that issue applies this policy to *all* write_dispositions, but it is only applicable when appending. IMO, when replacing an entire table, the user should be able to replace the datatypes as well. This could be resolved by applying the schema update only when appending.
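A minimal sketch of the guard that suggests, written as a standalone function; the name `resolve_schema` and the list-of-dicts field format are illustrative assumptions, not pandas-gbq's actual internals:

```python
def resolve_schema(user_fields, existing_fields, if_exists):
    """Hypothetical sketch: reconcile the user-supplied schema with the
    destination table's schema only when appending; a plain replace
    keeps the user's datatypes untouched."""
    if if_exists != "append":
        return user_fields
    existing = {f["name"]: f for f in existing_fields}
    # When appending, prefer the destination table's type/mode so the
    # load job matches the existing table (the behavior #315 asked for).
    return [existing.get(f["name"], f) for f in user_fields]
```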
Hi @Imperssonator, thank you for reporting and digging into the issue! I see that you have created a commit to resolve this problem, and I think it's a pretty valid solution. Would you like to create a PR with it? If yes, generally we want new code to also include at least a unit test. Do you feel comfortable adding a unit test to the PR? (If not, that is fine, we can work together to flesh out the PR fully.)
@Linchin I'm happy to take a stab at that, although I can't say when I would have time to do so - hopefully in the next month or two. If you need this integrated more quickly, probably not best to wait on me. I'm operating on a fork so my issue is alleviated at least for now, but would like to get back to the main branch.
@Imperssonator I see, thank you for letting me know! I will be working on this then since it seems many people are affected by it.
It's strange that this was not caught by system tests.
Hello - I'm running into an issue that is proving very difficult to resolve. I have searched extensively on Stack Overflow, etc., and am now running into this in multiple instances.
When I have a table in BigQuery with an existing schema, if I upload new data from pandas with a `table_schema` that attempts to change one of the datatypes of the existing schema (let's say from STRING to INTEGER), with `if_exists="replace"`, it will fail with the following error (see stack trace below, omitting the calling function from our internal lib): `Expected a string or bytes dtype, got int64`
But the `table_schema` fields for this column, let's call it `column_1`, are `{'name': 'column_1', 'type': 'INT64'}`, and `df['column_1'].dtype` yields `'Int64'`.
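For illustration, the pyarrow error can be reproduced in isolation by forcing a STRING Arrow type onto a pandas nullable-integer column. This is a guess at the mechanism, consistent with the stale STRING schema described in this report, not a confirmed trace through pandas-gbq:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"column_1": pd.array([1, 2, 3], dtype="Int64")})
print(df["column_1"].dtype)  # Int64 (pandas nullable integer)

# Casting the integer column to an Arrow string field fails at conversion:
pa.Table.from_pandas(df, schema=pa.schema([pa.field("column_1", pa.string())]))
# pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
```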
Two interesting facts that make this harder to figure out:

1. If I omit `column_1` from the upload, it works; it replaces the table and overwrites the schema (omitting `column_1`)...
2. ...and then, when I try to overwrite again with the full dataframe (including `column_1`), that works too, and it applies the `INTEGER` datatype in BigQuery. (Note it does not apply `INT64` even though I specify that in the `table_schema`.)

This suggests to me that it's not an issue with the underlying datatypes in the dataframe or the `table_schema`, because upload to a fresh BQ table works. For some reason, `pandas-gbq` is having a hard time changing existing datatypes in GBQ schemas even with full "replace" mode. I will note that I've been running into many similar issues with DATE/TIME types since upgrading as well, so it feels unlikely to me that this is isolated to just `INT`s. (A sketch of this two-step workaround follows.)
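A minimal sketch of that two-step workaround, reusing the names `df_bq` and `table_schema_bq` from the code-example notes below; `dest` is a placeholder destination string, and this is unverified illustration rather than the reporter's code:

```python
# Step 1: replace the table without the problematic column.
df_bq.drop(columns=["column_1"]).to_gbq(dest, if_exists="replace")

# Step 2: replace again with the full dataframe; the INT64 type now sticks.
df_bq.to_gbq(dest, table_schema=table_schema_bq, if_exists="replace")
```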
Environment details
`python --version`: 3.11.4
`pip --version`: 23.2.1

Running in a conda env with:
pandas 2.1.0 pypi_0 pypi
pandas-gbq 0.19.2 pypi_0 pypi
pyarrow 13.0.0 pypi_0 pypi
google-api-core 2.11.1 pypi_0 pypi
google-api-python-client 2.98.0 pypi_0 pypi
google-auth 2.22.0 pypi_0 pypi
google-auth-httplib2 0.1.0 pypi_0 pypi
google-auth-oauthlib 1.0.0 pypi_0 pypi
google-cloud-bigquery 3.11.4 pypi_0 pypi
google-cloud-bigquery-storage 2.22.0 pypi_0 pypi
google-cloud-core 2.3.3 pypi_0 pypi
google-cloud-storage 2.10.0 pypi_0 pypi
google-crc32c 1.5.0 pypi_0 pypi
google-resumable-media 2.6.0 pypi_0 pypi
googleapis-common-protos 1.60.0 pypi_0 pypi
pydata-google-auth 1.8.2 pypi_0 pypi
Steps to reproduce
1. Start with an existing BQ table that has a column (`column_1`) of datatype "STRING"
2. Upload a dataframe (with `if_exists="replace"`) to that table where `column_1` has pandas dtype `Int64` and `table_schema` specifies `'type': 'INT64'`
3. `pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64` will be thrown

Code example
Can't provide the entire code sample, but the final upload statement looks like:
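Presumably something close to the following; the exact call is an assumption reconstructed from the names in the notes below, not the author's verbatim code:

```python
df_bq.to_gbq(
    dest_bq_dataset_table,          # "dataset.table" destination string
    table_schema=table_schema_bq,
    if_exists="replace",
)
```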
- `df_bq` has a column with `Int64` datatype
- `table_schema_bq` contains `{'name': 'column_1', 'type': 'INT64'}`
- Actual schema on `dest_bq_dataset_table` has `STRING` for `column_1`
EDIT: I wrote a really simple test case that yields the same behavior:
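A minimal reconstruction of such a test case, consistent with the surrounding description (the destination table name is a placeholder; a sketch, not the original snippet):

```python
import pandas as pd

table = "my_dataset.test_table"  # placeholder destination

# First upload: column_1 lands in BigQuery as STRING.
df = pd.DataFrame({"column_1": ["a", "b", "c"]})
df.to_gbq(table, if_exists="replace")

# Second upload: try to replace the table with an INT64 column_1.
df_int = pd.DataFrame({"column_1": pd.array([1, 2, 3], dtype="Int64")})
df_int.to_gbq(
    table,
    table_schema=[{"name": "column_1", "type": "INT64"}],
    if_exists="replace",
)
# pyarrow.lib.ArrowTypeError: Expected a string or bytes dtype, got int64
```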
The first `df.to_gbq` will work, then `df_int.to_gbq` throws the same error as above. (* The same behavior occurs whether the type is specified via `table_schema` or via `astype("Int64")`/`astype(int)` on the dataframe.)

Stack trace