
Pandas should get the schema from bigquery if pushing to a table that already exists #315

Closed
ShantanuKumar opened this issue Mar 7, 2020 · 6 comments · Fixed by #318
Labels
type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@ShantanuKumar
Contributor

Right now, when pushing new data to an already existing table using to_gbq with if_exists=append but no explicit table_schema, pandas generates a default table schema in which the mode of every column (either REQUIRED or NULLABLE) is always NULLABLE.

It would make sense for pandas to fetch the existing schema and apply it when if_exists=append, instead of always passing NULLABLE mode.
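To illustrate the behavior described above, here is a hypothetical minimal re-creation of the default schema generation (the real logic lives in pandas_gbq; the helper name and dtype mapping here are illustrative, not the library's actual code). Note the mode is hard-coded to NULLABLE:

```python
import pandas as pd

# Rough mapping from pandas dtype "kind" codes to BigQuery types.
_DTYPE_KIND_TO_BQ = {
    "i": "INTEGER",
    "u": "INTEGER",
    "f": "FLOAT",
    "b": "BOOLEAN",
    "M": "TIMESTAMP",
    "O": "STRING",
}

def default_schema(df):
    """Build the schema pandas-gbq would send when none is supplied."""
    return [
        {
            "name": col,
            "type": _DTYPE_KIND_TO_BQ.get(df[col].dtype.kind, "STRING"),
            # Always NULLABLE, even if the destination column is REQUIRED.
            "mode": "NULLABLE",
        }
        for col in df.columns
    ]

df = pd.DataFrame({
    "event_ts": pd.to_datetime(["2020-03-03 01:00:00", "2020-03-03 02:00:00"]),
    "event_type": ["buy", "sell"],
})
```

With this sketch, `default_schema(df)` marks both columns NULLABLE regardless of what the destination table declares, which is the mismatch this issue is about.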

@tswast
Collaborator

tswast commented Mar 9, 2020

pandas-gbq already calls get_table to check if a table exists. I think this could be a matter of plumbing the right information through.

Note: you'll need to filter out any columns that aren't in the DataFrame.
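The filtering step mentioned above could look something like this (a sketch with hypothetical names, not pandas-gbq's actual plumbing): keep only the remote fields whose names appear in the DataFrame.

```python
def subset_schema(remote_fields, df_columns):
    """Keep only the remote schema fields present in the DataFrame."""
    wanted = set(df_columns)
    return [field for field in remote_fields if field["name"] in wanted]

# Example: the table has an extra column the DataFrame doesn't carry.
remote = [
    {"name": "event_ts", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "event_type", "type": "STRING", "mode": "NULLABLE"},
    {"name": "ingested_at", "type": "TIMESTAMP", "mode": "NULLABLE"},
]
```

Here `subset_schema(remote, ["event_ts", "event_type"])` drops `ingested_at` while preserving the REQUIRED mode on `event_ts`.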

@tswast tswast added the type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. label Mar 9, 2020
@tswast
Collaborator

tswast commented Mar 9, 2020

@ShantanuKumar Would you mind sharing an example of when the default schema gets it wrong? That will be useful for testing.

@ShantanuKumar
Contributor Author

ShantanuKumar commented Mar 10, 2020

Imagine there is a table called event_data with this schema

[
    {
        "name": "event_ts",
        "type": "TIMESTAMP",
        "mode": "REQUIRED"
    },
    {
        "name": "event_type",
        "type": "STRING",
        "mode": "NULLABLE"
    }
]

Now, when we have some dataframe

df = pd.DataFrame({
    "event_ts": ["2020-03-03 01:00:00", "2020-03-03 02:00:00"],
    "event_type": ["buy", "sell"]
})

When I push this data to the table event_data, using

df.to_gbq(
    destination_table="event_data",
    project_id=PROJECT_ID,
    if_exists="append",
)

I get this error

pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

This is happening because of the REQUIRED mode on event_ts, which pandas isn't obeying; it pushes its own schema in which event_ts has mode NULLABLE.

@ShantanuKumar
Contributor Author

ShantanuKumar commented Apr 25, 2020

@tswast What's the reason behind ignoring the mode when comparing schema?

https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L646

The issue I have right now is a mismatching mode when appending data to an existing table. pandas-gbq by default always pushes NULLABLE mode, which causes an error if the existing table has a column with REQUIRED mode.

Also, I think this issue should be marked as a BUG.
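The fix being asked for can be sketched as a small merge step (hypothetical helper name; the actual fix landed in #318 with its own implementation): after generating the default schema, overwrite each field's mode with the mode BigQuery reports for the same-named column in the existing table.

```python
def apply_remote_modes(generated_fields, remote_fields):
    """Make generated fields inherit the mode of same-named remote fields."""
    remote_modes = {f["name"]: f.get("mode", "NULLABLE") for f in remote_fields}
    return [
        dict(field, mode=remote_modes.get(field["name"], field["mode"]))
        for field in generated_fields
    ]

generated = [  # what pandas-gbq produces by default
    {"name": "event_ts", "type": "TIMESTAMP", "mode": "NULLABLE"},
    {"name": "event_type", "type": "STRING", "mode": "NULLABLE"},
]
remote = [  # what get_table reports for the destination
    {"name": "event_ts", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "event_type", "type": "STRING", "mode": "NULLABLE"},
]
```

After `apply_remote_modes(generated, remote)`, the event_ts field carries REQUIRED, so the load job's schema matches the destination.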

@tswast
Collaborator

tswast commented Apr 29, 2020

I made a test corresponding to the example you provided in #315 (comment), but it still fails due to different types. It is expected that the pandas.Timestamp dtype is used for uploading to TIMESTAMP columns. Please open a separate feature request if the type handling is a problem for you.

Let's use this issue to track the problem of different modes (REQUIRED vs. NULLABLE).

def test_to_gbq_does_not_override_type(gbq_table, gbq_connector):
    table_id = "test_to_gbq_does_not_override_type"
    table_schema = {
        "fields": [
            {
                "name": "event_ts",
                "type": "TIMESTAMP",
            },
            {
                "name": "event_type",
                "type": "STRING",
            },
        ]
    }
    df = DataFrame({
        "event_ts": ["2020-03-03 01:00:00", "2020-03-03 02:00:00"],
        "event_type": ["buy", "sell"]
    })

    gbq_table.create(table_id, table_schema)
    gbq.to_gbq(
        df,
        "{0}.{1}".format(gbq_table.dataset_id, table_id),
        project_id=gbq_connector.project_id,
        if_exists="append",
    )

    actual = gbq_table.schema(table_id)
    assert table_schema["fields"] == actual

@ShantanuKumar
Contributor Author

ShantanuKumar commented Apr 30, 2020

It actually works for TIMESTAMP too! We just needed to cast the string timestamps to datetime:

def test_to_gbq_does_not_override_type(gbq_table, gbq_connector):
    table_id = "test_to_gbq_does_not_override_type"
    table_schema = {
        "fields": [
            {
                "name": "event_ts",
                "type": "TIMESTAMP",
                "mode": "REQUIRED",
                "description": "event_ts",
            },
            {
                "name": "event_type",
                "type": "STRING",
                "mode": "NULLABLE",
                "description": "event_type",
            },
        ]
    }
    df = DataFrame({
        "event_ts": [pandas.to_datetime("2020-03-03 01:00:00"),
                     pandas.to_datetime("2020-03-03 01:00:00")],
        "event_type": ["buy", "sell"]
    })

    gbq_table.create(table_id, table_schema)
    gbq.to_gbq(
        df,
        "{0}.{1}".format(gbq_table.dataset_id, table_id),
        project_id=gbq_connector.project_id,
        if_exists="append",
    )

    actual = gbq_table.schema(table_id)
    assert table_schema["fields"] == actual

Imperssonator added a commit to Imperssonator/python-bigquery-pandas that referenced this issue Sep 21, 2023
…ists="replace" (googleapis#670)

A previous issue (googleapis#315) noted that when using `if_exists="append"`, pandas should retrieve the existing schema from BQ and update the user-supplied `table_schema` to ensure that all column modes (REQUIRED/NULLABLE) match those of the existing table.

The commit that fixed this issue applies this policy to *all* write_dispositions, but it is only applicable when appending. When replacing, the user should be able to overwrite with new datatypes.
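The corrected policy the commit describes could be sketched like this (hypothetical helper, not the library's actual code): inherit remote modes only when appending; when replacing, the user-supplied or generated schema should win so column types and modes can change.

```python
def schema_for_write(generated_fields, remote_fields, if_exists):
    """Reuse the existing table's modes only for if_exists="append"."""
    if if_exists != "append" or remote_fields is None:
        return generated_fields  # replace/fail: keep the new schema as-is
    remote_modes = {f["name"]: f.get("mode", "NULLABLE") for f in remote_fields}
    return [
        dict(field, mode=remote_modes.get(field["name"], field["mode"]))
        for field in generated_fields
    ]

generated = [{"name": "event_ts", "type": "TIMESTAMP", "mode": "NULLABLE"}]
remote = [{"name": "event_ts", "type": "TIMESTAMP", "mode": "REQUIRED"}]
```

With this gate, `schema_for_write(generated, remote, "append")` yields REQUIRED for event_ts, while `schema_for_write(generated, remote, "replace")` leaves the new NULLABLE schema untouched.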