Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Structs lack proper names as dicts and arrays get turned into array of dicts #23

Closed
jasonqng opened this issue Mar 22, 2017 · 17 comments
Closed
Labels
help wanted We'd love to have community involvement on this issue. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@jasonqng
Copy link
Contributor

jasonqng commented Mar 22, 2017

Version 0.1.4

This query returns a improperly named dict:

q = """
select struct(a,b) col
from
(SELECT 1 a, 2 b)
"""
df = gbq.read_gbq(q, dialect='standard', verbose=False)

image

Compare with result from Big Query:
image

An array of items also get turned into a arrays of dicts sometimes. For example:

q = """
select array_agg(a)
from
(select "1" a UNION ALL select "2" a)
"""
gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')

outputs:
image

Compare to Big Query:
image

These issues may or may not be related?

@jasonqng jasonqng changed the title Structs lack proper names as dicts and lists get turned into dicts Structs lack proper names as dicts and arrays get turned into dicts Mar 22, 2017
@jasonqng jasonqng changed the title Structs lack proper names as dicts and arrays get turned into dicts Structs lack proper names as dicts and arrays get turned into array of dicts Mar 22, 2017
@parthea parthea added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. help wanted We'd love to have community involvement on this issue. labels Mar 22, 2017
@jreback
Copy link
Contributor

jreback commented Mar 22, 2017

@jasonqng if you wouldn't mind debugging this locally and see what the raw returns from bq are for these cases (and show here)?

IOW about here: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L607

@jasonqng
Copy link
Contributor Author

jasonqng commented Mar 22, 2017

This is what I get back when running from bq command line:

query --use_legacy_sql=False select struct(a,b) col from (SELECT 1 a, 2 b)
+-------------------+
|        col        |
+-------------------+
| {"a":"1","b":"2"} |
+-------------------+

@jasonqng
Copy link
Contributor Author

jasonqng commented Mar 22, 2017

What I get for the second case, I get a proper list/array:

query --use_legacy_sql=False select array_agg(a) from (select "1" a UNION ALL select "2" a)
+--------------+
|     f0_      |
+--------------+
| [u'1', u'2'] |
+--------------+

I'll tinker around a bit more tomorrow. But looks like this shouldn't be too hard to fix. 🤞

@parthea
Copy link
Contributor

parthea commented Mar 22, 2017

Including the result from the api log as well (using bq --apilog=-):

{
 "kind": "bigquery#getQueryResultsResponse",
 "etag": "",
 "schema": {
  "fields": [
   {
    "name": "col",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
     {
      "name": "a",
      "type": "INTEGER",
      "mode": "NULLABLE"
     },
     {
      "name": "b",
      "type": "INTEGER",
      "mode": "NULLABLE"
     }
    ]
   }
  ]
 },
 "jobReference": {
  "projectId": "x",
  "jobId": "bqjob_x"
 },
 "totalRows": "1",
 "rows": [
  {
   "f": [
    {
     "v": {
      "f": [
       {
        "v": "1"
       },
       {
        "v": "2"
       }
      ]
     }
    }
   ]
  }
 ],
 "totalBytesProcessed": "0",
 "jobComplete": true,
 "cacheHit": true
}

Full bq command:

bq query --apilog=- --format=prettyjson --use_legacy_sql=False  "select struct(a,b) col from (SELECT 1 a, 2 b)"

@jreback
Copy link
Contributor

jreback commented Mar 22, 2017

So this is not the correct way to do this. You would actually construct directly the result, but an idea. You will have to figure out semantically what the result is and how to describe this most usefully while representing the structure. Probably a MultiIndex on the columns makes sense here.

In [27]: df = DataFrame({'col':[{"a":"1","b":"2"}]})

In [28]: df
Out[28]: 
                    col
0  {'b': '2', 'a': '1'}

In [29]: df.unstack().apply(Series)
Out[29]: 
       a  b
col 0  1  2

Keep in mind that you want to the most natural representation. If its not obvious, then don't do it.

@jasonqng
Copy link
Contributor Author

jasonqng commented Mar 22, 2017

@jreback Not sure what you mean that this is not the correct way to do this. Certainly, the output you produce is one way of outputting the results (in a flattened way) and it can definitely be produced once you have the result in proper dict form. But as seen in the bq command line output, a dict is the expected output of col as opposed to the current behavior where the field name is lost altogether. Poked around and it looks like the field type of a struct in the schema returned from bq is RECORD; if I'm reading this right, it's just a matter of adding logic to replace the v keys to the corresponding fields["name"] from the schema.

Output:
image

Similar debugging info for the latter query above where the returned column is an array of two strings:

schema {u'fields': [{u'type': u'STRING', u'name': u'f0_', u'mode': u'REPEATED'}]}
rows [{u'f': [{u'v': [{u'v': u'1'}, {u'v': u'2'}]}]}]
field_value [{u'v': u'1'}, {u'v': u'2'}]
field_type STRING
page_array [([{u'v': u'1'}, {u'v': u'2'}],)]

@jreback
Copy link
Contributor

jreback commented Mar 22, 2017

@jasonqng I mean the idea here is to preserve the structure, but at the same time make it useful.

As far as contstruction you also might be able to directly construct with DataFrame.from_records; this is how we do this from SQL (where each row is a record).

@jreback
Copy link
Contributor

jreback commented Apr 3, 2017

@jasonqng want to take a crack at this?

@jasonqng
Copy link
Contributor Author

jasonqng commented Apr 4, 2017

@jreback Sure, have a hack day on Friday, hopefully will have time to give it a shot. Thanks.

@jreback
Copy link
Contributor

jreback commented Apr 4, 2017

great!

@jasonqng
Copy link
Contributor Author

jasonqng commented Apr 7, 2017

Just curious, how wedded is this project to google-api-python-client? I started to code up something for converting the mangled struct into the correct schema, but recalled that the google-cloud-python package doesn't have this issue, they already do a nice clean up of the field names for structs and arrays. Try for example:

from google.cloud import bigquery
bigquery_client = bigquery.Client()
q = """
select ROW_NUMBER() over () row_num, struct(a,b) col
from
(select * from
    (SELECT 1 a, 2 b)
    UNION ALL
    (SELECT 5 a, 6 b)
)
"""
table = bigquery.query.QueryResults(query=q,client=bigquery_client)
table.use_legacy_sql = False
table.run()
data = table.fetch_data()[0]
print data
columns=[field.name for field in table.schema]
print columns
df = pd.DataFrame(data=data,columns=columns)
df

Output:

[(2, {u'a': 5, u'b': 6}), (1, {u'a': 1, u'b': 2})]
[u'row_num', u'col']
row_num col
2 {u'a': 5, u'b': 6}
1 {u'a': 1, u'b': 2}

@jasonqng
Copy link
Contributor Author

jasonqng commented Apr 7, 2017

Same goes for arrays:

from google.cloud import bigquery
bigquery_client = bigquery.Client()
q = """
select array_agg(a) my_list
from
(select "1" a UNION ALL select "2" a)
"""
table = bigquery.query.QueryResults(query=q,client=bigquery_client)
table.use_legacy_sql = False
table.run()
data = table.fetch_data()[0]
columns=[field.name for field in table.schema]
df = pd.DataFrame(data=data,columns=columns)
df
my_list
[1, 2]

Looking at https://developers.google.com/api-client-library/python/apis/bigquery/v2, Google recommends switching over to google-cloud-python for new projects. Perhaps it might save some heartache to slowly transition pandas-gbq over if more and more energy goes toward google-cloud-python? I'm happy to swap out the code in read_gbq() over to google.cloud if that's helpful. As you can see from the above code, it's pretty straightforward and could easily be adapted to handle the current functionality of read_gbq().

@jreback
Copy link
Contributor

jreback commented Apr 7, 2017

@jasonqng love to make the switch, The only blocker would be that ideally would want the deps built on conda-forge (Its not very hard), though they seem to have lots of mini-libraries which is a bit odd.

@jasonqng
Copy link
Contributor Author

jasonqng commented Apr 7, 2017

@jreback I'm not as familiar with dealing with conda and dependency stuff, so would it make sense for me to go ahead and open a PR for switching over read_gbq() and I'll let you handle the other end?

@jreback
Copy link
Contributor

jreback commented Apr 7, 2017

sure i can fix that

@tswast
Copy link
Collaborator

tswast commented Dec 20, 2017

This was (likely) fixed by #25, but I'd like to make sure we have a test in the test suite before closing it out.

@jasonqng
Copy link
Contributor Author

PR (and bugfix) for tests here: #101

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted We'd love to have community involvement on this issue. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.
Projects
None yet
Development

No branches or pull requests

4 participants