Structs lack proper names as dicts and arrays get turned into array of dicts #23

jasonqng · 2017-03-22T15:32:44Z

Version 0.1.4

This query returns a improperly named dict:

q = """
select struct(a,b) col
from
(SELECT 1 a, 2 b)
"""
df = gbq.read_gbq(q, dialect='standard', verbose=False)

Compare with result from Big Query:

An array of items also get turned into a arrays of dicts sometimes. For example:

q = """
select array_agg(a)
from
(select "1" a UNION ALL select "2" a)
"""
gbq.read_gbq(q, dialect='standard', verbose=False, project_id='project')

outputs:

Compare to Big Query:

These issues may or may not be related?

The text was updated successfully, but these errors were encountered:

jreback · 2017-03-22T17:23:35Z

@jasonqng if you wouldn't mind debugging this locally and see what the raw returns from bq are for these cases (and show here)?

IOW about here: https://github.com/pydata/pandas-gbq/blob/master/pandas_gbq/gbq.py#L607

jasonqng · 2017-03-22T21:37:42Z

This is what I get back when running from bq command line:

query --use_legacy_sql=False select struct(a,b) col from (SELECT 1 a, 2 b)
+-------------------+
|        col        |
+-------------------+
| {"a":"1","b":"2"} |
+-------------------+

jasonqng · 2017-03-22T21:40:57Z

What I get for the second case, I get a proper list/array:

query --use_legacy_sql=False select array_agg(a) from (select "1" a UNION ALL select "2" a)
+--------------+
|     f0_      |
+--------------+
| [u'1', u'2'] |
+--------------+

I'll tinker around a bit more tomorrow. But looks like this shouldn't be too hard to fix. 🤞

parthea · 2017-03-22T21:51:18Z

Including the result from the api log as well (using bq --apilog=-):

{
 "kind": "bigquery#getQueryResultsResponse",
 "etag": "",
 "schema": {
  "fields": [
   {
    "name": "col",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
     {
      "name": "a",
      "type": "INTEGER",
      "mode": "NULLABLE"
     },
     {
      "name": "b",
      "type": "INTEGER",
      "mode": "NULLABLE"
     }
    ]
   }
  ]
 },
 "jobReference": {
  "projectId": "x",
  "jobId": "bqjob_x"
 },
 "totalRows": "1",
 "rows": [
  {
   "f": [
    {
     "v": {
      "f": [
       {
        "v": "1"
       },
       {
        "v": "2"
       }
      ]
     }
    }
   ]
  }
 ],
 "totalBytesProcessed": "0",
 "jobComplete": true,
 "cacheHit": true
}

Full bq command:

bq query --apilog=- --format=prettyjson --use_legacy_sql=False  "select struct(a,b) col from (SELECT 1 a, 2 b)"

jreback · 2017-03-22T22:23:40Z

So this is not the correct way to do this. You would actually construct directly the result, but an idea. You will have to figure out semantically what the result is and how to describe this most usefully while representing the structure. Probably a MultiIndex on the columns makes sense here.

In [27]: df = DataFrame({'col':[{"a":"1","b":"2"}]})

In [28]: df
Out[28]: 
                    col
0  {'b': '2', 'a': '1'}

In [29]: df.unstack().apply(Series)
Out[29]: 
       a  b
col 0  1  2

Keep in mind that you want to the most natural representation. If its not obvious, then don't do it.

jasonqng · 2017-03-22T23:09:16Z

@jreback Not sure what you mean that this is not the correct way to do this. Certainly, the output you produce is one way of outputting the results (in a flattened way) and it can definitely be produced once you have the result in proper dict form. But as seen in the bq command line output, a dict is the expected output of col as opposed to the current behavior where the field name is lost altogether. Poked around and it looks like the field type of a struct in the schema returned from bq is RECORD; if I'm reading this right, it's just a matter of adding logic to replace the v keys to the corresponding fields["name"] from the schema.

Output:

Similar debugging info for the latter query above where the returned column is an array of two strings:

schema {u'fields': [{u'type': u'STRING', u'name': u'f0_', u'mode': u'REPEATED'}]}
rows [{u'f': [{u'v': [{u'v': u'1'}, {u'v': u'2'}]}]}]
field_value [{u'v': u'1'}, {u'v': u'2'}]
field_type STRING
page_array [([{u'v': u'1'}, {u'v': u'2'}],)]

jreback · 2017-03-22T23:27:07Z

@jasonqng I mean the idea here is to preserve the structure, but at the same time make it useful.

As far as contstruction you also might be able to directly construct with DataFrame.from_records; this is how we do this from SQL (where each row is a record).

jreback · 2017-04-03T19:00:07Z

@jasonqng want to take a crack at this?

jasonqng · 2017-04-04T23:00:55Z

@jreback Sure, have a hack day on Friday, hopefully will have time to give it a shot. Thanks.

jreback · 2017-04-04T23:23:59Z

great!

jasonqng · 2017-04-07T22:53:37Z

Just curious, how wedded is this project to google-api-python-client? I started to code up something for converting the mangled struct into the correct schema, but recalled that the google-cloud-python package doesn't have this issue, they already do a nice clean up of the field names for structs and arrays. Try for example:

from google.cloud import bigquery
bigquery_client = bigquery.Client()
q = """
select ROW_NUMBER() over () row_num, struct(a,b) col
from
(select * from
    (SELECT 1 a, 2 b)
    UNION ALL
    (SELECT 5 a, 6 b)
)
"""
table = bigquery.query.QueryResults(query=q,client=bigquery_client)
table.use_legacy_sql = False
table.run()
data = table.fetch_data()[0]
print data
columns=[field.name for field in table.schema]
print columns
df = pd.DataFrame(data=data,columns=columns)
df

Output:

[(2, {u'a': 5, u'b': 6}), (1, {u'a': 1, u'b': 2})]
[u'row_num', u'col']

row_num	col
2	{u'a': 5, u'b': 6}
1	{u'a': 1, u'b': 2}

jasonqng · 2017-04-07T23:06:01Z

Same goes for arrays:

from google.cloud import bigquery
bigquery_client = bigquery.Client()
q = """
select array_agg(a) my_list
from
(select "1" a UNION ALL select "2" a)
"""
table = bigquery.query.QueryResults(query=q,client=bigquery_client)
table.use_legacy_sql = False
table.run()
data = table.fetch_data()[0]
columns=[field.name for field in table.schema]
df = pd.DataFrame(data=data,columns=columns)
df

my_list
[1, 2]

Looking at https://developers.google.com/api-client-library/python/apis/bigquery/v2, Google recommends switching over to google-cloud-python for new projects. Perhaps it might save some heartache to slowly transition pandas-gbq over if more and more energy goes toward google-cloud-python? I'm happy to swap out the code in read_gbq() over to google.cloud if that's helpful. As you can see from the above code, it's pretty straightforward and could easily be adapted to handle the current functionality of read_gbq().

jreback · 2017-04-07T23:08:17Z

@jasonqng love to make the switch, The only blocker would be that ideally would want the deps built on conda-forge (Its not very hard), though they seem to have lots of mini-libraries which is a bit odd.

jasonqng · 2017-04-07T23:10:58Z

@jreback I'm not as familiar with dealing with conda and dependency stuff, so would it make sense for me to go ahead and open a PR for switching over read_gbq() and I'll let you handle the other end?

jreback · 2017-04-07T23:17:39Z

sure i can fix that

tswast · 2017-12-20T22:19:51Z

This was (likely) fixed by #25, but I'd like to make sure we have a test in the test suite before closing it out.

jasonqng · 2017-12-30T05:04:05Z

PR (and bugfix) for tests here: #101

jasonqng changed the title ~~Structs lack proper names as dicts and lists get turned into dicts~~ Structs lack proper names as dicts and arrays get turned into dicts Mar 22, 2017

jasonqng changed the title ~~Structs lack proper names as dicts and arrays get turned into dicts~~ Structs lack proper names as dicts and arrays get turned into array of dicts Mar 22, 2017

parthea added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. help wanted We'd love to have community involvement on this issue. labels Mar 22, 2017

jasonqng mentioned this issue Apr 7, 2017

array_agg() fails because of TypeError #22

Closed

jasonqng mentioned this issue Apr 8, 2017

ENH: Convert read_gbq() function to use google-cloud-python #25

Merged

jasonqng mentioned this issue Dec 30, 2017

BUG: Fix bug in type conversion of arrays; add array and struct tests #101

Merged

jasonqng closed this as completed Jan 14, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Structs lack proper names as dicts and arrays get turned into array of dicts #23

Structs lack proper names as dicts and arrays get turned into array of dicts #23

jasonqng commented Mar 22, 2017 •

edited

Loading

jreback commented Mar 22, 2017

jasonqng commented Mar 22, 2017 •

edited

Loading

jasonqng commented Mar 22, 2017 •

edited

Loading

parthea commented Mar 22, 2017

jreback commented Mar 22, 2017

jasonqng commented Mar 22, 2017 •

edited

Loading

jreback commented Mar 22, 2017

jreback commented Apr 3, 2017

jasonqng commented Apr 4, 2017

jreback commented Apr 4, 2017

jasonqng commented Apr 7, 2017 •

edited

Loading

jasonqng commented Apr 7, 2017

jreback commented Apr 7, 2017

jasonqng commented Apr 7, 2017

jreback commented Apr 7, 2017

tswast commented Dec 20, 2017

jasonqng commented Dec 30, 2017

Structs lack proper names as dicts and arrays get turned into array of dicts #23

Structs lack proper names as dicts and arrays get turned into array of dicts #23

Comments

jasonqng commented Mar 22, 2017 • edited Loading

jreback commented Mar 22, 2017

jasonqng commented Mar 22, 2017 • edited Loading

jasonqng commented Mar 22, 2017 • edited Loading

parthea commented Mar 22, 2017

jreback commented Mar 22, 2017

jasonqng commented Mar 22, 2017 • edited Loading

jreback commented Mar 22, 2017

jreback commented Apr 3, 2017

jasonqng commented Apr 4, 2017

jreback commented Apr 4, 2017

jasonqng commented Apr 7, 2017 • edited Loading

jasonqng commented Apr 7, 2017

jreback commented Apr 7, 2017

jasonqng commented Apr 7, 2017

jreback commented Apr 7, 2017

tswast commented Dec 20, 2017

jasonqng commented Dec 30, 2017

jasonqng commented Mar 22, 2017 •

edited

Loading

jasonqng commented Mar 22, 2017 •

edited

Loading

jasonqng commented Mar 22, 2017 •

edited

Loading

jasonqng commented Mar 22, 2017 •

edited

Loading

jasonqng commented Apr 7, 2017 •

edited

Loading