-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add filters and columns arguments to read_gbq for enhanced data querying #198
Conversation
bigframes/pandas/__init__.py
Outdated
@@ -486,6 +486,7 @@ def read_gbq( | |||
index_col: Iterable[str] | str = (), | |||
col_order: Iterable[str] = (), | |||
max_results: Optional[int] = None, | |||
filters: Optional[List[Tuple]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this an Iterable[Tuple]
instead. I presume empty list of filters is semantically the same as None
, right?
Also, we can document the supported operators with a Literal type.
filters: Optional[List[Tuple]] = None, | |
filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
bigframes/session/__init__.py
Outdated
@@ -284,9 +284,11 @@ def read_gbq( | |||
index_col: Iterable[str] | str = (), | |||
col_order: Iterable[str] = (), | |||
max_results: Optional[int] = None, | |||
filters: Optional[List[Tuple]] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise, let's update these types.
filters: Optional[List[Tuple]] = None | |
filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
bigframes/session/__init__.py
Outdated
if (filters is None) or (len(filters) == 0): | ||
return query_or_table | ||
|
||
valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use pandas / python syntax. We can transform to SQL with a dictionary.
valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="] | |
valid_operators = { | |
"in": "IN", | |
"not in": "NOT IN", | |
"==": "=", | |
">": ">", | |
"<": "<", | |
">=": ">=", | |
"<=": "<=", | |
"!=": "!=", | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bigframes/session/__init__.py
Outdated
|
||
where_clause = " WHERE " + " OR ".join(grouped_expressions) | ||
|
||
full_query = f"SELECT * FROM {sub_query} AS sub{where_clause}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to include the column filter here too, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I meant using col_order
which is already used as a column filter.
bigframes/session/__init__.py
Outdated
value_list = ", ".join( | ||
[f'"{v}"' if isinstance(v, str) else str(v) for v in value] | ||
) | ||
expression = f"{column} {operator} ({value_list})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column names could contain spaces. We need to enclose them in backticks.
expression = f"{column} {operator} ({value_list})" | |
expression = f"`{column}` {operator} ({value_list})" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
bigframes/session/__init__.py
Outdated
f"Value for operator {operator} should be a list." | ||
) | ||
value_list = ", ".join( | ||
[f'"{v}"' if isinstance(v, str) else str(v) for v in value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[f'"{v}"' if isinstance(v, str) else str(v) for v in value] | |
[repr(v) for v in value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
bigframes/session/__init__.py
Outdated
value = f'"{value}"' if isinstance(value, str) else value | ||
expression = f"{column} {operator} {value}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column names could contain spaces. We need to enclose them in backticks.
value = f'"{value}"' if isinstance(value, str) else value | |
expression = f"{column} {operator} {value}" | |
expression = f"`{column}` {operator} {repr(value)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
bigframes/session/__init__.py
Outdated
if not isinstance(group, list): | ||
raise ValueError("Each filter group should be a list.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the double nesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see: it's for the OR. You can ignore this feedback. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the list names to or_expressions and and_expressions, to make the logic more clear.
bigframes/session/__init__.py
Outdated
if not isinstance(filters, list): | ||
raise ValueError("Filters should be a list.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't need to check this. Any Iterable should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bigframes/session/__init__.py
Outdated
if not ( | ||
all(isinstance(item, list) for item in filters) | ||
or all(isinstance(item, tuple) for item in filters) | ||
): | ||
raise ValueError( | ||
"All items in filters should be either all lists or all tuples." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to just catch this when we encounter it so we can let them know which item was incorrect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bigframes/session/__init__.py
Outdated
if not isinstance(value, list): | ||
raise ValueError( | ||
f"Value for operator {operator} should be a list." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need this check. Any iterable should be OK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -83,6 +84,13 @@ def read_gbq( | |||
max_results (Optional[int], default None): | |||
If set, limit the maximum number of rows to fetch from the | |||
query results. | |||
filters (List[Tuple], default []): To filter out data. Filter syntax: | |||
[[(column, op, val), …],…] where op is [=, >, >=, <, <=, !=, in, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are doing a type annotation of List[Tuple]
, but [[(column, op, val), …],…]
looks like a List[List[Tuple]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, thought I have made change here.
bigframes/session/__init__.py
Outdated
@@ -307,6 +309,72 @@ def read_gbq( | |||
api_name="read_gbq", | |||
) | |||
|
|||
def _filters_to_query(self, query_or_table, filters): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a great candidate to write unit tests (string-in-string-out). In fact, at a later point it can go in a sql helper module like bigframes/ml/sql.py
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to unit test.
FiltersType = ( | ||
Iterable[ | ||
Union[ | ||
Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this itself can be defined outside and reused for better readability
FilterType = Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]
FiltersType = Iterable[Union[FilterType, Iterable[FilterType]]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tests/system/small/test_session.py
Outdated
pytest.param( | ||
"{scalars_table_id}", | ||
[ | ||
(("rowindex", "not in", [0, 6])), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double parentheses here is effectively same as single parentheses, did you mean it to represent a test case with Iterable[Iterable[Tuple]]
filter type, then need to add a comma - (("rowindex", "not in", [0, 6]),)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, updated.
sets of filters through an OR operation. A single list of tuples | ||
can also be used, meaning that no OR operation between set of | ||
filters is to be conducted. | ||
filters (Iterable[Iterable[[Tuple]], default ()): To filter out data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this would be Iterable[Union[Tuple, Iterable[Tuple]]]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
sets of filters through an OR operation. A single list of tuples | ||
can also be used, meaning that no OR operation between set of | ||
filters is to be conducted. | ||
filters (Iterable[Iterable[[Tuple]], default ()): To filter out data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be great to add a code sample (in the EXAMPLES section)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Genesis929 Please resolve the conflicts and ping me when this is ready for another review. Thanks.
bigframes/pandas/__init__.py
Outdated
@@ -486,6 +487,8 @@ def read_gbq( | |||
index_col: Iterable[str] | str = (), | |||
col_order: Iterable[str] = (), | |||
max_results: Optional[int] = None, | |||
columns: Iterable[str] = (), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pull the columns
change out into a separate PR to be reviewed outside of the filters change. IMO, columns
and col_order
are redundant and both should select a subset of columns. We need to keep both for compatibility see: googleapis/python-bigquery-pandas#701
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, removed.
bigframes/session/__init__.py
Outdated
use_cache: bool = True, | ||
# Add a verify index argument that fails if the index is not unique. | ||
) -> dataframe.DataFrame: | ||
# TODO(b/281571214): Generate prompt to show the progress of read_gbq. | ||
query_or_table = self._filters_to_query(query_or_table, columns, filters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use col_order
here. In a future PR, we can add columns
as an alias for col_order
. See: googleapis/python-bigquery-pandas#701
query_or_table = self._filters_to_query(query_or_table, columns, filters) | |
query_or_table = self._filters_to_query(query_or_table, col_order, filters) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, changed in read_gbq.
>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant'])] | ||
>>> df = bpd.read_gbq( | ||
... "bigquery-public-data.baseball.games_wide", | ||
... columns=columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a future PR, let's make columns
an alias for col_order
.
... columns=columns, | |
... col_order=columns, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:
Fixes b/299514019 🦕