Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add filters and columns arguments to read_gbq for enhanced data querying #198

Merged
merged 23 commits into from
Dec 13, 2023

Conversation

Genesis929
Copy link
Collaborator

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes b/299514019 🦕

@Genesis929 Genesis929 requested review from a team as code owners November 13, 2023 14:29
@Genesis929 Genesis929 requested a review from shobsi November 13, 2023 14:29
@product-auto-label product-auto-label bot added size: m Pull request size is medium. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels Nov 13, 2023
@Genesis929 Genesis929 requested a review from tswast November 13, 2023 15:55
@@ -486,6 +486,7 @@ def read_gbq(
index_col: Iterable[str] | str = (),
col_order: Iterable[str] = (),
max_results: Optional[int] = None,
filters: Optional[List[Tuple]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this an Iterable[Tuple] instead. I presume empty list of filters is semantically the same as None, right?

Also, we can document the supported operators with a Literal type.

Suggested change
filters: Optional[List[Tuple]] = None,
filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@@ -284,9 +284,11 @@ def read_gbq(
index_col: Iterable[str] | str = (),
col_order: Iterable[str] = (),
max_results: Optional[int] = None,
filters: Optional[List[Tuple]] = None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise, let's update these types.

Suggested change
filters: Optional[List[Tuple]] = None
filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (),

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

if (filters is None) or (len(filters) == 0):
return query_or_table

valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use pandas / python syntax. We can transform to SQL with a dictionary.

Suggested change
valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="]
valid_operators = {
"in": "IN",
"not in": "NOT IN",
"==": "=",
">": ">",
"<": "<",
">=": ">=",
"<=": "<=",
"!=": "!=",
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


where_clause = " WHERE " + " OR ".join(grouped_expressions)

full_query = f"SELECT * FROM {sub_query} AS sub{where_clause}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to include the column filter here too, please.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant using col_order which is already used as a column filter.

value_list = ", ".join(
[f'"{v}"' if isinstance(v, str) else str(v) for v in value]
)
expression = f"{column} {operator} ({value_list})"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column names could contain spaces. We need to enclose them in backticks.

Suggested change
expression = f"{column} {operator} ({value_list})"
expression = f"`{column}` {operator} ({value_list})"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

f"Value for operator {operator} should be a list."
)
value_list = ", ".join(
[f'"{v}"' if isinstance(v, str) else str(v) for v in value]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[f'"{v}"' if isinstance(v, str) else str(v) for v in value]
[repr(v) for v in value]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 366 to 367
value = f'"{value}"' if isinstance(value, str) else value
expression = f"{column} {operator} {value}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column names could contain spaces. We need to enclose them in backticks.

Suggested change
value = f'"{value}"' if isinstance(value, str) else value
expression = f"{column} {operator} {value}"
expression = f"`{column}` {operator} {repr(value)}"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 342 to 344
if not isinstance(group, list):
raise ValueError("Each filter group should be a list.")

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the double nesting?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see: it's for the OR. You can ignore this feedback. :-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the list names to or_expressions and and_expressions, to make the logic more clear.

Comment on lines 326 to 327
if not isinstance(filters, list):
raise ValueError("Filters should be a list.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to check this. Any Iterable should be fine.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 329 to 335
if not (
all(isinstance(item, list) for item in filters)
or all(isinstance(item, tuple) for item in filters)
):
raise ValueError(
"All items in filters should be either all lists or all tuples."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to just catch this when we encounter it so we can let them know which item was incorrect.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 357 to 360
if not isinstance(value, list):
raise ValueError(
f"Value for operator {operator} should be a list."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need this check. Any iterable should be OK.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@Genesis929 Genesis929 changed the title feat: Add filters argument to read_gbq for enhanced data querying feat: Add filters and columns arguments to read_gbq for enhanced data querying Nov 14, 2023
@@ -83,6 +84,13 @@ def read_gbq(
max_results (Optional[int], default None):
If set, limit the maximum number of rows to fetch from the
query results.
filters (List[Tuple], default []): To filter out data. Filter syntax:
[[(column, op, val), …],…] where op is [=, >, >=, <, <=, !=, in,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are doing a type annotation of List[Tuple], but [[(column, op, val), …],…] looks like a List[List[Tuple]]?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, thought I have made change here.

@@ -307,6 +309,72 @@ def read_gbq(
api_name="read_gbq",
)

def _filters_to_query(self, query_or_table, filters):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a great candidate to write unit tests (string-in-string-out). In fact, at a later point it can go in a sql helper module like bigframes/ml/sql.py.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to unit test.

FiltersType = (
Iterable[
Union[
Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this itself can be defined outside and reused for better readability

FilterType = Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]

FiltersType = Iterable[Union[FilterType, Iterable[FilterType]]]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

pytest.param(
"{scalars_table_id}",
[
(("rowindex", "not in", [0, 6])),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double parentheses here is effectively same as single parentheses, did you mean it to represent a test case with Iterable[Iterable[Tuple]] filter type, then need to add a comma - (("rowindex", "not in", [0, 6]),)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, updated.

@Genesis929 Genesis929 requested review from tswast and shobsi November 14, 2023 17:50
sets of filters through an OR operation. A single list of tuples
can also be used, meaning that no OR operation between set of
filters is to be conducted.
filters (Iterable[Iterable[[Tuple]], default ()): To filter out data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this would be Iterable[Union[Tuple, Iterable[Tuple]]]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

sets of filters through an OR operation. A single list of tuples
can also be used, meaning that no OR operation between set of
filters is to be conducted.
filters (Iterable[Iterable[[Tuple]], default ()): To filter out data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to add a code sample (in the EXAMPLES section)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample added.

@Genesis929 Genesis929 requested a review from shobsi November 20, 2023 20:36
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Genesis929 Please resolve the conflicts and ping me when this is ready for another review. Thanks.

@Genesis929 Genesis929 requested a review from tswast December 12, 2023 22:53
@@ -486,6 +487,8 @@ def read_gbq(
index_col: Iterable[str] | str = (),
col_order: Iterable[str] = (),
max_results: Optional[int] = None,
columns: Iterable[str] = (),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pull the columns change out into a separate PR to be reviewed outside of the filters change. IMO, columns and col_order are redundant and both should select a subset of columns. We need to keep both for compatibility see: googleapis/python-bigquery-pandas#701

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, removed.

use_cache: bool = True,
# Add a verify index argument that fails if the index is not unique.
) -> dataframe.DataFrame:
# TODO(b/281571214): Generate prompt to show the progress of read_gbq.
query_or_table = self._filters_to_query(query_or_table, columns, filters)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use col_order here. In a future PR, we can add columns as an alias for col_order. See: googleapis/python-bigquery-pandas#701

Suggested change
query_or_table = self._filters_to_query(query_or_table, columns, filters)
query_or_table = self._filters_to_query(query_or_table, col_order, filters)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, changed in read_gbq.

>>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant'])]
>>> df = bpd.read_gbq(
... "bigquery-public-data.baseball.games_wide",
... columns=columns,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future PR, let's make columns an alias for col_order.

Suggested change
... columns=columns,
... col_order=columns,

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

@tswast tswast merged commit 034f71f into main Dec 13, 2023
15 checks passed
@tswast tswast deleted the b299514019-read-gbq-filter branch December 13, 2023 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: m Pull request size is medium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants