feat: Add filters and columns arguments to read_gbq for enhanced data querying #198

Genesis929 · 2023-11-13T14:29:54Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes b/299514019 🦕

tswast · 2023-11-13T16:47:35Z

bigframes/pandas/__init__.py

@@ -486,6 +486,7 @@ def read_gbq(
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None,
+    filters: Optional[List[Tuple]] = None,


Let's make this an Iterable[Tuple] instead. I presume an empty list of filters is semantically the same as None, right?

Also, we can document the supported operators with a Literal type.

Suggested change

-    filters: Optional[List[Tuple]] = None,
+    filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (),

tswast · 2023-11-13T16:48:07Z

bigframes/session/__init__.py

@@ -284,9 +284,11 @@ def read_gbq(
        index_col: Iterable[str] | str = (),
        col_order: Iterable[str] = (),
        max_results: Optional[int] = None,
+        filters: Optional[List[Tuple]] = None


Likewise, let's update these types.

Suggested change

-        filters: Optional[List[Tuple]] = None
+        filters: Iterable[Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]] = (),

tswast · 2023-11-13T16:51:36Z

bigframes/session/__init__.py

+        if (filters is None) or (len(filters) == 0):
+            return query_or_table
+
+        valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="]


Please use pandas / python syntax. We can transform to SQL with a dictionary.

Suggested change

-        valid_operators = ["IN", "NOT IN", "=", ">", "<", ">=", "<=", "!="]
+        valid_operators = {
+            "in": "IN",
+            "not in": "NOT IN",
+            "==": "=",
+            ">": ">",
+            "<": "<",
+            ">=": ">=",
+            "<=": "<=",
+            "!=": "!=",
+        }
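A minimal sketch of how such a dictionary could drive the translation. The helper name `to_sql_operator` and the error wording are assumptions for illustration, not code from this PR. The user-facing operators keep their pandas/Python spelling, and the lookup doubles as validation.

```python
# Mapping from pandas/Python-style operators to their SQL spelling,
# as proposed in the review comment.
OPERATOR_TO_SQL = {
    "in": "IN",
    "not in": "NOT IN",
    "==": "=",
    ">": ">",
    "<": "<",
    ">=": ">=",
    "<=": "<=",
    "!=": "!=",
}


def to_sql_operator(operator: str) -> str:
    """Hypothetical helper: translate an operator or reject it."""
    try:
        return OPERATOR_TO_SQL[operator]
    except KeyError:
        raise ValueError(
            f"Operator {operator!r} is not valid. "
            f"Valid operators are: {sorted(OPERATOR_TO_SQL)}"
        ) from None
```

With this shape, a SQL-style `"="` passed by the caller is rejected, while `"=="` is accepted and rendered as `=`.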

tswast · 2023-11-13T16:52:07Z

bigframes/session/__init__.py

+
+            where_clause = " WHERE " + " OR ".join(grouped_expressions)
+
+        full_query = f"SELECT * FROM {sub_query} AS sub{where_clause}"


I'd like to include the column filter here too, please.

Done, added.

Sorry, I meant using col_order, which is already used as a column filter.

tswast · 2023-11-13T16:55:46Z

bigframes/session/__init__.py

+                        value_list = ", ".join(
+                            [f'"{v}"' if isinstance(v, str) else str(v) for v in value]
+                        )
+                        expression = f"{column} {operator} ({value_list})"


Column names could contain spaces. We need to enclose them in backticks.

Suggested change

-                        expression = f"{column} {operator} ({value_list})"
+                        expression = f"`{column}` {operator} ({value_list})"

tswast · 2023-11-13T16:57:31Z

bigframes/session/__init__.py

+                                f"Value for operator {operator} should be a list."
+                            )
+                        value_list = ", ".join(
+                            [f'"{v}"' if isinstance(v, str) else str(v) for v in value]


Suggested change

-                            [f'"{v}"' if isinstance(v, str) else str(v) for v in value]
+                            [repr(v) for v in value]

tswast · 2023-11-13T16:58:40Z

bigframes/session/__init__.py

+                        value = f'"{value}"' if isinstance(value, str) else value
+                        expression = f"{column} {operator} {value}"


Column names could contain spaces. We need to enclose them in backticks.

Suggested change

-                        value = f'"{value}"' if isinstance(value, str) else value
-                        expression = f"{column} {operator} {value}"
+                        expression = f"`{column}` {operator} {repr(value)}"
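For illustration, a sketch of what the suggested f-string produces. `render_expression` is a hypothetical name, not part of the PR. Note that `repr` follows Python's quoting rules rather than a SQL escaper, so this sketch assumes trusted filter values.

```python
def render_expression(column: str, sql_operator: str, value) -> str:
    """Hypothetical helper: render one scalar comparison.

    The column is wrapped in backticks so that names containing spaces
    still form a single identifier; the value is rendered with repr().
    """
    return f"`{column}` {sql_operator} {value!r}"


print(render_expression("pitcher name", "=", "Gant"))  # `pitcher name` = 'Gant'
print(render_expression("year", "=", 2016))            # `year` = 2016
```

Strings come out single-quoted and numbers unquoted, which removes the need for the separate `isinstance(value, str)` branch.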

tswast · 2023-11-13T16:59:53Z

bigframes/session/__init__.py

+                if not isinstance(group, list):
+                    raise ValueError("Each filter group should be a list.")
+


Why the double nesting?

Oh, I see: it's for the OR. You can ignore this feedback. :-)

Changed the list names to or_expressions and and_expressions to make the logic clearer.
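A rough sketch of the double nesting discussed in this thread, assuming each inner group has already been rendered to SQL expression strings. The function name `build_where_clause` and the added parentheses are assumptions for illustration; expressions inside a group are joined with AND, and the groups are joined with OR.

```python
from typing import Iterable


def build_where_clause(filter_groups: Iterable[Iterable[str]]) -> str:
    """Hypothetical helper: OR together groups of ANDed expressions."""
    or_expressions = []
    for group in filter_groups:
        and_expressions = list(group)
        if and_expressions:
            or_expressions.append(" AND ".join(and_expressions))
    if not or_expressions:
        return ""
    # Parentheses are not required (AND binds tighter than OR in SQL),
    # but they make the grouping explicit.
    return " WHERE " + " OR ".join(f"({expr})" for expr in or_expressions)


print(build_where_clause([["`a` = 1", "`b` > 2"], ["`c` IN (3, 4)"]]))
#  WHERE (`a` = 1 AND `b` > 2) OR (`c` IN (3, 4))
```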

tswast · 2023-11-13T17:00:09Z

bigframes/session/__init__.py

+            if not isinstance(filters, list):
+                raise ValueError("Filters should be a list.")


Don't need to check this. Any Iterable should be fine.

tswast · 2023-11-13T17:00:52Z

bigframes/session/__init__.py

+            if not (
+                all(isinstance(item, list) for item in filters)
+                or all(isinstance(item, tuple) for item in filters)
+            ):
+                raise ValueError(
+                    "All items in filters should be either all lists or all tuples."
+                )


Better to just catch this when we encounter it so we can let them know which item was incorrect.
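One way this could look, as a sketch only: validate each item at the point it is consumed and name the offending position in the error. The generator name `iter_checked_filters` is hypothetical, and the sketch assumes a flat iterable of `(column, operator, value)` tuples rather than the nested form.

```python
def iter_checked_filters(filters):
    """Hypothetical helper: yield filters, failing on the first bad item."""
    for position, item in enumerate(filters):
        if not isinstance(item, tuple) or len(item) != 3:
            raise ValueError(
                f"Filter at position {position} should be a "
                f"(column, operator, value) tuple, got {item!r}."
            )
        yield item
```

Compared with an up-front `all(...)` check, the caller learns which item was wrong, and any iterable (including a generator) can be passed in.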

tswast · 2023-11-13T17:02:58Z

bigframes/session/__init__.py

+                        if not isinstance(value, list):
+                            raise ValueError(
+                                f"Value for operator {operator} should be a list."
+                            )


We don't need this check. Any iterable should be OK.
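A small sketch of why the `isinstance(value, list)` check is unnecessary: `str.join` over a generator expression works for any iterable. The name `render_value_list` is hypothetical, and rejecting a bare string is an assumption added here (a string is itself iterable, so it would otherwise be split into characters).

```python
from typing import Any, Iterable


def render_value_list(values: Iterable[Any]) -> str:
    """Hypothetical helper: render the right-hand side of IN / NOT IN."""
    if isinstance(values, (str, bytes)):
        # Assumption: a bare string is almost certainly a caller mistake.
        raise ValueError("Expected an iterable of values, got a single string.")
    return "(" + ", ".join(repr(v) for v in values) + ")"


print(render_value_list([0, 6]))                        # (0, 6)
print(render_value_list((0, 6)))                        # (0, 6)
print(render_value_list(name for name in ["John"]))     # ('John')
```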

shobsi · 2023-11-14T01:32:37Z

third_party/bigframes_vendored/pandas/io/gbq.py

@@ -83,6 +84,13 @@ def read_gbq(
            max_results (Optional[int], default None):
                If set, limit the maximum number of rows to fetch from the
                query results.
+            filters (List[Tuple], default []): To filter out data. Filter syntax:
+            [[(column, op, val), …],…] where op is [=, >, >=, <, <=, !=, in,


We are doing a type annotation of List[Tuple], but [[(column, op, val), …],…] looks like a List[List[Tuple]]?

Sorry, I thought I had made the change here.

shobsi · 2023-11-14T05:59:04Z

bigframes/session/__init__.py

@@ -307,6 +309,72 @@ def read_gbq(
                api_name="read_gbq",
            )

+    def _filters_to_query(self, query_or_table, filters):


This seems like a great candidate for unit tests (string-in-string-out). In fact, at a later point it could move into a SQL helper module like bigframes/ml/sql.py.

Moved to unit test.
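To make "string-in-string-out" concrete, here is a sketch of such a test against a simplified, hypothetical stand-in. The free function `filters_to_query` below is not the PR's `_filters_to_query`: it assumes a table ID as input, a flat list of filters combined with AND, and no column selection.

```python
OPERATORS = {
    "in": "IN", "not in": "NOT IN", "==": "=", "!=": "!=",
    ">": ">", "<": "<", ">=": ">=", "<=": "<=",
}


def filters_to_query(table_id: str, filters) -> str:
    """Hypothetical, simplified stand-in for the helper under review."""
    and_expressions = []
    for column, operator, value in filters:
        sql_operator = OPERATORS[operator]
        if operator in ("in", "not in"):
            rendered = "(" + ", ".join(repr(v) for v in value) + ")"
        else:
            rendered = repr(value)
        and_expressions.append(f"`{column}` {sql_operator} {rendered}")
    where_clause = ""
    if and_expressions:
        where_clause = " WHERE " + " AND ".join(and_expressions)
    return f"SELECT * FROM `{table_id}` AS sub{where_clause}"


def test_filters_to_query():
    # Each case is (table_id, filters, expected SQL string).
    cases = [
        ("proj.ds.tbl", [], "SELECT * FROM `proj.ds.tbl` AS sub"),
        (
            "proj.ds.tbl",
            [("rowindex", "not in", [0, 6]), ("name", "==", "Gant")],
            "SELECT * FROM `proj.ds.tbl` AS sub"
            " WHERE `rowindex` NOT IN (0, 6) AND `name` = 'Gant'",
        ),
    ]
    for table_id, filters, expected in cases:
        assert filters_to_query(table_id, filters) == expected
```

Because the function is pure, the test needs no BigQuery session and can live with the other unit tests.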

shobsi · 2023-11-14T06:05:20Z

third_party/bigframes_vendored/pandas/io/gbq.py

+FiltersType = (
+    Iterable[
+        Union[
+            Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any],


Looks like this itself can be defined outside and reused for better readability:

FilterType = Tuple[str, Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"], Any]
FiltersType = Iterable[Union[FilterType, Iterable[FilterType]]]
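As a self-contained sketch of the aliases with their imports: the extra `FilterOperator` alias and the `is_supported_operator` helper are assumptions added here to show that a `Literal` type can also be inspected at runtime with `typing.get_args`.

```python
from typing import Any, Iterable, Literal, Tuple, Union, get_args

# Hypothetical extra alias so the operator list is written only once.
FilterOperator = Literal["in", "not in", "<", "<=", "==", "!=", ">=", ">"]
FilterType = Tuple[str, FilterOperator, Any]
FiltersType = Iterable[Union[FilterType, Iterable[FilterType]]]


def is_supported_operator(operator: str) -> bool:
    """Hypothetical helper: check an operator against the Literal type."""
    return operator in get_args(FilterOperator)


print(is_supported_operator("not in"))  # True
print(is_supported_operator("="))       # False
```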

shobsi · 2023-11-14T06:12:15Z

tests/system/small/test_session.py

+        pytest.param(
+            "{scalars_table_id}",
+            [
+                (("rowindex", "not in", [0, 6])),


The double parentheses here are effectively the same as single parentheses. If you meant this to represent a test case with the Iterable[Iterable[Tuple]] filter type, you need to add a comma: (("rowindex", "not in", [0, 6]),)

Done, updated.
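The difference is easy to confirm in plain Python; the variable names below are illustrative only.

```python
# Redundant parentheses: still a single 3-element filter tuple.
single = (("rowindex", "not in", [0, 6]))

# Trailing comma: a 1-element tuple that contains the filter tuple.
nested = (("rowindex", "not in", [0, 6]),)

print(len(single))  # 3
print(len(nested))  # 1
print(nested[0] == single)  # True
```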

shobsi · 2023-11-17T01:29:44Z

third_party/bigframes_vendored/pandas/io/gbq.py

-                sets of filters through an OR operation. A single list of tuples
-                can also be used, meaning that no OR operation between set of
-                filters is to be conducted.
+            filters (Iterable[Iterable[[Tuple]], default ()): To filter out data.


Looks like this would be Iterable[Union[Tuple, Iterable[Tuple]]]

shobsi · 2023-11-17T01:31:39Z

third_party/bigframes_vendored/pandas/io/gbq.py

-                sets of filters through an OR operation. A single list of tuples
-                can also be used, meaning that no OR operation between set of
-                filters is to be conducted.
+            filters (Iterable[Iterable[[Tuple]], default ()): To filter out data.


Would be great to add a code sample (in the EXAMPLES section)

sample added.

tswast

@Genesis929 Please resolve the conflicts and ping me when this is ready for another review. Thanks.

tswast · 2023-12-13T17:08:20Z

bigframes/pandas/__init__.py

@@ -486,6 +487,8 @@ def read_gbq(
    index_col: Iterable[str] | str = (),
    col_order: Iterable[str] = (),
    max_results: Optional[int] = None,
+    columns: Iterable[str] = (),


Please pull the columns change out into a separate PR so it can be reviewed independently of the filters change. IMO, columns and col_order are redundant and both should select a subset of columns. We need to keep both for compatibility; see googleapis/python-bigquery-pandas#701.

Done, removed.

tswast · 2023-12-13T17:09:38Z

bigframes/session/__init__.py

        use_cache: bool = True,
        # Add a verify index argument that fails if the index is not unique.
    ) -> dataframe.DataFrame:
        # TODO(b/281571214): Generate prompt to show the progress of read_gbq.
+        query_or_table = self._filters_to_query(query_or_table, columns, filters)


Please use col_order here. In a future PR, we can add columns as an alias for col_order. See: googleapis/python-bigquery-pandas#701

Suggested change

-        query_or_table = self._filters_to_query(query_or_table, columns, filters)
+        query_or_table = self._filters_to_query(query_or_table, col_order, filters)

Done, changed in read_gbq.

tswast · 2023-12-13T17:10:55Z

third_party/bigframes_vendored/pandas/io/gbq.py

+            >>> filters = [('year', '==', 2016), ('pitcherFirstName', 'in', ['John', 'Doe']), ('pitcherLastName', 'in', ['Gant'])]
+            >>> df = bpd.read_gbq(
+            ...             "bigquery-public-data.baseball.games_wide",
+            ...             columns=columns,


In a future PR, let's make columns an alias for col_order.

Suggested change

-            ...             columns=columns,
+            ...             col_order=columns,
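Purely as a sketch of what "an alias" might mean in that future PR: the helper name `resolve_columns` and the choice to raise when the two arguments disagree are assumptions here, not behavior confirmed by this thread or by the linked issue.

```python
from typing import Iterable, List


def resolve_columns(
    columns: Iterable[str] = (), col_order: Iterable[str] = ()
) -> List[str]:
    """Hypothetical helper: treat columns and col_order as one setting."""
    columns, col_order = list(columns), list(col_order)
    if columns and col_order and columns != col_order:
        # Assumption: conflicting values are an error rather than merged.
        raise ValueError("Specify only one of 'columns' or 'col_order'.")
    return columns or col_order
```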

feat: Add filters argument to read_gbq for enhanced data querying

7a006b0

Genesis929 requested review from a team as code owners November 13, 2023 14:29

Genesis929 requested a review from shobsi November 13, 2023 14:29

product-auto-label bot added the size: m and api: bigquery labels Nov 13, 2023

Genesis929 and others added 2 commits November 13, 2023 14:38

feat: Add filters argument to read_gbq for enhanced data querying

37794a3

🦉 Updates from OwlBot post-processor

499bdcd

See https://github.com/googleapis/repo-automation-bots/blob/main/packages/owl-bot/README.md

Genesis929 requested a review from tswast November 13, 2023 15:55

tswast requested changes Nov 13, 2023

tswast reviewed Nov 13, 2023

gcf-owl-bot bot added 2 commits November 14, 2023 01:34

feat: Add filters and columns arguments to read_gbq for enhanced data querying

300263e

🦉 Updates from OwlBot post-processor

fdc539d

Genesis929 changed the title ~~feat: Add filters argument to read_gbq for enhanced data querying~~ feat: Add filters and columns arguments to read_gbq for enhanced data querying Nov 14, 2023

gcf-owl-bot bot and others added 6 commits November 14, 2023 01:53

feat: Add filters and columns arguments to read_gbq for enhanced data querying

6ed4194

feat: Add filters and columns arguments to read_gbq for enhanced data querying

ad6d37f

feat: Add filters and columns arguments to read_gbq for enhanced data querying

3473780

feat: Add filters and columns arguments to read_gbq for enhanced data querying

276bfd0

feat: Add filters and columns arguments to read_gbq for enhanced data querying

8a4e940

Merge branch 'main' into b299514019-read-gbq-filter

b29e9b7

shobsi reviewed Nov 14, 2023

feat: Add filters and columns arguments to read_gbq for enhanced data querying

c00a05e

Genesis929 requested review from tswast and shobsi November 14, 2023 17:50

shobsi reviewed Nov 17, 2023

gcf-owl-bot bot added 2 commits November 20, 2023 20:27

feat: Add filters and columns arguments to read_gbq for enhanced data querying

dd94369

feat: Add filters and columns arguments to read_gbq for enhanced data querying

54ca688

Genesis929 requested a review from shobsi November 20, 2023 20:36

🦉 Updates from OwlBot post-processor

95e318b

gcf-owl-bot bot added 2 commits November 20, 2023 21:22

feat: Add filters and columns arguments to read_gbq for enhanced data querying

ced491f

🦉 Updates from OwlBot post-processor

0f2840d

tswast reviewed Dec 8, 2023

Merge branch 'main' into b299514019-read-gbq-filter

771d093

Genesis929 requested a review from tswast December 12, 2023 22:53

Genesis929 and others added 2 commits December 12, 2023 23:41

update docstring

82f74fd

Merge branch 'main' into b299514019-read-gbq-filter

1c038b5

tswast requested changes Dec 13, 2023

Genesis929 added 3 commits December 13, 2023 17:58

remove columns input

354fd8e

make filter_to_query run only when there are filters

434c559

remove named input

c17b815

tswast approved these changes Dec 13, 2023

tswast merged commit 034f71f into main Dec 13, 2023
15 checks passed

tswast deleted the b299514019-read-gbq-filter branch December 13, 2023 21:34

release-please bot mentioned this pull request Dec 13, 2023

chore(main): release 0.17.0 #269

Merged
