Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/4196 speed up search bundles #4273

Merged
merged 26 commits into from
Oct 15, 2022
Merged

Conversation

AndrewJGaut
Copy link
Contributor

@AndrewJGaut AndrewJGaut commented Oct 7, 2022

Reasons for making this change

The search_bundles function was slow primarily because it relied on nested subqueries. We updated the function to use inner joins instead, which should be much faster.

Related issues

Fixes #4196. Also addresses #3787 (see example in comment below for that function).

@percyliang
Copy link
Collaborator

Do you have any before/after timings that you can share?

@AndrewJGaut
Copy link
Contributor Author

AndrewJGaut commented Oct 7, 2022

Sure thing!

Examples of speedup (run on prod):

Command: cl search .shared data_size=.sum

Original query:
SELECT sum(q1.num) AS sum_1 FROM (SELECT DISTINCT bundle.uuid AS uuid, bundle_metadata_1.metadata_value * 1 AS num FROM bundle_metadata AS bundle_metadata_1, bundle INNER JOIN group_bundle_permission ON bundle.uuid = group_bundle_permission.object_uuid INNER JOIN user_group ON group_bundle_permission.group_uuid = user_group.group_uuid WHERE user_group.user_id='0' AND group_bundle_permission.permission >= 1 AND bundle.uuid = bundle_metadata_1.bundle_uuid AND bundle_metadata_1.metadata_key='data_size') AS q1;
7.52 seconds

Updated query:
SELECT sum(q1.num) AS sum_1 FROM (SELECT DISTINCT bundle.uuid AS uuid, bundle_metadata_1.metadata_value * 1 AS num FROM bundle_metadata AS bundle_metadata_1, bundle INNER JOIN group_bundle_permission ON bundle.uuid = group_bundle_permission.object_uuid INNER JOIN user_group ON group_bundle_permission.group_uuid = user_group.group_uuid WHERE user_group.user_id=’0’ AND group_bundle_permission.permission >= 1 AND bundle.uuid = bundle_metadata_1.bundle_uuid AND bundle_metadata_1.metadata_key='data_size') AS q1;
0.0 seconds

Command: cl search .after=2021-08-20 .count

Original query:
SELECT count(q2.uuid) AS tbl_row_count FROM (SELECT DISTINCT bundle.uuid AS uuid FROM bundle WHERE bundle.uuid IN (SELECT bundle_metadata.bundle_uuid FROM bundle_metadata WHERE bundle_metadata.metadata_key = ‘created’ AND bundle_metadata.metadata_value >= 1629417600) LIMIT 0, 18446744073709551615) AS q2;
27.74 seconds.

Updated query:
SELECT count(q1.uuid) AS tbl_row_count FROM (SELECT DISTINCT bundle.uuid AS uuid FROM bundle INNER JOIN bundle_metadata AS bundle_metadata_1 ON bundle.uuid = bundle_metadata_1.bundle_uuid WHERE bundle_metadata_1.metadata_key = ‘created’ AND bundle_metadata_1.metadata_value >= 1629417600 LIMIT 0, 18446744073709551615) AS q1
1.95 sec

@@ -0,0 +1,9 @@
{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need docker_config files? Or should we remove them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just removed them. Sorry about that!

@AndrewJGaut AndrewJGaut requested a review from teetone October 7, 2022 23:01
@teetone teetone requested a review from epicfaace October 7, 2022 23:09
@@ -0,0 +1,26 @@
"""Remove is_superuser from User table
Copy link
Collaborator

@percyliang percyliang Oct 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not part of speeding up search bundles, right? Ideally, would split the superuser thing into a different PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shoot. It is in a different PR that I submitted Thursday (still hasn't been merged b/c needs review). I must have branched off from my local branch I used for that PR when I created the branch for this PR. Sorry about that! Perhaps I should revert those changes in this branch?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can get that PR in first, since it's pretty straightforward, and then you don't have to undo anything.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll ping the chat for a review. Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be easiest to revert those changes in this branch, so it doesn't have to depend on that PR. Up to you though.

shortcuts = {'type': 'bundle_type', 'size': 'data_size', 'worksheet': 'host_worksheet'}

join_tables = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add type annotations?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added for the two arrays I'm using (joins and where_clause_conjuncts).

When I try to add type annotations for some of the other variables (like limit), I get a strange error from mypy:
Deferral trace:
codalab.model.bundle_model:409
codalab.model.bundle_model:412
codalab.model.bundle_model:414
codalab.model.bundle_model:417
codalab.model.bundle_model:419
codalab.model.bundle_model:433
codalab.model.bundle_model:438
codalab.model.bundle_model:438
codalab.model.bundle_model:438
codalab/model/bundle_model.py: error: INTERNAL ERROR: maximum semantic analysis iteration count reached

So, I've left those un-typed for now, but I'll look into this.

cl_group_bundle_permission.c.permission >= GROUP_OBJECT_PERMISSION_READ,
)
)
add_join(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember there were some subtleties with joins when the joined tables (e.g., permissions) were empty, whereas for subqueries, it was much more straightforward to reason about, so would be good to double check the edge cases.

Copy link
Contributor Author

@AndrewJGaut AndrewJGaut Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good point. So, the queries I converted to inner joins above are all OK in this regard; I verified each of them individually. Basically, the subqueries (aside from the one you highlighted in a comment above and one other which I didn’t convert) are used as conjuncts in the final WHERE clause, aren’t negated, and require the existence of something in the table in the subquery to be true. In other words, each of those subqueries is automatically false if the table is empty anyway.

This is not true for the two other queries I didn't convert earlier (the one used for .floating and one used when the querying user is not root). For those, an inner join would not work for the reason you mentioned. For instance, for the negated query, it checks that the bundle_uuid is NOT in the Worksheet_Item table, and so using an INNER JOIN would be incorrect since the inner join would be empty even though the result we really want is for every bundle_uuid to be in the result since none belong to a worksheet (since worksheet_item is empty). For this query (and for the other I mentioned), we remedy this by using a LEFT OUTER JOIN instead of an INNER JOIN.

I verified that my changes work by adding test cases. I added some for .floating, and I added some cases for cl search queries used by the non-root user. I verified these work and that, as expected, they all fail when I switch the LEFT OUTER JOIN to an INNER JOIN.

Let me know if I didn't address your comment sufficiently and/or if I missed something. Thanks!

cl_bundle.c.uuid == aliased_bundle_metadata.c.bundle_uuid,
)

where_clause = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more accurate, I'd call this where_clause_disjuncts

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! (I think they're actually conjuncts in this case since we use and_ to connect them in line 623, but please correct me if I'm wrong!)

)
clause = or_(*clause)

where_clause = []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like you can just write or_(cl_bundle..., ..., ...) directly instead of creating an array?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if where_clause is not None:
where_clauses.append(where_clause)

where_clause = and_(*where_clauses)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be confusing to reuse where_clause for both the top-level clause and the conjuncts. One might consider creating helper function(s) to break up this big function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

or_( # Join constraint (group)
cl_group_bundle_permission.c.group_uuid
== self.public_group_uuid, # Public group
cl_group_bundle_permission.c.group_uuid.in_(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't replace this subquery with a join?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. But see the comment above about INNER JOIN vs LEFT OUTER JOIN. I left this and the other subquery in the .floating case because they were harder to convert and subsequently forgot to convert them. Sorry about that!

where_clause = and_(where_clause, or_(access_via_owner, access_via_group))

table = cl_bundle
for join_table, join_condition in zip(join_tables, join_conditions):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using parallel arrays, which might be error prone, consider making a new dataclass with two fields (the table and the join).

Copy link
Contributor Author

@AndrewJGaut AndrewJGaut Oct 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@percyliang percyliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At a high-level, logic looks good, and it's great that we get such dramatic speedups! I'd just be careful about the edge cases, and we should probably add some test cases for this. Consider also splitting up the function into more modular pieces (something we've wanted to do for a while) if you think it makes sense.

@AndrewJGaut
Copy link
Contributor Author

AndrewJGaut commented Oct 9, 2022

Thanks so much for the thorough review! I added more test cases to address the edge cases you mentioned (see my comment about INNER JOIN and LEFT OUTER JOIN above).

In terms of splitting the function, I do think that makes sense and gave it some thought throughout the day. However, I'm still not sure how to do it. I'll keep thinking, but in the meantime I addressed your other comments. Thanks!

@epicfaace
Copy link
Member

0.0 seconds

Amazing!!

subquery_index = [0]

@dataclass
class JoinTable:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this class declaration outside the search_bundles function? Eventually (in a later PR probably!) we'll need to do the same thing as this PR for the search_worksheets and will end up using JoinTable for that, too (#4277).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On that point, it might be worth thinking about whether it makes sense to abstract this code out into something more general, so that it could be implemented in search_worksheets (without having to copy and paste everything again). But not necessary for this PR, but if you do think of a good way to do it that works well in this PR, do add it in!

Copy link
Contributor Author

@AndrewJGaut AndrewJGaut Oct 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I have a potential idea for abstracting the logic, but I think it might be better to do it in a separate PR. The issue is that the there aren't many shared keys between search_worksheets and search_bundles (aside from the "special" keywords at the top of the function, which can be abstracted quite easily) and even some shared keys are dealt with a little differently (e.g. the key ''), which makes it hard to abstract.

table: Table
condition: Any
left_outer_join: bool = False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! However, I think it's a bit difficult to understand how the code is working by just reading through it. Can you add a few sentences at the beginning of the method that describes the general approach this code is taking (and the role played by where_class_conjuncts, add_join, etc.)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

check_equals('', _run_command([cl, 'search', '.shared']))
# Check search with non-root user.
if not os.getenv('CODALAB_PROTECTED_MODE'):
# This test does not work when protected_mode is True.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running cl search with non-root user yields a 401 Unauthorized when in protected mode. I could solve this also by using 'cl uedit --grant-access' with the root user before switching to the non-root user instead, though. Perhaps that's better.

Copy link
Collaborator

@percyliang percyliang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@AndrewJGaut AndrewJGaut merged commit efd99ac into master Oct 15, 2022
@AndrewJGaut AndrewJGaut deleted the fix/4196-speed-up-search-bundles branch October 15, 2022 21:12
@wwwjn wwwjn mentioned this pull request Oct 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

search_bundles is extremely slow
4 participants